Expert Systems with Applications

Volume 45, 1 March 2016, Pages 131-141
Weighted joint-based human behavior recognition algorithm using only depth information for low-cost intelligent video-surveillance system

https://doi.org/10.1016/j.eswa.2015.09.035

Highlights

  • Human joint estimation and behavior recognition algorithms are presented.

  • Only depth information is used, and the algorithms run on a low-cost computing platform.

  • The proposed system can be used with any subject instantly without pre-calibration.

  • Experiments to verify the proposed algorithms have been conducted.

Abstract

Recent advances in 3D depth sensors have created many opportunities for security, surveillance, and entertainment. 3D depth sensors enable more powerful monitoring systems that detect dangerous situations irrespective of lighting conditions in buildings or production facilities. To robustly recognize emergency actions or hazardous situations of workers at a production facility, this paper presents human joint estimation and behavior recognition algorithms that use depth information alone. To estimate human joints on a low-cost computing platform, we propose a human joint estimation algorithm that integrates a geodesic graph and a support vector machine (SVM). Human feature points are extracted within a range of geodesic distances on the geodesic graph, and the graph is also used to optimize the estimation result. The SVM-based human joint estimator uses randomly selected human features to reduce computation. Body parts that typically involve many motions are then estimated from the geodesic distance value. The proposed algorithm works for any person without calibration, so the system can be used with any subject immediately, even on a low-cost computing platform. The behavior recognition algorithm, in turn, should have a simple behavior registration process and be robust to environmental changes. To meet these goals, we propose a template matching-based behavior recognition algorithm. Our method creates a behavior template set that consists of weighted human joint data with scale and rotation invariant properties; a single behavior template consists of the joint information estimated per frame. Additionally, we propose adaptive template rejection and a sliding window filter to prevent misrecognition between similar behaviors. The human joint estimation and behavior recognition algorithms are evaluated individually through several experiments, and their performance is demonstrated through comparison with other algorithms. The experimental results show that our method performs well and is applicable in real environments.

Introduction

In recent years, 3D depth information-based human behavior recognition with human joints has become an important topic in human computer interaction (Aggarwal & Xia, 2014). Recent advances in 3D depth sensors such as Microsoft Kinect have created many opportunities for security, surveillance, and entertainment (Zhang, 2012). Among these applications, video-surveillance has been extensively studied. To maintain the security of both people and infrastructure, new technologies are contributing to the realization of more powerful systems that detect dangerous situations (Castro, Delgado, Medina, & Ruiz-Lozano, 2011). To detect dangerous situations irrespective of lighting conditions in buildings or production facilities, researchers have extensively attempted to use depth information from 3D sensors. Kinect is a low-cost 3D depth sensor based on structured light technology, but it is limited to indoor use (Freedman, Shpunt, Machline, & Arieli, 2008). A structured light sensor infers the depth at any location by projecting a known infrared light pattern onto a scene and evaluating the distortion of the projected pattern. Despite this limitation, there are promising avenues for improving current methods of human behavior recognition for surveillance systems (Escalera, 2012; Ren, Yuan, Meng, & Zhang, 2013; Schwarz, Mkhitaryan, Mateus, & Navab, 2012).
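As a rough aside (not from the paper), this depth inference can be modeled as standard stereo triangulation between the projector and the infrared camera: a pattern feature shifted by disparity d maps to depth z = f·b/d. The sketch below illustrates the relation; the parameter values are hypothetical.

```python
# Illustrative sketch (not from the paper): depth from structured-light
# disparity via triangulation between a projector-camera pair.
# All parameter values below are hypothetical.

def depth_from_disparity(disparity_px: float,
                         focal_length_px: float = 580.0,
                         baseline_m: float = 0.075) -> float:
    """Triangulated depth z = f * b / d for a projector-camera pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px

# A pattern feature shifted by 10 px maps to roughly 4.35 m.
print(depth_from_disparity(10.0))
```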

To robustly recognize emergency actions or dangerous situations of workers at a production facility, we develop a human joint-based behavior recognition algorithm that uses depth information only. The human joint estimation algorithm can recognize human behavior easily even in complex environments such as offices and factories. Han, Shao, Xu, and Shotton (2013) briefly introduced recent developments in Kinect sensor-based technologies, including human joint estimation and behavior recognition algorithms. Traditional human joint estimation methods can be classified into model-based and model-free algorithms, depending upon whether a priori information about the object shape is employed (Poppe, 2007). There is a vast body of research on human joint estimation, and the area has been surveyed by Escalera (2012), Poppe (2007), Moeslund, Hilton, and Krüger (2006), and Shotton et al. (2013). Current human joint estimation methods fall into three categories.

The first category is graph-based approaches. Most graph-based approaches use a geodesic distance for graph generation: the geodesic distance is calculated along the graph edges, as opposed to the Euclidean distance, which ignores the graph structure (Alpaydin, 2004); a minimal sketch of this computation follows this survey. A graph representation makes it easy to represent 3D information, and the noise of the 3D data can be reduced using various optimization techniques. Plagemann, Ganapathi, Koller, and Thrun (2010) detected interest points as geodesic distance maxima on a 3D point-cloud mesh; these coincide with salient body parts such as the hands, feet, and head, which can then be classified using local shape descriptors. Visutsak and Prachumrak (2011) generated the joints of a 3D meshed model in a Riemannian space, based on Blum's medial axis transform and a geodesic distance algorithm. Schwarz et al. (2012) proposed a full-body joint estimation algorithm that robustly detects anatomical landmarks in a geodesic graph computed from depth information and fits a skeleton body model using constrained inverse kinematics. Another graph-based algorithm uses a skeletal graph extracted from a volumetric representation of the human body (Straka, Hauswiesner, Rüther, & Bischof, 2011); the skeletal graph is a tree that has the same topology as the human body (arms, legs, and torso).

The second category is machine learning-based approaches: multi-class problems such as human joint estimation can be solved effectively with various machine learning algorithms. After the publication of Shotton et al. (2011), several studies extended this work or focused on efficient use of parallel processing. Rogez, Rihan, Orrite-Uruñuela, and Torr (2012) proposed a multi-class joint detector using random forests that classifies joints based on histograms of oriented gradient features; random forests are a combination of decision tree predictors (Breiman, 2001). Hernández-Vela et al. (2012) extended the work of Shotton et al. (2011) using graph-cut optimization, an energy minimization framework that has been widely applied in image segmentation. Buys et al. (2014) estimated human joints with a random forest algorithm from RGB-D sensor information; their system adapts online to difficult unstructured scenes taken from a moving camera and thus does not require background subtraction.
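As referenced above, the following minimal sketch illustrates the geodesic-distance computation that graph-based approaches rely on. The point graph and Dijkstra traversal are generic textbook components, not the implementation of any cited work.

```python
# Minimal sketch of geodesic distance on a point graph (generic, not the
# paper's implementation): nodes are 3D points, edges connect spatial
# neighbors, and geodesic distance is the shortest path along edges.
import heapq
import math

def geodesic_distances(points, edges, source):
    """Dijkstra over an undirected graph; edges are index pairs into points."""
    adj = {i: [] for i in range(len(points))}
    for i, j in edges:
        w = math.dist(points[i], points[j])  # edge weight = Euclidean length
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = [math.inf] * len(points)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Toy body-like chain: torso -> shoulder -> elbow -> hand.
pts = [(0, 0, 0), (0.2, 0.3, 0), (0.45, 0.3, 0), (0.7, 0.3, 0)]
edg = [(0, 1), (1, 2), (2, 3)]
print(geodesic_distances(pts, edg, source=0))  # the hand has the largest value
```

Extremities such as hands and feet are geodesic maxima from the torso, which is why several of the surveyed methods use this quantity to locate them.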

Other notable approaches are as follows. Jain, Subramanian, Das, and Mittal (2011) proposed an upper-body joint estimation algorithm using a weighted distance transform map and human joint ratios. del Rincón, Makris, Uruñuela, and Nebel (2011) introduced a framework for visual tracking of lower body parts using Kalman and particle filters. Sheasby, Warrell, Zhang, Crook, and Torr (2012) proposed a formulation that solves human segmentation and joint estimation with a single energy function. Zhang, Soon Seah, Kwang Quah, and Sun (2013) introduced a generative sampling algorithm with a local-optimization refinement step for body joint tracking; this multi-layer search method does not rely on strong motion priors and generalizes well to general human motions. Tran and Trivedi (2012) presented upper body pose tracking using upper body extremities and a 3D kinematic model of the upper body with multiple cameras. Toshev and Szegedy (2014) introduced a deep neural network-based human joint estimation algorithm for RGB images that can extract human pose information irrespective of clothing style and body type. Jain, Tompson, LeCun, and Bregler (2014) also proposed a deep learning-based human joint estimation algorithm together with a new human body pose dataset (FLIC-motion); their algorithm uses RGB and optical flow information as input to a convolutional neural network. According to the survey in Shotton et al. (2013), Shotton's results represent the best performance to date, but it is currently difficult to run their algorithms on a low-cost platform. Thus, the conventional algorithms cannot be applied to embedded video-surveillance systems. Behavior recognition algorithms, meanwhile, have been extensively discussed in recent decades and are expected to be the next generation of solutions for human machine interaction (HMI) challenges. With the popularity of 3D depth sensors, many researchers have used depth information and a human joint model for behavior recognition (Celebi, Aydin, Temiz, & Arici, 2013; Lai, Konrad, & Ishwar, 2012; Megavannan, Agarwal, & Babu, 2012; Tran & Trivedi, 2012). However, with most of the proposed algorithms, the human behavior registration process is difficult to perform and the recognition rate is strongly affected by environmental changes (Mitra & Acharya, 2007; Poppe, 2010).

Human behavior recognition methods can likewise be classified into three categories. Machine learning-based approaches belong to the first category. Sigalas, Baltzakis, and Trahanias (2010) presented upper body part tracking and combined multi-layer perceptron and radial basis function neural network classifiers for human behavior recognition. Biswas and Basu (2011) proposed a human behavior recognition algorithm using a support vector machine (SVM) with depth difference information. Dubey, Ni, and Moulin (2012) introduced a fall recognition system using an SVM with RGB-D information. Lai et al. (2012) proposed a close-range human behavior recognition algorithm using feature vectors from a human joint model and nearest-neighbor classification. Liu and Shao (2013) introduced an adaptive learning method with spatio-temporal features that simultaneously fuses RGB and depth information for hand gesture recognition; they also proposed a restricted graph-based genetic programming approach to evolve discriminative spatio-temporal features for visual recognition tasks. Wu and Shao (2014) proposed an action recognition algorithm using human pose information obtained by a deep neural network; using a hidden Markov model-based hierarchical parametric model, they showed improved action recognition performance. Fan et al. (2015) proposed a three-dimensional human activity recognition algorithm with spatio-temporal local texture features, estimating the human action with k-nearest neighbor and hidden Markov model algorithms on integrated features from a local binary pattern operator. This first category requires training time to generate the classifier.

The second category includes matching-based approaches. Megavannan et al. (2012) presented human action recognition using the motion dynamics of an object from depth difference and average depth information. Wu, Konrad, and Ishwar (2013) proposed a dynamic time-warping-based user identification and gesture recognition framework using human joint data. Celebi et al. (2013) proposed a weighted dynamic time-warping method that weights joints by optimizing a discriminant ratio (a minimal sketch of weighted dynamic time warping follows this survey). These methods exhibit different performance depending on the environment in which they are used; their computation cost, however, is lower than that of other methods.

The remaining studies belong to the last category. Reale, Canavan, Yin, Hu, and Hung (2011) presented a human computer interaction system that integrates control components using multiple behaviors, including eye gaze, head pose, hand pointing, and mouth motion. Xu and Lee (2012) presented a hand gesture recognition algorithm using a hidden Markov model and a fuzzy neural network. Song, Demirdjian, and Davis (2012) proposed a continuous body and hand gesture recognition algorithm: they generate 3D body postures with a generative model-based approach using a particle filter and estimate hand position as input features for SVM-based behavior recognition. Yang, Jang, Beh, Han, and Ko (2012) proposed stochastic hand tracking and a hidden Markov model-based behavior recognition algorithm. Wang, Liu, Wu, and Yuan (2012) proposed an actionlet ensemble model trained to represent each action and capture the intra-class variance for human action recognition. Song, Chandra, and Torresen (2013) introduced a behavior recognition algorithm using an ant learning algorithm, which aims to reduce the number of training instances while maintaining high recognition accuracy. Yang, Zicheng, and Hong (2013) proposed a human activity recognition algorithm using human interest points from RGB and depth information.
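As noted above, here is a minimal weighted dynamic time warping sketch to make the matching-based category concrete. The per-joint weights and frame distance are illustrative choices, not the formulation of Celebi et al. (2013).

```python
# Minimal weighted DTW sketch (illustrative, in the spirit of the
# matching-based category; not any cited method's exact formulation).
# Each frame is a list of per-joint 3D positions; joints with higher
# weights dominate the matching cost.
import math

def frame_dist(a, b, weights):
    """Weighted sum of per-joint Euclidean distances between two frames."""
    return sum(w * math.dist(p, q) for p, q, w in zip(a, b, weights))

def weighted_dtw(seq_a, seq_b, weights):
    """Classic DTW recurrence with the weighted frame distance as cost."""
    n, m = len(seq_a), len(seq_b)
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_dist(seq_a[i - 1], seq_b[j - 1], weights)
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]

# Two joints (e.g., hand and hip); the hand is weighted heavily, so
# sequences differing mainly in hip motion would still match closely.
a = [[(0.0, 0, 0), (0, -1, 0)], [(0.5, 0, 0), (0, -1, 0)]]
b = [[(0.0, 0, 0), (0, -1, 0)], [(0.6, 0, 0), (0, -1, 0)]]
print(weighted_dtw(a, b, weights=[0.9, 0.1]))
```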

Since the existing methods cannot be directly applied to video-surveillance systems, this paper presents human joint estimation and behavior recognition algorithms that resolve these problems. For human joint estimation, the aim is operation on a low-cost platform without per-subject calibration, so that the system can be used with any subject immediately. To this end, we propose a method that combines a geodesic graph and an SVM-based human joint estimator using depth information only. The proposed method uses a small number of randomly selected human feature points on a geodesic graph to estimate body parts. Body parts that typically involve a lot of movement are then estimated from the value of the geodesic distance. The behavior recognition algorithm, in turn, should have a simple behavior registration process and be robust to environmental changes. To achieve these goals, we propose a weighted joint-based behavior recognition algorithm using human joint data. This algorithm creates a behavior template set that consists of weighted human joint data with scale and rotation invariant properties.
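The following sketch shows one common way to obtain scale and rotation invariance for joint templates: torso-centering, aligning the shoulder line, and normalizing by body proportion. This is an assumption about the kind of normalization involved, not the paper's exact procedure, and the joint names are illustrative.

```python
# Hypothetical normalization sketch (an assumption, not the paper's exact
# procedure): joints are translated to a torso-centered frame, rotated
# about the vertical axis so the shoulder line aligns with the x-axis,
# and scaled by the torso-to-head distance so body size cancels out.
import math

def normalize_pose(joints):
    """joints: dict name -> (x, y, z); returns a normalized copy."""
    tx, ty, tz = joints["torso"]
    centered = {k: (x - tx, y - ty, z - tz) for k, (x, y, z) in joints.items()}
    # Yaw rotation so the L/R shoulder line lies along the x-axis.
    lx, _, lz = centered["l_shoulder"]
    rx, _, rz = centered["r_shoulder"]
    angle = math.atan2(rz - lz, rx - lx)
    c, s = math.cos(-angle), math.sin(-angle)
    rotated = {k: (x * c - z * s, y, x * s + z * c)
               for k, (x, y, z) in centered.items()}
    # Scale by the torso-to-head distance (guard against zero).
    scale = math.dist(rotated["head"], (0.0, 0.0, 0.0)) or 1.0
    return {k: (x / scale, y / scale, z / scale)
            for k, (x, y, z) in rotated.items()}
```

Templates normalized this way compare poses rather than absolute positions, so the same gesture performed by a taller subject, or while facing a different direction, yields approximately the same feature values.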

The main contributions of this paper are three-fold. First, the proposed human joint estimation algorithm does not require any calibration process, unlike OpenNI and the Microsoft Kinect SDK (Kinect for Windows). This means that the human joints can be estimated without delay, making the proposed algorithm suitable for recognizing emergency actions. Second, the proposed algorithms are designed to operate on low-cost systems such as embedded boards and mobile platforms without exploiting GPUs (graphics processing units). Lastly, the proposed behavior recognition algorithm has a simple gesture registration process with scale and rotation invariant properties; thus, this method can easily be applied in various indoor facilities.

The remaining sections are organized as follows. In Sections 2 and 3, the proposed human joint estimation algorithm and behavior recognition algorithm are described in detail. In Section 4, experimental evaluations are provided. Finally, we present concluding remarks in Section 5.

Section snippets

Human joint estimation algorithm

Our human joint estimation algorithm utilizes geodesic distance and an SVM, and operates solely on depth information without any RGB data. For comparison with OpenNI, we divide the human body into 15 parts: head, neck, torso, L/R (left/right) shoulders, L/R elbows, L/R hands, L/R hips, L/R knees, and L/R feet. The proposed algorithm locates these 15 body parts in depth images.
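The 15 parts map naturally onto a small enumeration. The sketch below is only a data-structure illustration; the parent table describing the kinematic tree is our assumption (the paper lists the parts to match OpenNI's layout but this snippet does not give the tree).

```python
# The 15 body parts located by the algorithm, as a simple enumeration.
# The parent links form a plausible kinematic tree (our assumption).
JOINTS = ["head", "neck", "torso",
          "l_shoulder", "r_shoulder", "l_elbow", "r_elbow",
          "l_hand", "r_hand", "l_hip", "r_hip",
          "l_knee", "r_knee", "l_foot", "r_foot"]

PARENT = {"head": "neck", "neck": "torso",
          "l_shoulder": "neck", "r_shoulder": "neck",
          "l_elbow": "l_shoulder", "r_elbow": "r_shoulder",
          "l_hand": "l_elbow", "r_hand": "r_elbow",
          "l_hip": "torso", "r_hip": "torso",
          "l_knee": "l_hip", "r_knee": "r_hip",
          "l_foot": "l_knee", "r_foot": "r_knee"}

assert len(JOINTS) == 15  # matches the 15-part decomposition above
```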

A flow diagram of the proposed human joint estimation algorithm is provided in Fig. 1. Since our video-surveillance …

Human behavior recognition algorithm

An overview of the proposed algorithm is illustrated in Fig. 4. To recognize behaviors robustly under illumination changes, the proposed algorithm uses human joint data and depth information. When a behavior template is created from human joint data, the depth information of the input data is used to identify scale and rotation changes. For effective behavior recognition, we apply a weight to each joint during behavior template creation, based on the various actions. The next step …
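The snippet is truncated here, so as a generic illustration only, the following sketch shows a majority-vote sliding window over per-frame behavior labels — the general idea behind the sliding window filter mentioned in the abstract. The paper's actual filter and its parameters may differ; the window size and threshold below are hypothetical.

```python
# Generic sliding-window majority filter over per-frame behavior labels
# (our assumption of the general idea; not the paper's exact filter).
# A behavior is reported only once it wins a sufficient fraction of
# votes within the window, suppressing one-frame misrecognitions.
from collections import Counter, deque

class SlidingWindowFilter:
    def __init__(self, window_size=15, min_ratio=0.6):
        self.window = deque(maxlen=window_size)
        self.min_ratio = min_ratio

    def update(self, label):
        """Push one per-frame label; return a stable label or None."""
        self.window.append(label)
        best, count = Counter(self.window).most_common(1)[0]
        if count / self.window.maxlen >= self.min_ratio:
            return best
        return None

f = SlidingWindowFilter(window_size=5, min_ratio=0.6)
for frame_label in ["wave", "wave", "fall", "wave", "wave"]:
    print(f.update(frame_label))  # stabilizes to "wave" once 3/5 agree
```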

Experiments

In this section, we describe the experiments performed to evaluate our human joint estimation and behavior recognition methods. We begin by describing the performance of the human joint estimation algorithm in comparison with OpenNI, using ground-truth joint information from a motion capture system. The scalable markerless motion capture system uses 3D depth sensors to track human joints and produce 3D animation (iPi Soft LLC, 2011). The motion capture system provides human joint data with …
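As a generic illustration of this kind of evaluation (not the paper's exact metric, which the snippet does not show), the sketch below computes the mean per-joint Euclidean error of estimated joints against motion-capture ground truth.

```python
# Generic per-joint error metric against motion-capture ground truth
# (our assumption of a standard evaluation, not the paper's exact metric).
import math

def mean_joint_error(estimated_frames, ground_truth_frames):
    """Both args: list of dicts joint -> (x, y, z), one dict per frame."""
    totals, counts = {}, {}
    for est, gt in zip(estimated_frames, ground_truth_frames):
        for joint, p in est.items():
            if joint in gt:  # skip joints missing from the ground truth
                totals[joint] = totals.get(joint, 0.0) + math.dist(p, gt[joint])
                counts[joint] = counts.get(joint, 0) + 1
    return {j: totals[j] / counts[j] for j in totals}

est = [{"head": (0.0, 1.60, 2.0)}]
gt = [{"head": (0.0, 1.65, 2.0)}]
print(mean_joint_error(est, gt))  # {'head': ~0.05} metres of error
```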

Conclusion

In this paper, we have presented human joint estimation and behavior recognition algorithms using depth information only. For the human joint estimation algorithm, we proposed a method that integrates a geodesic graph and an SVM-based human joint estimator. The SVM-based human joint estimator uses a small number of randomly selected human features. The proposed algorithm can be executed on a low-cost platform, and the system can be used with any subject immediately without prior calibration.

Acknowledgment

This research was financially supported by Samsung S1 Corporation. This work was also supported in part by the Technology Innovation Program (10045252, Development of Robot Task Intelligence Technology) funded by the Ministry of Trade, Industry, and Energy (MOTIE, Korea). The students were supported by the Ministry of Land, Infrastructure and Transport (MoLIT) under the U-City Master and Doctor Course Grant Program.

References (57)

  • Biswas, K., et al. (2011). Gesture recognition using Microsoft Kinect®. In Proceedings of the international conference on automation, robotics and applications (ICARA).
  • Boykov, Y., et al. (2006). Graph cuts and efficient N-D image segmentation. International Journal of Computer Vision.
  • Boykov, Y., et al. (2006). Graph cuts in vision and graphics: Theories and applications.
  • Boykov, Y. Y., et al. (2001). Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proceedings of the IEEE international conference on computer vision (ICCV).
  • Breiman, L. (2001). Random forests. Machine Learning.
  • Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery.
  • Celebi, S., et al. (2013). Gesture recognition using skeleton data with weighted dynamic time warping. In Proceedings of the international conference on computer vision theory and applications (VISAPP).
  • Cortes, C., et al. (1995). Support-vector networks. Machine Learning.
  • Dubey, R., et al. (2012). A depth camera based fall recognition system for the elderly. In Proceedings of the international conference on image analysis and recognition.
  • Escalera, S. (2012). Human behavior analysis from depth maps. In International conference on articulated motion and deformable objects.
  • Fan, C., et al. (2015). 3D human behavior recognition based on spatiotemporal texture features. In Proceedings of the international conference on human system interactions (HSI).
  • Freedman, B., Shpunt, A., Machline, M., & Arieli, Y. (2008). Depth mapping using projected patterns. WO Patent...
  • Han, J., et al. (2013). Enhanced computer vision with Microsoft Kinect sensor: A review. IEEE Transactions on Cybernetics.
  • Hernández-Vela, A., et al. (2012). Graph cuts optimization for multi-limb human segmentation in depth maps. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR).
  • Hsu, C.-W., et al. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks.
  • Jain, A., et al. (2014). MoDeep: A deep learning framework using motion features for human pose estimation. In Proceedings of the 12th Asian conference on computer vision.
  • Jain, H. P., et al. (2011). Real-time upper-body human pose estimation using a depth camera. In Proceedings of the international conference on computer vision/computer graphics collaboration techniques.
  • Lai, K., et al. (2012). A gesture-driven computer interface using Kinect. In Proceedings of the IEEE southwest symposium on image analysis and interpretation (SSIAI).