Abstract
In recent years, human activity recognition from video has attracted considerable attention from computer vision researchers because of its applications in fields such as surveillance, human-computer interaction, and smart home healthcare. For instance, activity recognition can be used in a surveillance environment to alert the relevant authority to potentially dangerous behavior. Similarly, it can improve human-computer interaction (HCI) in an entertainment environment, for example by automatically recognizing a player's actions in a game so that an avatar can play on the player's behalf. In a healthcare system, recognizing patients' actions can facilitate their rehabilitation. One prominent goal of a video-based activity recognition system is to provide information about people's behavior so that the system can proactively assist them with their tasks. This work proposes a novel approach to depth video-based human activity recognition that uses joint-based spatiotemporal features of depth body shapes and hidden Markov models. From the depth video, the different body parts are first segmented using a trained random forest. Spatial features, consisting of the 3-D body joint pair angles together with the mean and variance of the depth values and the area of each segmented body part, are combined with motion features, representing the magnitude and direction of each joint's movement in the next frame, to build the spatiotemporal features of a frame. The activity features are then enhanced using generalized discriminant analysis, which discriminates them nonlinearly and converts them into more robust features. Finally, the features are used to train a distinct hidden Markov model for each activity, and these models are later used for recognition.
The proposed approach shows superior recognition performance compared with conventional activity recognition approaches.
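The per-frame spatiotemporal feature vector described above can be illustrated with a minimal sketch. This is not the author's implementation: the function names, the array layouts, the use of direction angles for the joint pair angles, and the use of a unit displacement vector for the motion direction are all assumptions made for illustration. The body part label map is assumed to come from an already trained segmenter such as the random forest mentioned above, and the generalized discriminant analysis and hidden Markov model stages are not shown.

```python
import itertools
import numpy as np

def joint_pair_angles(joints):
    """Angles (radians) between the vector joining each joint pair and the x, y, z axes.

    joints: (J, 3) array of 3-D joint positions, J >= 2 (assumed layout).
    Coincident joints give a zero vector, which maps to pi/2 for all three angles.
    """
    feats = []
    for i, j in itertools.combinations(range(len(joints)), 2):
        d = joints[j] - joints[i]
        norm = np.linalg.norm(d)
        cosines = d / norm if norm > 0 else np.zeros(3)
        feats.append(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return np.concatenate(feats)

def part_depth_stats(depth, labels, num_parts):
    """Mean depth, depth variance, and pixel area for body parts 1..num_parts.

    depth:  (H, W) depth image.
    labels: (H, W) integer part labels, 0 meaning background (assumed convention).
    Parts with no pixels in the frame contribute zeros.
    """
    feats = np.zeros((num_parts, 3))
    for k in range(1, num_parts + 1):
        values = depth[labels == k]
        if values.size:
            feats[k - 1] = (values.mean(), values.var(), values.size)
    return feats.ravel()

def joint_motion(joints_t, joints_next):
    """Magnitude and unit direction of each joint's displacement to the next frame."""
    disp = joints_next - joints_t
    mag = np.linalg.norm(disp, axis=1)
    safe = np.where(mag > 0, mag, 1.0)  # avoid dividing by zero for static joints
    direction = disp / safe[:, None]
    return np.concatenate([mag, direction.ravel()])

def frame_features(depth, labels, joints_t, joints_next, num_parts):
    """Concatenate the spatial and motion features of one frame."""
    return np.concatenate([
        joint_pair_angles(joints_t),
        part_depth_stats(depth, labels, num_parts),
        joint_motion(joints_t, joints_next),
    ])
```

With J joints and P body parts, this sketch yields a vector of length 3·J(J−1)/2 + 3P + 4J per frame. In the pipeline described above, such vectors would then pass through generalized discriminant analysis before being used to train the activity hidden Markov models.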
Acknowledgments
This work was supported by the Samsung Research Fund, Sungkyunkwan University, 2015.
About this article
Cite this article
Uddin, M.Z. Human activity recognition using segmented body part and body joint features with hidden Markov models. Multimed Tools Appl 76, 13585–13614 (2017). https://doi.org/10.1007/s11042-016-3742-2
DOI: https://doi.org/10.1007/s11042-016-3742-2