Abstract
Human action recognition is an active research topic in computer vision and pattern recognition. Recently, the three-dimensional (3D) depth data captured by emerging RGB-D sensors has shown great potential for human action recognition, and several features and algorithms have been proposed for depth-based action recognition. This raises a question: can we find complementary features and combine them to improve the accuracy of depth-based action recognition significantly? To address this question and better understand the problem, we study the fusion of different features for depth-based action recognition. Although data fusion has shown great success in other areas, it has not yet been well studied for 3D action recognition. Several issues need to be addressed, for example, whether fusion is helpful for depth-based action recognition, and how to perform the fusion properly. In this article, we comprehensively study different fusion schemes, using diverse features to characterize actions in depth videos. Two levels of fusion are investigated, namely feature level and decision level, and various methods are explored at each level. Four different features are considered to characterize depth action patterns from different aspects. Experiments are conducted on four challenging depth action databases in order to evaluate the fusion methods and identify those that generalize best. Our experimental results show that the four features investigated in the article complement each other, and that appropriate fusion methods improve the recognition accuracies significantly over each individual feature. More importantly, our fusion-based action recognition outperforms the state-of-the-art approaches on these challenging databases.
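The two fusion levels named in the abstract can be illustrated with a minimal sketch. This is a hypothetical example, not the article's actual method: feature-level fusion is shown as concatenation of L2-normalized descriptor vectors, and decision-level fusion as weighted averaging of per-classifier class scores; the function names and the toy inputs are assumptions for illustration only.

```python
import numpy as np

def feature_level_fusion(features):
    """Feature-level fusion (illustrative): L2-normalize each descriptor
    so no single feature dominates, then concatenate into one vector."""
    normed = [np.asarray(f, dtype=float) / (np.linalg.norm(f) + 1e-12)
              for f in features]
    return np.concatenate(normed)

def decision_level_fusion(score_vectors, weights=None):
    """Decision-level fusion (illustrative): combine per-classifier class
    scores by (weighted) averaging; the fused label is the argmax class."""
    scores = np.asarray(score_vectors, dtype=float)
    if weights is None:
        weights = np.full(len(scores), 1.0 / len(scores))
    fused = np.average(scores, axis=0, weights=weights)
    return fused, int(np.argmax(fused))

# Toy example: two hypothetical descriptors extracted from one depth clip.
fused_feat = feature_level_fusion([np.array([3.0, 4.0]),
                                   np.array([1.0, 0.0])])

# Toy example: class scores from two hypothetical classifiers.
fused_scores, label = decision_level_fusion([[0.6, 0.4],
                                             [0.2, 0.8]])
```

In practice, the fused feature vector would be fed to a single classifier (e.g., an SVM), while decision-level fusion keeps one classifier per feature and combines only their outputs; the article evaluates several concrete methods at each level.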