Fusing Multiple Features for Depth-Based Action Recognition

Published: 31 March 2015
Abstract

Human action recognition is a very active research topic in computer vision and pattern recognition. Recently, the three-dimensional (3D) depth data captured by emerging RGB-D sensors has shown great potential for human action recognition, and several features and algorithms have been proposed for depth-based action recognition. This raises a question: can we find complementary features and combine them to improve the accuracy of depth-based action recognition significantly? To address this question and better understand the problem, we study the fusion of different features for depth-based action recognition. Although data fusion has shown great success in other areas, it has not yet been well studied for 3D action recognition; open issues include whether fusion helps at all for depth-based action recognition, and how to perform the fusion properly. In this article, we comprehensively study different fusion schemes, using diverse features to characterize action patterns in depth videos. Two levels of fusion are investigated, the feature level and the decision level, and various methods are explored at each level. Four different features are considered, characterizing depth action patterns from different aspects. Experiments are conducted on four challenging depth action databases in order to evaluate the fusion methods and identify the best ones. Our results show that the four features complement each other, and that appropriate fusion methods improve recognition accuracy significantly over each individual feature. More importantly, our fusion-based action recognition outperforms state-of-the-art approaches on these challenging databases.
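The abstract distinguishes two generic levels of fusion. As a minimal illustrative sketch (the function names and score values below are hypothetical, not taken from the paper), feature-level fusion can be viewed as concatenating per-feature descriptors into one vector, while decision-level fusion combines per-classifier class scores, for example with the sum or product rules studied in the classifier-combination literature:

```python
def feature_level_fusion(feature_vectors):
    """Feature-level fusion: concatenate several descriptors into one vector."""
    fused = []
    for v in feature_vectors:
        fused.extend(v)
    return fused


def decision_level_fusion(score_lists, rule="sum"):
    """Decision-level fusion: combine per-classifier class-score lists
    with the sum (averaging) or product rule."""
    n_classes = len(score_lists[0])
    if rule == "sum":
        # Average the scores each classifier assigns to each class.
        return [sum(s[c] for s in score_lists) / len(score_lists)
                for c in range(n_classes)]
    elif rule == "product":
        # Multiply the scores each classifier assigns to each class.
        fused = []
        for c in range(n_classes):
            p = 1.0
            for s in score_lists:
                p *= s[c]
            fused.append(p)
        return fused
    raise ValueError("unknown rule: %s" % rule)


# Hypothetical example: two classifiers scoring three action classes.
scores_a = [0.7, 0.2, 0.1]
scores_b = [0.5, 0.4, 0.1]
print(decision_level_fusion([scores_a, scores_b], "sum"))
print(feature_level_fusion([[1.0, 2.0], [3.0]]))
```

The predicted action would then be the class with the highest fused score; the paper's actual fusion methods and features are more elaborate than this sketch.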



Published in

ACM Transactions on Intelligent Systems and Technology, Volume 6, Issue 2
Special Section on Visual Understanding with RGB-D Sensors
May 2015, 381 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/2753829
Editor: Huan Liu

                    Copyright © 2015 ACM


                    Publisher

                    Association for Computing Machinery

                    New York, NY, United States

                    Publication History

                    • Published: 31 March 2015
                    • Accepted: 1 March 2014
                    • Revised: 1 December 2013
                    • Received: 1 July 2013
Published in TIST Volume 6, Issue 2


                    Qualifiers

                    • research-article
                    • Research
                    • Refereed
