Elsevier

Pattern Recognition

Volume 60, December 2016, Pages 86-105

RGB-D-based action recognition datasets: A survey

https://doi.org/10.1016/j.patcog.2016.05.019

Highlights

  • A detailed review and in-depth analysis of 44 publicly available RGB-D-based action datasets.

  • Recommendations on the selection of datasets and evaluation protocols for use in future research.

  • Identification of some limitations of these datasets and evaluation protocols.

  • Recommendations on future creation of datasets and use of evaluation protocols.

Abstract

Human action recognition from RGB-D (Red, Green, Blue and Depth) data has attracted increasing attention since the first work was reported in 2010. Over this period, many benchmark datasets have been created to facilitate the development and evaluation of new algorithms. This raises the question of which dataset to select, and how to use it to provide a fair and objective comparison against state-of-the-art methods. To address this issue, this paper provides a comprehensive review of the most commonly used RGB-D video datasets for action recognition, comprising 27 single-view datasets, 10 multi-view datasets, and 7 multi-person datasets. The detailed information on, and analysis of, these datasets are a useful resource for guiding the selection of datasets in future research. In addition, issues with current algorithm evaluation vis-à-vis the limitations of the available datasets and evaluation protocols are highlighted, resulting in a number of recommendations for the collection of new datasets and the use of evaluation protocols.

Introduction

Human action recognition is an active research topic in computer vision. Prior to the release of Microsoft Kinect™, research had mainly focused on learning and recognising actions from conventional two-dimensional (2D) video [1], [2], [3], [4]. Many 2D video datasets dedicated to action recognition are publicly available, and review papers categorising and summarising their characteristics help researchers evaluate their algorithms [5], [6], [7]. The introduction of low-cost integrated depth sensors such as the Microsoft Kinect™, which capture both RGB (red, green and blue) video and depth (D) information, has significantly advanced research on human action recognition. Since the first work reported in 2010 [8], many benchmark datasets have been created to facilitate the development and evaluation of new action recognition algorithms. However, the available RGB-D datasets have thus far only been briefly summarised or enumerated, without comprehensive coverage or in-depth analysis, in survey papers such as [9], [10] that focus mainly on the development of RGB-D-based action recognition algorithms. This lack of a comprehensive review of RGB-D datasets motivated the present paper.

Datasets are important for the rapid development, objective evaluation and comparison of algorithms. To this end, they should be carefully created or selected to ensure effective evaluation of the validity and efficacy of any algorithm under investigation. The evaluation of a task-specific algorithm depends not only on the underlying methods but also on the factors captured by the dataset. However, it is currently difficult to select the most appropriate dataset from among the many Kinect-captured RGB-D datasets available, and to establish the most appropriate evaluation protocol. There is also a risk of creating new but redundant datasets in the absence of a comprehensive survey of what is already available. This paper fills the gap by providing a comprehensive summary and analysis of existing RGB-D action datasets and of the evaluation protocols that have been used with them.

The paper focuses on action and activity datasets. Gesture datasets are excluded from this survey because, unlike actions and activities, which usually involve motion of the entire human body, gestures involve only hand movement, and gesture recognition is often treated as a research topic independent of action and activity recognition. For details of the available gesture datasets, readers are referred to the survey by Ruffieux et al. [7].

The rest of the survey is organised as follows. Section 2 summarises the characteristics of publicly available and commonly used RGB-D datasets; the 44 summaries are categorised into single-view action/activity datasets, multi-view action/activity datasets, and interaction/multi-person activity datasets. Section 3 provides a comparative analysis of the reviewed datasets with regard to applications, complexity, state-of-the-art results, and commonly employed evaluation protocols, together with recommendations to guide the future use of datasets and evaluation protocols. Section 4 discusses the limitations of current RGB-D action datasets and of the commonly used evaluation methods, and offers recommendations on requirements for the future creation of datasets and the selection of evaluation protocols. Section 5 draws a brief conclusion.

Section snippets

RGB-D action/activity datasets

This section summarises most of the publicly available RGB-D action datasets, covering, for each dataset, the creation date, creating institution, number of actions, number of subjects involved, number of repetitions per action, action classes, total number of video samples, capture settings, and the background and environment.
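To make these per-dataset attributes concrete, the following minimal Python sketch shows one way to represent a surveyed dataset as a record. The class and field names are illustrative rather than taken from the paper, and the example values follow the commonly cited statistics of the MSR-Action3D dataset [8]; they should be checked against the survey's own tables.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RGBDActionDataset:
    """One record per surveyed dataset, mirroring the attributes listed above."""
    name: str
    year: int                 # creation date
    institution: str          # creating institution
    num_actions: int          # number of action classes
    num_subjects: int         # number of subjects involved
    max_repetitions: int      # how many times each subject repeats an action
    num_samples: int          # total number of video samples
    capture_setting: str      # e.g. "single-view", "multi-view", "multi-person"
    environment: str          # background and environment description
    action_classes: List[str] = field(default_factory=list)

# Illustrative entry based on the commonly cited MSR-Action3D statistics [8]:
msr_action3d = RGBDActionDataset(
    name="MSR-Action3D", year=2010, institution="Microsoft Research",
    num_actions=20, num_subjects=10, max_repetitions=3, num_samples=567,
    capture_setting="single-view", environment="fixed indoor background",
)
```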

The datasets are categorised into three classes, namely single-view action/activity, multi-view action/activity, and human–human interaction/multi-person activity. In the single-view action/activity…

Analysis

The analysis presented in this section is framed by consideration of (i) the category of application scenarios, (ii) the characteristics of dataset acquisition and presentation format, (iii) the dependence of algorithm evaluation on dataset acquisition modes, (iv) the complexity of the environmental factors inherent in each dataset, (v) the evaluation protocols commonly used for algorithm development and testing, and (vi) the state-of-the-art results obtained to date with the datasets. Naturally, the discussions…
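As a concrete illustration of item (v), the protocols most often paired with these datasets are a fixed cross-subject split and leave-one-subject-out cross-validation. The Python sketch below shows both; the sample representation (dicts keyed by "subject") and the function names are our own illustrative assumptions, not an API from any of the surveyed works.

```python
from typing import Dict, Iterable, List, Tuple

# One record per video clip; the "subject" key identifies the performer.
Sample = Dict[str, object]

def cross_subject_split(
    samples: List[Sample], train_subjects: Iterable[int]
) -> Tuple[List[Sample], List[Sample]]:
    """Fixed cross-subject protocol: clips from the designated subjects are
    used for training and all remaining clips for testing, so no performer
    appears in both sets."""
    train_ids = set(train_subjects)
    train = [s for s in samples if s["subject"] in train_ids]
    test = [s for s in samples if s["subject"] not in train_ids]
    return train, test

def leave_one_subject_out(samples: List[Sample]):
    """Leave-one-subject-out cross-validation: each subject is held out for
    testing exactly once; the reported accuracy is averaged over the folds."""
    subjects = sorted({s["subject"] for s in samples})  # distinct performer IDs
    for held_out in subjects:
        train = [s for s in samples if s["subject"] != held_out]
        test = [s for s in samples if s["subject"] == held_out]
        yield held_out, train, test

# Example: the split widely used with MSR-Action3D trains on the odd-numbered
# subjects and tests on the even-numbered ones:
# train, test = cross_subject_split(samples, train_subjects={1, 3, 5, 7, 9})
```

Fixing the split, rather than sampling it randomly per paper, is what makes reported results comparable across publications; this is one of the protocol issues revisited in Section 4.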

Discussion

In this section, we point out the limitations of both the current RGB-D action datasets and the evaluation protocols commonly used for action recognition. Our aim is to provide guidance on the future creation of datasets and the establishment of standard evaluation protocols for specific purposes.

Conclusion

A comprehensive review of commonly used, publicly available RGB-D-based datasets for action recognition has been provided. The detailed descriptions and analyses, highlighting their characteristics and potential applications, should be useful to researchers designing action recognition algorithms, especially when selecting datasets for algorithm development and evaluation, or when creating new datasets to fill identified gaps. Most of the datasets collected to date are meant…

References (112)

  • Y. Zhu et al., Evaluating spatiotemporal interest point features for depth-based action recognition, Image Vis. Comput. (2014)
  • J. Beh et al., Hidden Markov model on a unit hypersphere space for gesture trajectory recognition, Pattern Recognit. Lett. (2014)
  • Z. Gao et al., Multi-perspective and multi-modality joint representation and recognition model for 3D action recognition, Neurocomputing (2015)
  • S. Vishwakarma et al., A survey on activity recognition and behavior understanding in video surveillance, Vis. Comput. (2013)
  • T. Hassner, A critical review of action recognition benchmarks, in: Proceedings of the IEEE Conference on Computer...
  • S. Ruffieux, D. Lalanne, E. Mugellini, O.A. Khaled, A survey of datasets for human gesture recognition, in: M. Kurosu...
  • W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3D points, in: Proceedings of the IEEE Computer Society...
  • R. Lun, W. Zhao, A survey of applications and human motion recognition with Microsoft Kinect, Int. J. Pattern Recognit....
  • B. Ni, G. Wang, P. Moulin, RGBD-HuDaAct: A color-depth video database for human daily activity recognition, in:...
  • J. Sung, C. Ponce, B. Selman, A. Saxena, Human activity detection from RGBD images, in: Proceedings of the AAAI...
  • S. Fothergill, H. Mentis, P. Kohli, S. Nowozin, Instructing people for training gestural interactive systems, in:...
  • J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in: Proceedings...
  • L. Xia, C.C. Chen, J. Aggarwal, View invariant human action recognition using histograms of 3D joints, in: Proceedings...
  • V. Bloom, D. Makris, V. Argyriou, G3D: A gaming action dataset and real time action recognition evaluation framework,...
  • V. Bloom, V. Argyriou, D. Makris, Dynamic feature selection for online action recognition, in: Human Behavior...
  • Y.C. Lin, M.C. Hu, W.H. Cheng, Y.H. Hsieh, H.M. Chen, Human action recognition and retrieval using sole depth...
  • M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, in: Proceedings of the IEEE...
  • C. Zhang, Y. Tian, RGB-D camera-based daily living activity recognition, J. Comput. Vis. Image Process. 2 (4), 2012,...
  • O. Oreifej, Z. Liu, HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences, in:...
  • H.S. Koppula et al., Learning human activities and object affordances from RGB-D videos, Int. J. Robot. Res. (2013)
  • F. Negin, F. Özdemir, C.B. Akgül, K.A. Yüksel, A. Erçil, A decision forest based feature selection framework for...
  • P. Wei, N. Zheng, Y. Zhao, S.-C. Zhu, Concurrent action detection with structural prediction, in: Proceedings of the...
  • M. Munaro, S. Michieletto, E. Menegatti, An evaluation of 3D motion flow and 3D pose estimation for human action...
  • C. Ellis et al., Exploring the trade-off between accuracy and observational latency in action recognition, Int. J. Comput. Vis. (2013)
  • A. Mansur et al., Inverse dynamics for action recognition, IEEE Trans. Cybern. (2013)
  • M. Karg, A. Kirsch, Simultaneous plan recognition and monitoring (SPRAM) for robot assistants, in: Proceedings of the...
  • Y. Zhao et al., RGB-depth feature for 3D human activity recognition, China Commun. (2013)
  • V. Carletti, P. Foggia, G. Percannella, A. Saggese, M. Vento, Recognition of human actions from RGB-D videos using a...
  • A. Liu, W. Nie, Y. Su, L. Ma, T. Hao, Z. Yang, Coupled hidden conditional random fields for RGB-D human action...
  • D. Huang, S. Yao, Y. Wang, F. De la Torre, Sequential max-margin event detectors, in: Computer Vision—ECCV 2014, Lecture...
  • I. Lillo, A. Soto, J.C. Niebles, Discriminative hierarchical modeling of spatio-temporally composable human activities,...
  • G. Yu, Z. Liu, J. Yuan, Discriminative orderlet mining for real-time recognition of human-object interaction, in:...
  • C. Wu, J. Zhang, S. Savarese, A. Saxena, Watch-n-patch: Unsupervised understanding of actions and relations, in:...
  • J.-F. Hu, W.-S. Zheng, J. Lai, J. Zhang, Jointly learning heterogeneous features for RGB-D activity recognition, in:...
  • C. Chen, R. Jafari, N. Kehtarnavaz, UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera...
  • Z. Cheng, L. Qin, Y. Ye, Q. Huang, Q. Tian, Human daily action analysis with multi-view and color-depth data, in:...
  • Z. Zhang, W. Liu, V. Metsis, V. Athitsos, A viewpoint-independent statistical method for fall detection, in:...
  • F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, R. Bajcsy, Berkeley MHAD: A comprehensive multimodal human action database,...
  • S.M. Amiri, M.T. Pourazad, P. Nasiopoulos, V.C.M. Leung, Non-intrusive human activity monitoring in a smart home...
  • L. Chen et al., ReadingAct RGB-D action dataset and human action recognition from local features, Pattern Recognit. Lett. (2013)

    Jing Zhang received the B.S. degree in electronic science and technology from Nankai University Binhai College, China, in 2010, and the M.S. degree in information and communication engineering from Tianjin University, China, in 2014, and she is currently working toward the Ph.D. degree at the Advanced Multimedia Research Laboratory in the University of Wollongong, Australia. Her research interests include computer vision and pattern recognition.

    Wanqing Li received his Ph.D. in electronic engineering from The University of Western Australia. He is an Associate Professor and Co-Director of the Advanced Multimedia Research Lab (AMRL) at the University of Wollongong, Australia. His research areas are 3D computer vision, 3D multimedia signal processing and medical image analysis. Dr. Li is a Senior Member of the IEEE.

    Philip O. Ogunbona earned his Ph.D., DIC (Electrical Engineering) from Imperial College, London. His Bachelor's degree was in Electronic and Electrical Engineering from the University of Ife, Nigeria (now named Obafemi Awolowo University). He is a Professor and Co-Director of the Advanced Multimedia Research Lab, University of Wollongong, Australia. His research interests include computer vision, pattern recognition and machine learning. He is a Senior Member of the IEEE and a Fellow of the Australian Computer Society.

    Pichao Wang received the B.E. degree in network engineering from Nanchang University, Nanchang, China, in 2010, and the M.S. degree in communication and information system from Tianjin University, Tianjin, China, in 2013. He is currently pursuing the Ph.D. degree with the School of Computer Science and Software Engineering, University of Wollongong, Australia. His current research interests include computer vision and machine learning.

    Chang Tang received the B.Eng. degree from the School of Electronic Engineering, Tianjin University of Technology and Education, Tianjin, China, in 2010. He is currently pursuing the Ph.D. degree in communication engineering at the School of Electronic Information Engineering, Tianjin University. His current research interests include computer vision, computer graphics and 3D image/video processing.
