RGB-D-based action recognition datasets: A survey
Introduction
Human action recognition is an active research topic in computer vision. Prior to the release of the Microsoft Kinect™, research mainly focused on learning and recognising actions from conventional two-dimensional (2D) video [1], [2], [3], [4]. Many publicly available 2D video datasets are dedicated to action recognition, and review papers categorising and summarising their characteristics help researchers evaluate their algorithms [5], [6], [7]. The introduction of low-cost integrated depth sensors (such as the Microsoft Kinect™) that capture both RGB (red, green and blue) video and depth (D) information has significantly advanced research on human action recognition. Since the first work reported in 2010 [8], many benchmark datasets have been created to facilitate the development and evaluation of new action recognition algorithms. However, the available RGB-D datasets have so far only been briefly summarised or enumerated, without comprehensive coverage or in-depth analysis, in survey papers such as [9], [10], which focus mainly on the development of RGB-D-based action recognition algorithms. This lack of a comprehensive review of RGB-D datasets motivated the present paper.
Datasets are important for the rapid development, objective evaluation and comparison of algorithms. To this end, they should be carefully created or selected to ensure an effective assessment of the validity and efficacy of any algorithm under investigation. The evaluation of a task-specific algorithm depends not only on the underlying methods but also on the factors captured by the dataset. However, it is currently difficult to select the most appropriate dataset from among the many Kinect-captured RGB-D datasets available, and to establish the most appropriate evaluation protocol. Without a comprehensive survey of what is already available, there is also a risk of creating new but redundant datasets. This paper fills this gap by providing a comprehensive summary and analysis of existing RGB-D action datasets and of the evaluation protocols that have been used with them.
The paper focuses on action and activity datasets. Gesture datasets are excluded from this survey: unlike actions and activities, which usually involve motion of the entire human body, gestures involve only hand movement, and gesture recognition is often treated as a research topic independent of action and activity recognition. For details of the available gesture datasets, readers are referred to the survey by Ruffieux et al. [7].
The rest of the survey is organised as follows. Section 2 summarises the characteristics of publicly available and commonly used RGB-D datasets; the 44 summaries are grouped into single-view action/activity datasets, multi-view action/activity datasets, and interaction/multi-person activity datasets. Section 3 provides a comparative analysis of the reviewed datasets with respect to application scenarios, complexity, state-of-the-art results, and commonly employed evaluation protocols, together with recommendations to guide the future use of datasets and evaluation protocols. Section 4 discusses the limitations of current RGB-D action datasets and of commonly used evaluation methods, and offers recommendations on requirements for the future creation of datasets and selection of evaluation protocols. Section 5 draws a brief conclusion.
Section snippets
RGB-D action/activity datasets
This section summarises most of the publicly available RGB-D action datasets, including the creation date, creating institution, number of actions, number of subjects involved, number of repetitions per action, action classes, total number of video samples, capture settings, background and environment.
The datasets are categorised into three classes, namely single-view action/activity, multi-view action/activity, and human–human interaction/multi-person activity. In the single-view action/activity
Analysis
The analysis presented in this section is framed by consideration of (i) the category of application scenarios, (ii) the characteristics of dataset acquisition and presentation format, (iii) the dependence of algorithm evaluation on dataset acquisition modes, (iv) the complexity of the environmental factors inherent in each dataset, (v) the evaluation protocols commonly used for algorithm development and testing, and (vi) the state-of-the-art results obtained to date with the datasets. Naturally, the discussions
Discussion
In this section, we point out the limitations of both current RGB-D action datasets and the evaluation protocols commonly used for action recognition. Our aim is to provide guidance on the future creation of datasets and the establishment of standard evaluation protocols for specific purposes.
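To make the notion of an evaluation protocol concrete, the split most commonly reported with RGB-D action datasets is a cross-subject split: the classifier is trained on one group of performers and tested on the rest, so it is never tested on a subject seen during training. The sketch below illustrates this idea; the sample records and subject IDs are hypothetical, not drawn from any specific dataset in the survey.

```python
def cross_subject_split(samples, train_subjects):
    """Partition samples by performer identity rather than at random,
    so no test subject ever appears in the training set."""
    train = [s for s in samples if s["subject"] in train_subjects]
    test = [s for s in samples if s["subject"] not in train_subjects]
    return train, test

# Hypothetical samples: (subject ID, action label) pairs.
samples = [
    {"subject": 1, "action": "wave"},
    {"subject": 2, "action": "wave"},
    {"subject": 3, "action": "clap"},
    {"subject": 4, "action": "clap"},
]

# Train on subjects 1 and 3, test on subjects 2 and 4.
train, test = cross_subject_split(samples, train_subjects={1, 3})
print(len(train), len(test))  # 2 2
```

Splitting by subject rather than by random sampling of clips gives a harder and more realistic protocol, since it measures generalisation to unseen performers rather than memorisation of individual body shapes and motion styles.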
Conclusion
A comprehensive review of commonly used and publicly available RGB-D datasets for action recognition has been provided. The detailed descriptions and analyses, with highlights of their characteristics and potential applications, should be useful to researchers designing action recognition algorithms, especially when selecting datasets for algorithm development and evaluation, and when creating new datasets to fill identified gaps. Most of the datasets collected to date are meant
References (112)
- et al., Fuzzy human motion analysis: a review, Pattern Recognit. (2015)
- et al., Recent developments in human motion analysis, Pattern Recognit. (2003)
- et al., A survey on still image based human action recognition, Pattern Recognit. (2014)
- et al., A survey of video datasets for human action and activity recognition, Comput. Vis. Image Understand. (2013)
- et al., Human activity recognition from 3D data: a review, Pattern Recognit. Lett. (2014)
- et al., 3D flow estimation for human action recognition from colored point clouds, Biol. Inspired Cognit. Archit. (2013)
- et al., Pose-based human action recognition via sparse representation in dissimilarity space, J. Vis. Commun. Image Represent. (2014)
- et al., Evaluation of video activity localizations integrating quality and quantity measurements, Comput. Vis. Image Understand. (2014)
- et al., Triviews: a general framework to use 3D depth data effectively for action recognition, J. Vis. Commun. Image Represent. (2015)
- et al., Spatio-temporal feature extraction and representation for RGB-D human action recognition, Pattern Recognit. Lett. (2014)
- Evaluating spatiotemporal interest point features for depth-based action recognition, Image Vis. Comput.
- Hidden Markov model on a unit hypersphere space for gesture trajectory recognition, Pattern Recognit. Lett.
- Multi-perspective and multi-modality joint representation and recognition model for 3D action recognition, Neurocomputing
- A survey on activity recognition and behavior understanding in video surveillance, Vis. Comput.
- Learning human activities and object affordances from RGB-D videos, Int. J. Robot. Res.
- Exploring the trade-off between accuracy and observational latency in action recognition, Int. J. Comput. Vis.
- Inverse dynamics for action recognition, IEEE Trans. Cybern.
- RGB-depth feature for 3D human activity recognition, China Commun.
- ReadingAct RGB-D action dataset and human action recognition from local features, Pattern Recognit. Lett.
Jing Zhang received the B.S. degree in electronic science and technology from Nankai University Binhai College, China, in 2010, and the M.S. degree in information and communication engineering from Tianjin University, China, in 2014, and she is currently working toward the Ph.D. degree at the Advanced Multimedia Research Laboratory in the University of Wollongong, Australia. Her research interests include computer vision and pattern recognition.
Wanqing Li received his Ph.D. in electronic engineering from The University of Western Australia. He is an Associate Professor and Co-Director of Advanced Multimedia Research Lab (AMRL) of University of Wollongong, Australia. His research areas are 3D computer vision, 3D multimedia signal processing and medical image analysis. Dr. Li is a Senior Member of IEEE.
Philip O. Ogunbona earned his Ph.D., DIC (Electrical Engineering) from Imperial College, London. His Bachelor's degree was in Electronic and Electrical Engineering from the University of Ife, Nigeria (now named Obafemi Awolowo University). He is a Professor and Co-Director of the Advanced Multimedia Research Lab, University of Wollongong, Australia. His research interests include computer vision, pattern recognition and machine learning. He is a Senior Member of IEEE and a Fellow of the Australian Computer Society.
Pichao Wang received the B.E. degree in network engineering from Nanchang University, Nanchang, China, in 2010, and received the M.S. degree in communication and information system from Tianjin University, Tianjin, China, in 2013. He is currently pursuing the Ph.D. degree with the School of Computer Science and Software Engineering, University of Wollongong, Australia. His current research interests include computer vision and machine learning.
Chang Tang received the B.Eng. degree in school of electronic engineering from Tianjin University of Technology and Education, Tianjin, China, in 2010. He is currently pursuing the Ph.D. degree in communication engineering from the School of Electronic Information Engineering, Tianjin University. His current research interests include computer vision, computer graphics and 3-D image/video processing.