Journal of Visual Communication and Image Representation
Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition
Introduction
Human motion analysis remains one of the most important areas of research in computer vision. Over the last few decades, a large number of methods have been proposed for human motion analysis (see the surveys by Moeslund et al. [1], [2] and Turaga et al. [3], and most recently by Aggarwal and Ryoo [4], for a comprehensive analysis). In general, these methods use a parametric representation of human motion and develop algorithms for comparing and classifying different instances of human activities under these representations.
One of the most common and intuitive methods for representing human motion is a temporal sequence of approximate human skeletal configurations. The skeletal configurations represent hierarchically arranged joint kinematics with body segments reduced to straight lines. In the past, extracting accurate skeletal configurations from monocular videos was a difficult and unreliable process, especially for arbitrary human poses. Motion capture systems, on the other hand, could provide very accurate skeletal configurations of human actions based on active or passive markers positioned on the body; however, the data acquisition was limited to controlled indoor environments. Methods for human motion analysis that relied heavily on accurate skeletal data therefore became less popular over the years compared to image feature-based activity recognition methods. In the latter, spatio-temporal interest points are extracted from monocular videos and the recognition is based on statistics learned from large datasets [5], [6], [7]. Recently, with the release of several low-cost and relatively accurate 3D capturing systems, such as the Microsoft Kinect, real-time 3D data collection and skeleton extraction have become much easier and more practical for applications in natural human-computer interaction, gesture recognition and animation, thus reviving interest in the skeleton-based action representation.
Existing skeleton-based methods for human action recognition are primarily focused on modeling the dynamics of either the full skeleton or a combination of body segments. To represent the dynamics of normalized 3D positions of joints or joint angle configurations, most of the methods use linear dynamical systems (LDS) or non-linear dynamical systems (NLDS), e.g., in [8], [9], [10], or hidden Markov models (HMM), see, e.g., the earlier work by Yamato et al. [11] and a review of several others in [12]. Recently, Taylor et al. [13], [14] proposed using conditional restricted Boltzmann machines (CRBM) to model the temporal evolution of human actions. While these methods have been very successful for both human activity synthesis and recognition, their representation of human motion is in general not easy to interpret in connection with the physical and qualitative properties of the motion. For example, the parameters obtained from the LDS modeling of the skeletal joint trajectories will likely describe positions and velocities of the individual joints, which do not directly convey any information about the changes in the skeletal configuration of the human body as the action is performed.
When humans perform an action, we can observe that each individual performs the same action with a different style, generating dissimilar joint trajectories; however, all individuals activate the same set of joints contributing to the overall movement, roughly in the same order. In our approach we take advantage of this observation to capture invariances in human skeletal motion for a given action. Given an action, we propose to identify the most informative joints in a particular temporal window by finding the relative informativeness of all the joints in that window. We can quantify the informativeness of a joint using, for example, the entropy of its joint angle time series. In the case of a Gaussian random variable, the entropy is a monotonically increasing function of the variance (it equals half the logarithm of the variance plus a constant). Therefore, the joint that has the highest variance of motion, as captured by the change in the joint angle, can be defined as the most informative, assuming the joint angle data are independent and identically distributed (i.i.d.) samples from a one-dimensional Gaussian. Such a notion of informativeness is very intuitive and interpretable. During the performance of an action, we can observe that different joints are activated at different times and to varying degrees. Therefore, the ordered sequence of informative joints in a full skeletal motion implicitly encodes the temporal dynamics of the motion.
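The variance-based ranking described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `smij`, the `(frames, joints)` array layout, the split into a fixed number of roughly equal temporal segments, and the `top_k` parameter are all assumptions made for the example. Only the use of joint angle variance as the measure of informativeness comes from the text.

```python
import numpy as np


def smij(joint_angles, num_segments, top_k=1):
    """Rank joints by joint angle variance within each temporal segment.

    joint_angles: array of shape (frames, joints), one angle per joint per
    frame (hypothetical layout). Returns one tuple per segment holding the
    indices of the top_k joints, ordered from highest to lowest variance.
    """
    angles = np.asarray(joint_angles, dtype=float)
    # Assumption: segments are contiguous and of (nearly) equal length.
    segments = np.array_split(angles, num_segments, axis=0)
    sequence = []
    for segment in segments:
        variances = segment.var(axis=0)  # per-joint variance in this window
        # Stable sort on negated variances: highest variance first,
        # ties resolved by the lower joint index.
        order = np.argsort(-variances, kind="stable")[:top_k]
        sequence.append(tuple(int(j) for j in order))
    return sequence


# Toy motion: joint 0 moves in the first half, joint 2 in the second half,
# joint 1 jitters slightly throughout.
angles = np.zeros((8, 3))
angles[:4, 0] = [0.0, 1.0, 0.0, 1.0]
angles[4:, 2] = [0.0, 2.0, 0.0, 2.0]
angles[:, 1] = [0.0, 0.1] * 4
print(smij(angles, num_segments=2, top_k=2))
```

In this toy input the ranking changes from joint 0 to joint 2 between the two segments, which is the kind of temporal ordering the representation is meant to capture.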
Based on these properties, we recently proposed in [16] the sequence of the most informative joints (SMIJ) as a new representation for human motion based on the temporal ordering of joints that are deemed to be the most informative for performing an action. In [16], we briefly compared the performance of the SMIJ representation to other action representations, based on the histograms of motion words, as well as the methods that explicitly model the dynamics of the skeletal motion. In this paper, we provide a more comprehensive description of the SMIJ representation and other feature representations and further evaluate their quality using action classification as our performance test. In addition, we propose a different metric for comparison of SMIJ features, based on normalized edit distance [17], which outperforms the normalized Levenshtein distance, applied in our previous work. Furthermore, we show that our simple yet highly intuitive and interpretable representation performs much better than standard methods for the task of action recognition from skeletal motion data.
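To make the difference between the two metrics concrete, the sketch below compares two sequences of joint labels under both. It rests on several assumptions that are not stated in the text: unit costs for insertion, deletion and substitution; normalization of the Levenshtein distance by the length of the longer sequence; and the usual definition of normalized edit distance as the minimum, over all edit paths, of path weight divided by path length (counting matches as zero-cost operations). Whether these match the exact variants used in the experiments is not confirmed by the source, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic unit-cost edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,             # delete x
                            curr[j - 1] + 1,         # insert y
                            prev[j - 1] + (x != y))) # substitute or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Assumption: normalize by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def normalized_edit_distance(a, b):
    """Minimum over edit paths of (path weight) / (path length)."""
    m, n = len(a), len(b)
    if m == 0 and n == 0:
        return 0.0
    inf = float("inf")
    # cost[i][j]: least weight of a path with exactly k operations that
    # turns a[:i] into b[:j]; starts at k = 0.
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    best = inf
    for k in range(1, m + n + 1):
        new = [[inf] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                c = inf
                if i > 0:
                    c = min(c, cost[i - 1][j] + 1)  # deletion
                if j > 0:
                    c = min(c, cost[i][j - 1] + 1)  # insertion
                if i > 0 and j > 0:                 # substitution or match
                    c = min(c, cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
                new[i][j] = c
        cost = new
        if cost[m][n] < inf:
            best = min(best, cost[m][n] / k)
    return best


x = ["hip", "knee"]
y = ["knee", "hip"]
print(normalized_levenshtein(x, y), normalized_edit_distance(x, y))
```

For the swapped pair above, the length-normalized Levenshtein distance is 1.0, while the path-normalized distance is 2/3, because the path "delete, match, insert" spreads a weight of 2 over three operations. The cubic-time dynamic program is chosen for readability, not efficiency.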
Section snippets
Sequence of the most informative joints (SMIJ)
The human body is an articulated system that can be represented by a hierarchy of joints that are connected with bones, forming a skeleton. Different joint configurations produce different skeletal poses and a time series of these poses yields the skeletal motion. An action can thus simply be described as a collection of time series of 3D positions (i.e., 3D trajectories) of the joints in the skeleton hierarchy. This representation, however, lacks important properties such as invariance with
Alternative feature representations
In this section, we briefly describe three alternative feature representations against which we compare the results of the proposed SMIJ representation. We consider two standard methods, linear dynamical system parameters (LDSP) with a linear system identification approach and histogram of motion words (HMW) with a bag-of-words model. In addition we compare our results with the histogram of the most informative joints (HMIJ) which was also proposed originally in [16] as an alternative to SMIJ.
Experiments
In this section we compare our proposed feature representation SMIJ (described in Section 2) against the baseline feature representations (explained in Section 3) on the datasets outlined in Section 4.1 using action recognition as a test, and provide experimental results in Section 4.2.
Conclusions
We have proposed a very intuitive and qualitatively interpretable skeletal motion feature representation, called sequence of the most informative joints (SMIJ). Unlike most feature representations used for human motion analysis, which rely on sets of parameters that have no physical meaning, the SMIJ representation has a very specific practical interpretation, i.e., the ordering of the joints by their informativeness and their temporal evolution for a given action. More specifically, in the
Acknowledgments
This work is supported in part by the European Research Council grant VideoWorld as well as the grants NSF 0941362, NSF 0941463, NSF 0941382 and ONR N000141310116.
References (43)
- et al., A survey of computer vision-based human motion capture, Computer Vision and Image Understanding (CVIU) (2001)
- et al., A survey of advances in vision-based human motion capture and analysis, Computer Vision and Image Understanding (CVIU) (2006)
- et al., Machine recognition of human activities: a survey, IEEE Transactions on Circuits and Systems for Video Technology (2008)
- et al., Human activity analysis: a review, ACM Computing Surveys (2011)
- P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in:...
- On space-time interest points, International Journal of Computer Vision (IJCV) (2005)
- I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: Proceedings of IEEE...
- A. Bissacco, A. Chiuso, Y. Ma, S. Soatto, Recognition of human gaits, in: Proceedings of IEEE Conference on Computer...
- et al., Classification and recognition of dynamical models: the role of phase, independent components, kernels and optimal transport, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) (2007)
- S. Ali, A. Basharat, M. Shah, Chaotic invariants for human action recognition, in: Proceedings of IEEE International...
- Human motion analysis: a review, Computer Vision and Image Understanding (CVIU)
- Elements of Information Theory
- Computation of normalized edit distance and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)