Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition

https://doi.org/10.1016/j.jvcir.2013.04.007

Highlights

  • We propose a new representation of human actions that is highly interpretable.

  • We represent a given action as a sequence of the most informative joints (SMIJ).

  • SMIJ is successful at capturing the invariances in different human actions.

  • SMIJ outperforms standard methods on the task of action recognition from skeletal data.

  • SMIJ is resilient to dataset bias and generalizes well across different datasets.

Abstract

Much of the existing work on action recognition combines simple features with complex classifiers or models to represent an action. The parameters of such models usually have no physical meaning, nor do they provide any qualitative insight relating the action to the actual motion of the body or its parts. In this paper, we propose a new representation of human actions called the sequence of the most informative joints (SMIJ), which is extremely easy to interpret. At each time instant, we automatically select a few skeletal joints that are deemed to be the most informative for performing the current action, based on highly interpretable measures such as the mean or variance of joint angle trajectories. We then represent the action as a sequence of these most informative joints. Experiments on multiple databases show that the SMIJ representation is discriminative for human action recognition and performs better than several state-of-the-art algorithms.

Introduction

Human motion analysis has remained one of the most important areas of research in computer vision. Over the last few decades, a large number of methods have been proposed for human motion analysis (see the surveys by Moeslund et al. [1], [2] and Turaga et al. [3] and, most recently, by Aggarwal and Ryoo [4] for a comprehensive analysis). In general, all methods use a parametric representation of human motion and develop algorithms for comparing and classifying different instances of human activities under these representations.

One of the most common and intuitive methods for representing human motion is a temporal sequence of approximate human skeletal configurations. The skeletal configurations represent hierarchically arranged joint kinematics with body segments reduced to straight lines. In the past, extracting accurate skeletal configurations from monocular videos was a difficult and unreliable process, especially for arbitrary human poses. Motion capture systems, on the other hand, could provide very accurate skeletal configurations of human actions based on active or passive markers positioned on the body; however, the data acquisition was limited to controlled indoor environments. Methods for human motion analysis that relied heavily on accurate skeletal data therefore became less popular over the years compared to image feature-based activity recognition methods, in which spatio-temporal interest points are extracted from monocular videos and recognition is based on statistics learned from large datasets [5], [6], [7]. Recently, with the release of several low-cost and relatively accurate 3D capturing systems, such as the Microsoft Kinect, real-time 3D data collection and skeleton extraction have become much easier and more practical for applications such as natural human–computer interaction, gesture recognition, and animation, thus reviving interest in skeleton-based action representations.

Existing skeleton-based methods for human action recognition focus primarily on modeling the dynamics of either the full skeleton or a combination of body segments. To represent the dynamics of normalized 3D joint positions or joint angle configurations, most methods use linear dynamical systems (LDS) or non-linear dynamical systems (NLDS), e.g., in [8], [9], [10], or hidden Markov models (HMM); see, e.g., the earlier work by Yamato et al. [11] and a review of several others in [12]. Recently, Taylor et al. [13], [14] proposed using conditional restricted Boltzmann machines (CRBM) to model the temporal evolution of human actions. While these methods have been very successful for both human activity synthesis and recognition, their representation of human motion is in general not easy to interpret in connection with the physical and qualitative properties of the motion. For example, the parameters obtained from LDS modeling of the skeletal joint trajectories will likely describe positions and velocities of the individual joints, which do not directly convey any information about the changes in the skeletal configuration of the human body as the action is performed.
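
To see why such parameters are hard to interpret, consider a minimal sketch of least-squares identification of a linear dynamical system over stacked joint coordinates. The function name, state layout, and dimensions below are illustrative assumptions, not the implementation used in [8], [9], [10]:

```python
import numpy as np

def fit_lds(states):
    """Least-squares identification of x_{t+1} = A x_t for a state
    sequence `states` of shape (T, d). Returns the transition matrix A."""
    past, future = states[:-1], states[1:]
    # Solve future ≈ past @ X in the least-squares sense; then A = X.T.
    X, *_ = np.linalg.lstsq(past, future, rcond=None)
    return X.T

# A state built by stacking the 3D positions of J joints per frame
# (placeholder random trajectories; T and J are illustrative values).
T, J = 100, 15
states = np.random.randn(T, 3 * J)
A = fit_lds(states)
print(A.shape)   # (45, 45): entries couple raw coordinates and carry
                 # no direct physical meaning, as discussed above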

When humans perform an action, we can observe that each individual performs the same action with a different style, generating dissimilar joint trajectories; however, all individuals activate the same set of joints contributing to the overall movement, roughly in the same order. In our approach, we take advantage of this observation to capture invariances in human skeletal motion for a given action. Given an action, we propose to identify the most informative joints in a particular temporal window by finding the relative informativeness of all the joints in that window. We can quantify the informativeness of a joint using, for example, the entropy of its joint angle time series. In the case of a Gaussian random variable with variance σ², the entropy is (1/2) log(2πe σ²), i.e., an increasing function of the variance. Therefore, assuming the joint angle data are independent and identically distributed (i.i.d.) samples from a one-dimensional Gaussian, the joint that has the highest variance of motion, as captured by the change in the joint angle, can be defined as the most informative. Such a notion of informativeness is very intuitive and interpretable. As an action is performed, different joints are activated at different times and to varying degrees; the ordered sequence of informative joints over the full skeletal motion therefore implicitly encodes the temporal dynamics of the motion.
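
The following sketch makes this selection concrete under the variance-based notion of informativeness described above. The segmentation into a fixed number of temporal windows, the array layout, and the parameter values are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def smij(angles, num_segments=10, top_n=3):
    """Sequence of the most informative joints (sketch).

    angles: (T, J) array of joint-angle trajectories for one action.
    The action is split into temporal segments; in each segment the
    joints are ranked by the variance of their angle time series (the
    Gaussian-entropy proxy discussed above) and the top_n joint indices
    are kept. Returns a list of num_segments tuples of joint indices.
    """
    sequence = []
    for segment in np.array_split(angles, num_segments, axis=0):
        variances = segment.var(axis=0)        # one variance per joint
        ranked = np.argsort(variances)[::-1]   # most informative first
        sequence.append(tuple(int(j) for j in ranked[:top_n]))
    return sequence

# Toy usage: 120 frames of synthetic angle data for 20 joints.
angles = np.cumsum(np.random.randn(120, 20), axis=0)
print(smij(angles))   # e.g., [(7, 3, 12), (7, 12, 3), ...]
```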

Based on these properties, we recently proposed in [16] the sequence of the most informative joints (SMIJ) as a new representation for human motion based on the temporal ordering of the joints that are deemed to be the most informative for performing an action. In [16], we briefly compared the performance of the SMIJ representation to other action representations based on histograms of motion words, as well as to methods that explicitly model the dynamics of the skeletal motion. In this paper, we provide a more comprehensive description of the SMIJ representation and other feature representations, and further evaluate their quality using action classification as our performance test. In addition, we propose a different metric for comparing SMIJ features, based on the normalized edit distance [17], which outperforms the normalized Levenshtein distance used in our previous work. Furthermore, we show that our simple yet highly intuitive and interpretable representation performs much better than standard methods for the task of action recognition from skeletal motion data.
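
To make the comparison metric concrete, here is a minimal sketch of the normalized edit distance in the sense of Marzal and Vidal [17], which minimizes the ratio of path cost to path length over all edit paths, rather than dividing a plain Levenshtein distance by the sequence length after the fact. The unit operation costs and the dynamic program over path lengths are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def normalized_edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
    """Normalized edit distance (sketch, after Marzal and Vidal [17]).

    Minimizes (path cost / path length) over all edit paths between the
    symbol sequences a and b. Matches cost 0; other costs are illustrative.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        return 0.0
    INF = float("inf")
    # prev[i, j]: minimal cost of an edit path of exactly L - 1 steps
    # turning a[:i] into b[:j]; only L >= max(n, m) can reach (n, m).
    prev = np.full((n + 1, m + 1), INF)
    prev[0, 0] = 0.0
    best = INF
    for L in range(1, n + m + 1):
        cur = np.full((n + 1, m + 1), INF)
        for i in range(n + 1):
            for j in range(m + 1):
                c = INF
                if i > 0:
                    c = min(c, prev[i - 1, j] + dele)       # delete a[i-1]
                if j > 0:
                    c = min(c, prev[i, j - 1] + ins)        # insert b[j-1]
                if i > 0 and j > 0:
                    step = 0.0 if a[i - 1] == b[j - 1] else sub
                    c = min(c, prev[i - 1, j - 1] + step)   # match / substitute
                cur[i, j] = c
        if cur[n, m] < INF:
            best = min(best, cur[n, m] / L)
        prev = cur
    return best

# SMIJ sequences are sequences of joint-index tuples; tuples compare directly.
s1 = [(7, 3), (7, 12), (5, 12)]
s2 = [(7, 3), (5, 12)]
print(normalized_edit_distance(s1, s2))   # 1/3: one deletion over a 3-step path
```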

Section snippets

Sequence of the most informative joints (SMIJ)

The human body is an articulated system that can be represented by a hierarchy of joints that are connected with bones, forming a skeleton. Different joint configurations produce different skeletal poses, and a time series of these poses yields the skeletal motion. An action can thus simply be described as a collection of time series of 3D positions (i.e., 3D trajectories) of the joints in the skeleton hierarchy. This representation, however, lacks important properties such as invariance with …
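
In code, this representation is simply a three-way array of joint trajectories. Below is a minimal sketch, with the frame and joint counts as placeholder assumptions, together with one possible joint-angle computation of the kind that would restore such invariances (the paper's own angle parameterization may differ):

```python
import numpy as np

# One action as a time series of skeletal poses: T frames, J joints, 3D each.
# T and J are placeholder values, not tied to any particular dataset.
T, J = 150, 20
action = np.random.randn(T, J, 3)   # action[t, j] = (x, y, z) of joint j at frame t

def joint_angle(parent, joint, child):
    """Per-frame angle at `joint` between the bones joint->parent and
    joint->child; such angles are invariant to the skeleton's global
    position and scale, unlike raw 3D joint positions."""
    u, v = parent - joint, child - joint
    cos = (u * v).sum(axis=-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# e.g., an elbow angle from shoulder, elbow, and wrist trajectories
# (joint indices 3, 4, 5 are arbitrary examples):
elbow = joint_angle(action[:, 3], action[:, 4], action[:, 5])
print(elbow.shape)   # (150,): one angle per frame
```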

Alternative feature representations

In this section, we briefly describe three alternative feature representations against which we compare the results of the proposed SMIJ representation. We consider two standard methods: linear dynamical system parameters (LDSP), obtained with a linear system identification approach, and histograms of motion words (HMW), obtained with a bag-of-words model. In addition, we compare our results with the histogram of the most informative joints (HMIJ), which was also originally proposed in [16] as an alternative to SMIJ; a sketch of HMIJ follows below.
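
For the HMIJ baseline in particular, a minimal sketch (reusing the hypothetical smij function from the earlier sketch) shows how histogramming discards the temporal ordering that SMIJ retains:

```python
import numpy as np

def hmij(smij_sequence, num_joints):
    """Histogram of the most informative joints (sketch).

    Counts how often each joint appears among the top-ranked joints
    across the temporal segments of an action, discarding the ordering
    that SMIJ keeps. Returns an L1-normalized histogram."""
    counts = np.zeros(num_joints)
    for segment_joints in smij_sequence:
        for j in segment_joints:
            counts[j] += 1
    return counts / counts.sum()

# Using a toy SMIJ sequence of per-segment joint-index tuples:
sequence = [(7, 3, 12), (7, 12, 3), (5, 12, 7)]
print(hmij(sequence, num_joints=20))
```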

Experiments

In this section, we compare our proposed feature representation, SMIJ (described in Section 2), against the baseline feature representations (explained in Section 3) on the datasets outlined in Section 4.1, using action recognition as a test, and provide experimental results in Section 4.2.

Conclusions

We have proposed a very intuitive and qualitatively interpretable skeletal motion feature representation, called the sequence of the most informative joints (SMIJ). Unlike most feature representations used for human motion analysis, which rely on sets of parameters that have no physical meaning, the SMIJ representation has a very specific practical interpretation, i.e., the ordering of the joints by their informativeness and their temporal evolution for a given action. More specifically, in the …

Acknowledgments

This work is supported in part by the European Research Council grant VideoWorld as well as the grants NSF 0941362, NSF 0941463, NSF 0941382 and ONR N000141310116.

References (43)

  • T.B. Moeslund et al., A survey of computer vision-based human motion capture, Computer Vision and Image Understanding (CVIU), 2001.
  • T.B. Moeslund et al., A survey of advances in vision-based human motion capture and analysis, Computer Vision and Image Understanding (CVIU), 2006.
  • P. Turaga et al., Machine recognition of human activities: a survey, IEEE Transactions on Circuits and Systems for Video Technology, 2008.
  • J. Aggarwal et al., Human activity analysis: a review, ACM Computing Surveys, 2011.
  • P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: ...
  • I. Laptev, On space-time interest points, International Journal of Computer Vision (IJCV), 2005.
  • I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: Proceedings of IEEE ...
  • A. Bissacco, A. Chiuso, Y. Ma, S. Soatto, Recognition of human gaits, in: Proceedings of IEEE Conference on Computer ...
  • A. Bissacco et al., Classification and recognition of dynamical models: the role of phase, independent components, kernels and optimal transport, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2007.
  • S. Ali, A. Basharat, M. Shah, Chaotic invariants for human action recognition, in: Proceedings of IEEE International ...
  • J. Yamato, J. Ohya, K. Ishii, Recognizing human action in time-sequential images using hidden Markov model, in: ...
  • J.K. Aggarwal et al., Human motion analysis: a review, Computer Vision and Image Understanding (CVIU), 1999.
  • G.W. Taylor, G.E. Hinton, S. Roweis, Modeling human motion using binary latent variables, in: Proceedings of Neural ...
  • G.W. Taylor, G.E. Hinton, Factored conditional restricted Boltzmann machines for modeling motion style, in: Proceedings ...
  • T. Cover et al., Elements of Information Theory, 2006.
  • F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, R. Bajcsy, Sequence of the most informative joints (SMIJ): a new ...
  • A. Marzal et al., Computation of normalized edit distance and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 1993.
  • P. Beaudoin, S. Coros, M. van de Panne, P. Poulin, Motion-motif graphs, in: Proceedings of the 2008 ACM ...
  • M. Müller, A. Baak, H.-P. Seidel, Efficient and robust annotation of motion capture data, in: Proceedings of the ACM ...
  • J. Barbič, A. Safonova, J.-Y. Pan, C. Faloutsos, J.K. Hodgins, N.S. Pollard, Segmenting motion capture data into ...
  • A. López-Méndez, J. Gall, J. Casas, L.V. Gool, Metric learning from poses for temporal clustering of human motion, in: ...