Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition

https://doi.org/10.1016/j.jvcir.2013.04.007

Highlights

  • We propose a new representation of human actions that is highly interpretable.

  • We represent a given action as a sequence of the most informative joints (SMIJ).

  • SMIJ is successful at capturing the invariances in different human actions.

  • SMIJ outperforms standard methods on the task of action recognition from skeletal data.

  • SMIJ is resilient to dataset bias and generalizes well across different datasets.

Abstract

Much of the existing work on action recognition combines simple features with complex classifiers or models to represent an action. The parameters of such models usually have no physical meaning, nor do they provide any qualitative insight relating the action to the actual motion of the body or its parts. In this paper, we propose a new representation of human actions called the sequence of the most informative joints (SMIJ), which is extremely easy to interpret. At each time instant, we automatically select a few skeletal joints that are deemed to be the most informative for performing the current action, based on highly interpretable measures such as the mean or variance of joint angle trajectories. We then represent the action as a sequence of these most informative joints. Experiments on multiple databases show that the SMIJ representation is discriminative for human action recognition and performs better than several state-of-the-art algorithms.

Introduction

Human motion analysis has remained one of the most important areas of research in computer vision. Over the last few decades, a large number of methods have been proposed for human motion analysis (see the surveys by Moeslund et al. [1], [2] and Turaga et al. [3] and, most recently, by Aggarwal and Ryoo [4] for a comprehensive analysis). In general, all methods use a parametric representation of human motion and develop algorithms for comparing and classifying different instances of human activities under these representations.

One of the most common and intuitive methods for representing human motion is a temporal sequence of approximate human skeletal configurations. The skeletal configurations represent hierarchically arranged joint kinematics with body segments reduced to straight lines. In the past, extracting accurate skeletal configurations from monocular videos was a difficult and unreliable process, especially for arbitrary human poses. Motion capture systems, on the other hand, could provide very accurate skeletal configurations of human actions based on active or passive markers positioned on the body; however, the data acquisition was limited to controlled indoor environments. Methods for human motion analysis that relied heavily on accurate skeletal data therefore became less popular over the years compared to image feature-based activity recognition methods, in which spatio-temporal interest points are extracted from monocular videos and recognition is based on statistics learned from large datasets [5], [6], [7]. Recently, with the release of several low-cost and relatively accurate 3D capturing systems, such as the Microsoft Kinect, real-time 3D data collection and skeleton extraction have become much easier and more practical for applications such as natural human–computer interaction, gesture recognition, and animation, thus reviving interest in skeleton-based action representations.

Existing skeleton-based methods for human action recognition focus primarily on modeling the dynamics of either the full skeleton or a combination of body segments. To represent the dynamics of normalized 3D joint positions or joint angle configurations, most methods use linear dynamical systems (LDS) or non-linear dynamical systems (NLDS), e.g., in [8], [9], [10], or hidden Markov models (HMM); see, e.g., the earlier work by Yamato et al. [11] and a review of several others in [12]. Recently, Taylor et al. [13], [14] proposed using conditional restricted Boltzmann machines (CRBM) to model the temporal evolution of human actions. While these methods have been very successful for both human activity synthesis and recognition, their representation of human motion is in general not easy to interpret in connection with the physical and qualitative properties of the motion. For example, the parameters obtained from LDS modeling of the skeletal joint trajectories will likely describe positions and velocities of the individual joints, which do not directly convey any information about the changes in the skeletal configuration of the human body as the action is performed.
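
To see why such parameters are hard to interpret, consider a minimal sketch of least-squares identification of a linear dynamical system over stacked joint coordinates. The function name, state layout, and dimensions below are illustrative assumptions, not the implementation used in [8], [9], [10]:

```python
import numpy as np

def fit_lds(states):
    """Least-squares identification of x_{t+1} = A x_t for a state
    sequence `states` of shape (T, d). Returns the transition matrix A."""
    past, future = states[:-1], states[1:]
    # Solve future ≈ past @ X in the least-squares sense; then A = X.T.
    X, *_ = np.linalg.lstsq(past, future, rcond=None)
    return X.T

# A state built by stacking the 3D positions of J joints per frame
# (placeholder random trajectories; T and J are illustrative values).
T, J = 100, 15
states = np.random.randn(T, 3 * J)
A = fit_lds(states)
print(A.shape)   # (45, 45): entries couple raw coordinates and carry
                 # no direct physical meaning, as discussed above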

When humans perform an action, we can observe that each individual performs the same action with a different style, generating dissimilar joint trajectories; however, all individuals activate the same set of joints contributing to the overall movement, roughly in the same order. In our approach, we take advantage of this observation to capture invariances in human skeletal motion for a given action. Given an action, we propose to identify the most informative joints in a particular temporal window by finding the relative informativeness of all the joints in that window. We can quantify the informativeness of a joint using, for example, the entropy of its joint angle time series. In the case of a Gaussian random variable with variance σ², the entropy is (1/2) log(2πe σ²), i.e., an increasing function of the variance. Therefore, assuming the joint angle data are independent and identically distributed (i.i.d.) samples from a one-dimensional Gaussian, the joint that has the highest variance of motion, as captured by the change in the joint angle, can be defined as the most informative. Such a notion of informativeness is very intuitive and interpretable. As an action is performed, different joints are activated at different times and to varying degrees; the ordered sequence of informative joints over the full skeletal motion therefore implicitly encodes the temporal dynamics of the motion.
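
The following sketch makes this selection concrete under the variance-based notion of informativeness described above. The segmentation into a fixed number of temporal windows, the array layout, and the parameter values are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def smij(angles, num_segments=10, top_n=3):
    """Sequence of the most informative joints (sketch).

    angles: (T, J) array of joint-angle trajectories for one action.
    The action is split into temporal segments; in each segment the
    joints are ranked by the variance of their angle time series (the
    Gaussian-entropy proxy discussed above) and the top_n joint indices
    are kept. Returns a list of num_segments tuples of joint indices.
    """
    sequence = []
    for segment in np.array_split(angles, num_segments, axis=0):
        variances = segment.var(axis=0)        # one variance per joint
        ranked = np.argsort(variances)[::-1]   # most informative first
        sequence.append(tuple(int(j) for j in ranked[:top_n]))
    return sequence

# Toy usage: 120 frames of synthetic angle data for 20 joints.
angles = np.cumsum(np.random.randn(120, 20), axis=0)
print(smij(angles))   # e.g., [(7, 3, 12), (7, 12, 3), ...]
```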

Based on these properties, we recently proposed in [16] the sequence of the most informative joints (SMIJ) as a new representation for human motion based on the temporal ordering of the joints that are deemed to be the most informative for performing an action. In [16], we briefly compared the performance of the SMIJ representation to other action representations based on histograms of motion words, as well as to methods that explicitly model the dynamics of the skeletal motion. In this paper, we provide a more comprehensive description of the SMIJ representation and other feature representations, and further evaluate their quality using action classification as our performance test. In addition, we propose a different metric for comparing SMIJ features, based on the normalized edit distance [17], which outperforms the normalized Levenshtein distance used in our previous work. Furthermore, we show that our simple yet highly intuitive and interpretable representation performs much better than standard methods for the task of action recognition from skeletal motion data.
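
To make the comparison metric concrete, here is a minimal sketch of the normalized edit distance in the sense of Marzal and Vidal [17], which minimizes the ratio of path cost to path length over all edit paths, rather than dividing a plain Levenshtein distance by the sequence length after the fact. The unit operation costs and the dynamic program over path lengths are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def normalized_edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
    """Normalized edit distance (sketch, after Marzal and Vidal [17]).

    Minimizes (path cost / path length) over all edit paths between the
    symbol sequences a and b. Matches cost 0; other costs are illustrative.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        return 0.0
    INF = float("inf")
    # prev[i, j]: minimal cost of an edit path of exactly L - 1 steps
    # turning a[:i] into b[:j]; only L >= max(n, m) can reach (n, m).
    prev = np.full((n + 1, m + 1), INF)
    prev[0, 0] = 0.0
    best = INF
    for L in range(1, n + m + 1):
        cur = np.full((n + 1, m + 1), INF)
        for i in range(n + 1):
            for j in range(m + 1):
                c = INF
                if i > 0:
                    c = min(c, prev[i - 1, j] + dele)       # delete a[i-1]
                if j > 0:
                    c = min(c, prev[i, j - 1] + ins)        # insert b[j-1]
                if i > 0 and j > 0:
                    step = 0.0 if a[i - 1] == b[j - 1] else sub
                    c = min(c, prev[i - 1, j - 1] + step)   # match / substitute
                cur[i, j] = c
        if cur[n, m] < INF:
            best = min(best, cur[n, m] / L)
        prev = cur
    return best

# SMIJ sequences are sequences of joint-index tuples; tuples compare directly.
s1 = [(7, 3), (7, 12), (5, 12)]
s2 = [(7, 3), (5, 12)]
print(normalized_edit_distance(s1, s2))   # 1/3: one deletion over a 3-step path
```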

Section snippets

Sequence of the most informative joints (SMIJ)

The human body is an articulated system that can be represented by a hierarchy of joints that are connected with bones, forming a skeleton. Different joint configurations produce different skeletal poses, and a time series of these poses yields the skeletal motion. An action can thus simply be described as a collection of time series of 3D positions (i.e., 3D trajectories) of the joints in the skeleton hierarchy. This representation, however, lacks important properties such as invariance with …
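
In code, this representation is simply a three-way array of joint trajectories. Below is a minimal sketch, with the frame and joint counts as placeholder assumptions, together with one possible joint-angle computation of the kind that would restore such invariances (the paper's own angle parameterization may differ):

```python
import numpy as np

# One action as a time series of skeletal poses: T frames, J joints, 3D each.
# T and J are placeholder values, not tied to any particular dataset.
T, J = 150, 20
action = np.random.randn(T, J, 3)   # action[t, j] = (x, y, z) of joint j at frame t

def joint_angle(parent, joint, child):
    """Per-frame angle at `joint` between the bones joint->parent and
    joint->child; such angles are invariant to the skeleton's global
    position and scale, unlike raw 3D joint positions."""
    u, v = parent - joint, child - joint
    cos = (u * v).sum(axis=-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# e.g., an elbow angle from shoulder, elbow, and wrist trajectories
# (joint indices 3, 4, 5 are arbitrary examples):
elbow = joint_angle(action[:, 3], action[:, 4], action[:, 5])
print(elbow.shape)   # (150,): one angle per frame
```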

Alternative feature representations

In this section, we briefly describe three alternative feature representations against which we compare the results of the proposed SMIJ representation. We consider two standard methods: linear dynamical system parameters (LDSP), obtained with a linear system identification approach, and histograms of motion words (HMW), obtained with a bag-of-words model. In addition, we compare our results with the histogram of the most informative joints (HMIJ), which was also originally proposed in [16] as an alternative to SMIJ; a sketch of HMIJ follows below.
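
For the HMIJ baseline in particular, a minimal sketch (reusing the hypothetical smij function from the earlier sketch) shows how histogramming discards the temporal ordering that SMIJ retains:

```python
import numpy as np

def hmij(smij_sequence, num_joints):
    """Histogram of the most informative joints (sketch).

    Counts how often each joint appears among the top-ranked joints
    across the temporal segments of an action, discarding the ordering
    that SMIJ keeps. Returns an L1-normalized histogram."""
    counts = np.zeros(num_joints)
    for segment_joints in smij_sequence:
        for j in segment_joints:
            counts[j] += 1
    return counts / counts.sum()

# Using a toy SMIJ sequence of per-segment joint-index tuples:
sequence = [(7, 3, 12), (7, 12, 3), (5, 12, 7)]
print(hmij(sequence, num_joints=20))
```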

Experiments

In this section, we compare our proposed feature representation, SMIJ (described in Section 2), against the baseline feature representations (explained in Section 3) on the datasets outlined in Section 4.1, using action recognition as a test, and provide experimental results in Section 4.2.

Conclusions

We have proposed a very intuitive and qualitatively interpretable skeletal motion feature representation, called the sequence of the most informative joints (SMIJ). Unlike most feature representations used for human motion analysis, which rely on sets of parameters that have no physical meaning, the SMIJ representation has a very specific practical interpretation, i.e., the ordering of the joints by their informativeness and their temporal evolution for a given action. More specifically, in the …

Acknowledgments

This work is supported in part by the European Research Council grant VideoWorld as well as the grants NSF 0941362, NSF 0941463, NSF 0941382 and ONR N000141310116.

References (43)

  • T.B. Moeslund et al., A survey of computer vision-based human motion capture, Computer Vision and Image Understanding (CVIU), 2001.
  • T.B. Moeslund et al., A survey of advances in vision-based human motion capture and analysis, Computer Vision and Image Understanding (CVIU), 2006.
  • P. Turaga et al., Machine recognition of human activities: a survey, IEEE Transactions on Circuits and Systems for Video Technology, 2008.
  • J. Aggarwal et al., Human activity analysis: a review, ACM Computing Surveys, 2011.
  • P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: ...
  • I. Laptev, On space-time interest points, International Journal of Computer Vision (IJCV), 2005.
  • I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: Proceedings of IEEE ...
  • A. Bissacco, A. Chiuso, Y. Ma, S. Soatto, Recognition of human gaits, in: Proceedings of IEEE Conference on Computer ...
  • A. Bissacco et al., Classification and recognition of dynamical models: the role of phase, independent components, kernels and optimal transport, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2007.
  • S. Ali, A. Basharat, M. Shah, Chaotic invariants for human action recognition, in: Proceedings of IEEE International ...
  • J. Yamato, J. Ohya, K. Ishii, Recognizing human action in time-sequential images using hidden Markov model, in: ...
  • J.K. Aggarwal et al., Human motion analysis: a review, Computer Vision and Image Understanding (CVIU), 1999.
  • G.W. Taylor, G.E. Hinton, S. Roweis, Modeling human motion using binary latent variables, in: Proceedings of Neural ...
  • G.W. Taylor, G.E. Hinton, Factored conditional restricted Boltzmann machines for modeling motion style, in: Proceedings ...
  • T. Cover et al., Elements of Information Theory, 2006.
  • F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, R. Bajcsy, Sequence of the most informative joints (SMIJ): a new ...
  • A. Marzal et al., Computation of normalized edit distance and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 1993.
  • P. Beaudoin, S. Coros, M. van de Panne, P. Poulin, Motion-motif graphs, in: Proceedings of the 2008 ACM ...
  • M. Müller, A. Baak, H.-P. Seidel, Efficient and robust annotation of motion capture data, in: Proceedings of the ACM ...
  • J. Barbič, A. Safonova, J.-Y. Pan, C. Faloutsos, J.K. Hodgins, N.S. Pollard, Segmenting motion capture data into ...
  • A. López-Méndez, J. Gall, J. Casas, L.V. Gool, Metric learning from poses for temporal clustering of human motion, in: ...