Hand pose recognition from monocular images by geometrical and texture analysis

https://doi.org/10.1016/j.jvlc.2014.12.001

Highlights

  • Proposed a novel hand pose recognition scheme for HCI.

  • Proposed an object-based video abstraction method for hand segmentation.

  • Abduction angle variations are modeled by geometrical features.

  • Flexion angle variations are modeled by analyzing textures of the fingers.

  • Achieved 99% and 97% recognition rates for one-hand and two-hand poses, respectively.

Abstract

One challenging research problem of hand pose recognition is the accurate detection of finger abduction and flexion with a single camera. The detection of flexion movements from a 2D image is difficult, because it involves estimation of finger movements along the optical axis of the camera (z direction). In this paper, a novel approach to hand pose recognition is proposed. We use the concept of object-based video abstraction for segmenting the frames into video object planes (VOPs), as used in MPEG-4, with each VOP corresponding to one semantically meaningful hand position. Subsequently, a particular hand pose is recognized by analyzing the key geometrical features and the textures of the hand. The abduction and adduction movements of the fingers are analyzed by considering a skeletal model. Probabilistic distributions of the geometric features are considered for modeling intra-class abduction and adduction variations. Additionally, gestures differing in flexion positions of the fingers are classified by texture analysis using homogeneous texture descriptors (HTD). Finally, hand poses are classified based on proximity measurement by considering the intra-class abduction and adduction and/or inter-class flexion variations. Experimental results show the efficacy of our proposed hand pose recognition system. The system achieved a 99% recognition rate for one-hand poses and a 97% recognition rate for two-hand poses.

Introduction

The use of the human hand as a natural interface for human–computer interaction (HCI) motivates research on hand gesture recognition. Vision-based hand gesture recognition involves the visual analysis of hand shape, position, and/or movement. Accordingly, the basic aim of gesture recognition research is to build a system that can identify and interpret human gestures automatically. Such a system can be used for manipulation, such as controlling robots or other devices without any physical contact between the human and the interface. It can also be used for communication, such as conveying information through sign language [1], [2]. A sign language recognition system must be able to recognize the changing poses of the hand.

Various methods have been proposed for recognizing accurate hand poses. However, because an articulated model of the hand has many degrees of freedom, detecting finger movements remains a challenge. Current solutions rely either on data gloves or on computer vision. Data gloves are made cumbersome by the many cables connecting them to the computer, which can render human–computer interaction unnatural. This awkwardness is overcome by vision-based, noncontact interaction techniques.

Vision-based hand gesture recognition follows two major approaches, namely appearance-based and model-based representations. For hand pose analysis, model-based methods are the most suitable because they provide accurate estimation of hand parameters without loss of spatial information [1]. Many methods for hand pose detection use geometrical properties of the hand to model the location and movement of the fingers [3], [4], [5]. Hand poses are analyzed by applying physiological constraints on hand kinematics and dynamics. These constraints include joint-angle limits on the extension, flexion, adduction, and abduction of the metacarpophalangeal (MP) joints, and they determine the types of movements the hand can make. Matsumoto et al. implemented this kind of approach with a skeletal hand model [6], using a voxel model with an estimation algorithm to recognize different hand poses with a multi-perspective camera system.
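
To make the constraint idea concrete, the sketch below checks a candidate finger pose against static joint-angle limits and the commonly assumed DIP/PIP coupling. The numeric ranges and the coupling rule are illustrative approximations from the hand-modeling literature, not values taken from the cited papers.

```python
# A hedged sketch of static joint-angle constraints of the kind described
# above. The limits and the DIP = (2/3) * PIP coupling are common
# approximations from the literature, not the cited papers' exact values.

MP_FLEXION_RANGE = (0.0, 90.0)      # metacarpophalangeal flexion, degrees
MP_ABDUCTION_RANGE = (-15.0, 15.0)  # abduction/adduction, degrees
PIP_FLEXION_RANGE = (0.0, 110.0)    # proximal interphalangeal flexion

def is_valid_finger_pose(mp_flex, mp_abd, pip_flex, dip_flex, tol=5.0):
    """Check static limits plus the usual DIP-follows-PIP coupling."""
    in_range = (MP_FLEXION_RANGE[0] <= mp_flex <= MP_FLEXION_RANGE[1]
                and MP_ABDUCTION_RANGE[0] <= mp_abd <= MP_ABDUCTION_RANGE[1]
                and PIP_FLEXION_RANGE[0] <= pip_flex <= PIP_FLEXION_RANGE[1])
    coupled = abs(dip_flex - (2.0 / 3.0) * pip_flex) <= tol
    return in_range and coupled

print(is_valid_finger_pose(mp_flex=30, mp_abd=10, pip_flex=45, dip_flex=30))  # True
```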

Some methods incorporate inverse kinematics and three-dimensional (3D) reconstruction techniques to estimate the 3D hand pose from single 2D monocular images. Lee et al. proposed an articulated model with hand kinematic constraints to reconstruct a 3D pose from a monocular view of the hand [7]. They used intra- and inter-finger constraints with 20 degrees of freedom (DOF) for fitting a 3D model to the 2D image, but modeled only static hand poses, handling self-occlusion through physical constraints in the 3D model. Guan et al. also estimated 3D hand pose parameters from a single image, using only 12 DOF with an articulated model of the hand [8]. They used eight 2D projected features derived from the geometrical properties of the hand to retrieve the 3D hand pose. However, they modeled the 3D pose only from different views, without considering simultaneous adduction/abduction and flexion movements of the fingers. Weng et al. developed a real-time motion capturing system using a 3D hand model with a state-based particle filter to estimate the motion of individual fingers [9]. But tracking fingers is computationally expensive given their many degrees of freedom. Most model-based gesture recognition systems use 3D articulated hand models to characterize the parameters of the hand, viz., all joint angles, fingertip positions, and their orientations [10], [11]. These methods fit the 3D model by minimizing a cost function based on extracted features.
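
The following sketch shows that general fit-by-minimization recipe on a toy problem: a planar two-link chain stands in for an articulated hand model, and its joint angles are recovered by minimizing the squared distance between model points and extracted image features. Link lengths, the feature values, and the optimizer choice are illustrative assumptions, not any cited paper's implementation.

```python
# A toy sketch of model fitting by cost minimization: a planar two-link
# "finger" stands in for a full articulated 3D hand model.
import numpy as np
from scipy.optimize import minimize

LINK_LENGTHS = (40.0, 30.0)  # assumed phalanx lengths, in pixels

def forward_points(angles):
    """Joint positions of a planar 2-DOF chain for joint angles in radians."""
    a1, a2 = angles
    p1 = LINK_LENGTHS[0] * np.array([np.cos(a1), np.sin(a1)])
    p2 = p1 + LINK_LENGTHS[1] * np.array([np.cos(a1 + a2), np.sin(a1 + a2)])
    return np.stack([p1, p2])

def cost(angles, observed):
    """Sum of squared distances between model points and image features."""
    return float(np.sum((forward_points(angles) - observed) ** 2))

observed = np.array([[38.0, 14.0], [60.0, 35.0]])  # stand-in extracted features
fit = minimize(cost, x0=[0.0, 0.5], args=(observed,), method="Nelder-Mead")
print("estimated joint angles (rad):", fit.x)
```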

Estimating kinematic parameters and performing 3D reconstruction makes these algorithms computationally complex. In addition, the use of multiple cameras and depth sensors increases the overall system complexity [12], [13], [14]. Most existing methods segment the hand by skin color. Skin color offers an effective and efficient way to segment out hand regions, but its performance degrades under variations in skin tone, lighting conditions, and dynamic scenes.
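
For concreteness, a minimal sketch of such a skin-color baseline is given below. The YCrCb thresholds are widely used rule-of-thumb values, and fixed ranges of this kind are exactly what degrades under skin-tone and lighting variation.

```python
# A minimal sketch of the skin-color segmentation baseline discussed above,
# using rule-of-thumb YCrCb thresholds (illustrative, not from this paper).
import cv2
import numpy as np

def skin_mask(bgr_frame):
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Morphological opening suppresses small false positives.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```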

To address some of these issues, we use the concept of object-based video abstraction for segmenting the frames into video object planes (VOPs), as used in MPEG-4, where the hand is considered as a video object (VO). A binary model for the moving hand is derived and is used for tracking in subsequent frames. The Hausdorff tracker is used for this purpose [15].
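
A minimal sketch of the matching step behind a Hausdorff tracker is shown below: the binary hand model's edge points are shifted over a small search window, and the translation minimizing the directed Hausdorff distance to the frame's edge points is kept. This illustrates the idea in [15], not the paper's implementation; the search radius is an assumption.

```python
# Hausdorff-distance model matching over a small translation window.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def best_translation(model_pts, edge_pts, search=8):
    """model_pts, edge_pts: (N, 2) arrays of (row, col) edge coordinates."""
    best_d, best_shift = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            d = directed_hausdorff(model_pts + np.array([dy, dx]), edge_pts)[0]
            if d < best_d:
                best_d, best_shift = d, (dy, dx)
    return best_shift, best_d
```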

A notable advantage of our proposed scheme is its robustness to background noise. The tracker can track the hand as an object very efficiently even without any kind of background filtering. Moreover, unlike tracking algorithms that use Kalman filters, the VOP generation algorithm does not require extra computation for scaling and rotation. In our algorithm, the concept of “shape change” can accommodate both the scaling and the rotation of the tracked video object in successive frames of the gesture video sequence. The only computation required for the shape change is the model update in each frame of the video sequence, and updating the model using the motion vector is computationally much simpler than the affine transformations otherwise required for scaling and rotation.
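
The sketch below illustrates one plausible form of that per-frame update, under stated assumptions: the old model is translated by the motion vector, and the new model is taken from the frame's edges inside a small band around the shifted shape, so scale and rotation changes are absorbed without an explicit affine transform. This is our reading of the update step, with an illustrative band width.

```python
# One plausible per-frame model update: translate by the motion vector, then
# keep frame edges near the shifted shape ("shape change" without affine steps).
import cv2
import numpy as np

def update_model(old_model, frame_edges, motion_vec, band=5):
    """old_model, frame_edges: binary uint8 images; motion_vec: (dy, dx)."""
    dy, dx = motion_vec
    translated = np.roll(np.roll(old_model, dy, axis=0), dx, axis=1)
    # Keep frame edges falling within `band` pixels of the old shape.
    size = 2 * band + 1
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (size, size))
    neighborhood = cv2.dilate(translated, kernel)
    return cv2.bitwise_and(frame_edges, neighborhood)
```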

Subsequently, we propose a method to recognize hand poses from 2D images by modeling abduction, adduction, and/or flexion movements of the fingers. The two types of hand poses (those varying in abduction/adduction and those varying in flexion) are illustrated in Fig. 1.

In earlier work, we proposed a method for recognizing abduction and adduction movements of the fingers [16], in which hand poses involving only these movements are modeled by a multidimensional Gaussian distribution. In this paper, another model is proposed to recognize flexion movements of the fingers, and it is integrated with our earlier model to recognize different hand poses. Hand poses involving only flexion movements of the fingers are modeled by homogeneous texture descriptors (HTD) [17]. Representing flexion movements by analyzing only the texture of the fingers is simple and straightforward compared with 3D model-based methods. Our approach has an advantage over previous methods in that we can model both the abduction/adduction and the flexion movements of the fingers together from 2D monocular images. Finally, a proximity measure is used to classify the input gestures by comparing the input features with the templates in the database. The overall block diagram of the proposed system is illustrated in Fig. 2.
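
As a sketch of this classification step, the snippet below models each pose class by a multivariate Gaussian over its feature vectors and assigns an input to the nearest class under the Mahalanobis distance, one natural choice of proximity measure; the paper's exact measure is not pinned down in this excerpt.

```python
# Gaussian class models plus nearest-class (Mahalanobis) assignment.
import numpy as np

class GaussianPoseModel:
    def fit(self, features):
        """features: (n_samples, n_dims) geometric feature vectors."""
        self.mean = features.mean(axis=0)
        self.cov_inv = np.linalg.inv(np.cov(features, rowvar=False))
        return self

    def distance(self, x):
        d = x - self.mean
        return float(np.sqrt(d @ self.cov_inv @ d))

def classify(x, models):
    """models: dict mapping pose label -> fitted GaussianPoseModel."""
    return min(models, key=lambda label: models[label].distance(x))
```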

The organization of the rest of the paper is as follows. Section 2 and Section 3 present our proposed hand pose recognition approach for abduction/adduction finger movements and for flexion movements, respectively. Section 4 reports experimental results. Finally, we draw our conclusion in Section 5.

Section snippets

Proposed hand pose recognition scheme for abduction finger movements

In our proposed method, a user-specific hand model is obtained via a series of image segmentation and morphological operations. The system uses the model to determine the user's hand pose. The hand is first separated from the forearm, and some key geometric features of pre-defined gestures are obtained after hand calibration. These features are subsequently modeled as a Gaussian distribution to account for spatiotemporal variations of the finger positions during gesticulation. Based on the …
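
As an illustration of the kind of geometric feature involved, the sketch below computes abduction angles between adjacent fingers from a palm center and ordered fingertip points. Extracting those landmarks from the segmented hand is assumed to happen upstream, and the coordinates shown are made up.

```python
# Abduction angles between adjacent fingers, from palm center and fingertips.
import numpy as np

def abduction_angles(palm_center, fingertips):
    """Angles (degrees) between rays from the palm center to adjacent
    fingertips; `fingertips` is ordered thumb to little finger."""
    rays = np.asarray(fingertips, float) - np.asarray(palm_center, float)
    angles = np.arctan2(rays[:, 1], rays[:, 0])
    return np.degrees(np.abs(np.diff(angles)))

tips = [(50, 120), (85, 150), (110, 155), (135, 150), (165, 125)]
print(abduction_angles(palm_center=(110, 60), fingertips=tips))
```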

Proposed hand pose recognition scheme for flexion finger movements

As explained earlier, it is quite difficult to estimate the flexion angles from a 2D image, as this involves estimating finger motion along the optical axis of the camera. One possible solution is 3D modeling of the fingers to estimate all the finger movements, but model-based methods are computationally complex. We therefore propose a novel scheme that determines the flexion angle variations by analyzing the texture of the projected region of the fingers in the 2D image plane. So, instead of …
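
A minimal sketch of an HTD-style texture feature is given below, in the spirit of the MPEG-7 descriptor [17]: the mean and deviation of the patch's energy in a bank of Gabor channels (six orientations at five scales). The filter parameters are illustrative assumptions, and only the fingers' projected region would be passed in.

```python
# HTD-style Gabor-bank texture feature: 6 orientations x 5 scales,
# mean and deviation of the response per channel (parameters assumed).
import cv2
import numpy as np

def htd_like_descriptor(patch, orientations=6, scales=5):
    gray = np.float32(patch)          # grayscale finger region
    feats = []
    for s in range(scales):
        lambd = 4.0 * (2 ** s)        # assumed wavelength progression
        for o in range(orientations):
            theta = o * np.pi / orientations
            kern = cv2.getGaborKernel((31, 31), 4.0, theta, lambd, 0.5, 0)
            resp = cv2.filter2D(gray, cv2.CV_32F, kern)
            feats += [float(np.abs(resp).mean()), float(resp.std())]
    return np.asarray(feats)          # 2 * scales * orientations values
```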

Experimental results

To validate the proposed technique, several hand poses with important variations of the hand configuration were considered. We judiciously selected the hand poses so that they could be used as pointing gestures for HCI interfaces. The proposed system was tested in real time on an Intel Core i3-based personal computer. The input images were captured by a CCD camera at a resolution of 640×480 pixels. Our dataset is a challenging real-life dataset collected against cluttered backgrounds. Besides, for …

Conclusion

Static hand gesture recognition is a highly challenging task due to the many degrees of freedom of the hand's kinematic structure. In this paper, we proposed a novel hand pose recognition method to support vision-based interfaces.

In our proposed method, MPEG-4 based video object extraction is used for finding different hand positions and shapes from the gesture video sequence. The VOP-based segmentation method does not require rotation and scaling of the object being segmented. The shape …

References (20)

  • Ali Erol, et al., Vision-based hand pose estimation: a review, Comput. Vis. Image Understand. (2007)
  • Martin de La Gorce, et al., A variational approach to monocular hand-pose estimation, Comput. Vis. Image Understand. (2010)
  • Vladimir I. Pavlovic, et al., Visual interpretation of hand gestures for human–computer interaction, IEEE Trans. Pattern Anal. Mach. Intell. (1997)
  • Jintae Lee, et al., Model-based analysis of hand posture, IEEE Comput. Graph. Appl. (1995)
  • Dung Duc Nguyen, Thien Cong Pham, Jae Wook Jeon, Finger extraction from scene with grayscale morphology and BLOB...
  • Sung Kwan Kang, Mi Young Nam, Phill Kyu Rhee, Color-based hand and finger detection technology for user interaction,...
  • Etsuko Ueda, et al., A hand-pose estimation for vision-based human interfaces, IEEE Trans. Ind. Electron. (2003)
  • Sung Uk Lee, Isaac Cohen, 3D hand reconstruction from a monocular view, in: Proceedings of the IEEE International...
  • Haiying Guan, Chin-Seng Chua, Yeong-Khing Ho, 3D hand pose retrieval from a single 2D image, in: Proceedings of the...
  • Chingyu Weng, Chuanyu Tseng, Chunglin Huang, A vision-based hand motion parameter capturing for HCI, in: Proc. IEEE...


This paper has been recommended for acceptance by Shi-Kuo Chang.
