Abstract
Fully articulated hand tracking promises to enable fundamentally new interactions with virtual and augmented worlds, but the limited accuracy and efficiency of current systems have prevented widespread adoption. Today's dominant paradigm uses machine learning for initialization and recovery, followed by iterative model-fitting optimization to achieve a detailed pose fit. We follow this paradigm but make several changes to the model fitting, namely using: (1) a more discriminative objective function; (2) a smooth-surface model that provides gradients for non-linear optimization; and (3) joint optimization over both the model pose and the correspondences between observed data points and the model surface. While each of these changes may increase the cost per fitting iteration, we find a compensating decrease in the number of iterations. Further, the wide basin of convergence means that fewer starting points are needed for successful model fitting. Our system runs in real time on the CPU alone, which frees up the commonly over-burdened GPU for experience designers. The hand tracker is efficient enough to run on low-power devices such as tablets. We can track hands up to several meters from the camera, providing a large working volume for interaction, even with the noisy data from current-generation depth cameras. Quantitative assessments on standard datasets show that the new approach exceeds the state of the art in accuracy. Qualitative results take the form of live recordings of a range of interactive experiences enabled by this new approach.
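To make item (3) concrete, the following is a minimal sketch of joint optimization over pose and correspondences on a toy problem. It is not the paper's implementation. Everything in it is an assumption made for illustration: the "model surface" is a unit circle, the "pose" is only the circle center `c`, each correspondence is an angle `t_i` on the circle, the loss is a plain squared distance rather than the more discriminative objective the abstract mentions, and the function names `energy`, `lm_step`, and `fit` are hypothetical. Each Levenberg-Marquardt step updates the center and all angles together. Because each angle affects only its own residual, the angles are eliminated with a Schur complement, leaving a 2x2 system for the pose.

```python
import math

def energy(c, ts, pts):
    """Sum of squared distances between data points and their model points."""
    e = 0.0
    for t, (x, y) in zip(ts, pts):
        rx = c[0] + math.cos(t) - x
        ry = c[1] + math.sin(t) - y
        e += rx * rx + ry * ry
    return e

def lm_step(c, ts, pts, lam):
    """One damped Gauss-Newton step over the center and all angles jointly."""
    n = len(pts)
    w = 1.0 / (1.0 + lam)      # inverse of each damped correspondence block
    s00 = s11 = n + lam        # Schur complement S (2x2, symmetric)
    s01 = 0.0
    b0 = b1 = 0.0              # reduced right-hand side
    tangents = []
    for t, (x, y) in zip(ts, pts):
        ct, st = math.cos(t), math.sin(t)
        rx, ry = c[0] + ct - x, c[1] + st - y   # residual
        dx, dy = -st, ct                        # d(model point)/dt, unit tangent
        g = dx * rx + dy * ry                   # gradient w.r.t. this angle
        s00 -= w * dx * dx
        s01 -= w * dx * dy
        s11 -= w * dy * dy
        b0 -= rx - w * g * dx
        b1 -= ry - w * g * dy
        tangents.append((dx, dy, g))
    det = s00 * s11 - s01 * s01
    dc0 = (s11 * b0 - s01 * b1) / det
    dc1 = (s00 * b1 - s01 * b0) / det
    # Back-substitute to get the update for every correspondence.
    new_ts = [t - w * (g + dx * dc0 + dy * dc1)
              for t, (dx, dy, g) in zip(ts, tangents)]
    return (c[0] + dc0, c[1] + dc1), new_ts

def fit(pts, c=(0.0, 0.0), iters=100):
    """Fit the circle center, starting from closest-point correspondences."""
    ts = [math.atan2(y - c[1], x - c[0]) for x, y in pts]
    lam = 1e-3
    e = energy(c, ts, pts)
    for _ in range(iters):
        new_c, new_ts = lm_step(c, ts, pts, lam)
        new_e = energy(new_c, new_ts, pts)
        if new_e < e:          # accept the step and trust the model more
            c, ts, e = new_c, new_ts, new_e
            lam = max(lam * 0.1, 1e-9)
        else:                  # reject the step and damp more strongly
            lam *= 10.0
    return c, ts, e

if __name__ == "__main__":
    # Noise-free points on a unit circle centered at (0.5, -0.3).
    data = [(0.5 + math.cos(2 * math.pi * k / 24),
             -0.3 + math.sin(2 * math.pi * k / 24)) for k in range(24)]
    center, _, final_energy = fit(data)
    print(center, final_energy)
```

An alternating scheme in the style of ICP would instead re-solve for closest points with the pose held fixed, then re-solve for the pose with the correspondences held fixed. In the joint step above, the correspondences slide along the surface as part of the same update that moves the pose, which is the property the abstract credits with reducing the iteration count.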
Index Terms
- Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences