Abstract
The objective of this paper is to recognize gestures in videos – both localizing the gesture and classifying it into one of multiple classes.
We show that the performance of a gesture classifier learnt from a single (strongly supervised) training example can be boosted significantly using a ‘reservoir’ of weakly supervised gesture examples (and that the performance exceeds learning from the one-shot example or reservoir alone). The one-shot example and weakly supervised reservoir are from different ‘domains’ (different people, different videos, continuous or non-continuous gesturing, etc.), and we propose a domain adaptation method for human pose and hand shape that enables gesture learning methods to generalise between them. We also show the benefits of using the recently introduced Global Alignment Kernel [12], instead of the standard Dynamic Time Warping that is generally used for time alignment.
The domain adaptation and learning methods are evaluated on two large scale challenging gesture datasets: one for sign language, and the other for Italian hand gestures. In both cases performance exceeds the previous published results, including the best skeleton-classification-only entry in the 2013 ChaLearn challenge.
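To make the time-alignment comparison concrete, here is a minimal NumPy sketch (not the authors' implementation) of the two alternatives the abstract contrasts: standard Dynamic Time Warping, which takes the single best alignment path, and the Global Alignment Kernel of Cuturi et al. [21,22], which sums a soft score over all alignment paths. The Gaussian local kernel and the bandwidth `sigma` are assumptions for illustration; real features would be per-frame pose/hand-shape descriptors rather than scalars.

```python
import numpy as np

def dtw(x, y):
    """Dynamic Time Warping distance: cost of the single best alignment."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three predecessor alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def gak(x, y, sigma=1.0):
    """Global Alignment Kernel: soft sum over *all* alignments (Cuturi's recursion)."""
    n, m = len(x), len(y)
    K = np.zeros((n + 1, m + 1))
    K[0, 0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Gaussian local similarity between frames (an assumed choice here).
            k_local = np.exp(-((x[i - 1] - y[j - 1]) ** 2) / (2 * sigma ** 2))
            # Accumulate over all alignment paths, not just the best one.
            K[i, j] = k_local * (K[i - 1, j] + K[i, j - 1] + K[i - 1, j - 1])
    return K[n, m]
```

Because GAK aggregates over every alignment, it yields a positive-definite kernel suitable for SVM training, whereas the min in DTW generally does not.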
References
Ali, S., Shah, M.: Human action recognition in videos using kinematic features and multiple instance learning. IEEE PAMI 32(2), 288–303 (2010)
Baisero, A., Pokorny, F.T., Kragic, D., Ek, C.: The path kernel. In: ICPRAM (2013)
Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., Sivic, J.: Finding actors and actions in movies. In: Proc. ICCV (2013)
Books, M.: The standard dictionary of the British sign language. DVD (2005)
Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: Proc. ICCV (2001)
Bristol Centre for Deaf Studies: Signstation, http://www.signstation.org (accessed March 1, 2014)
Buehler, P., Everingham, M., Zisserman, A.: Learning sign language by watching TV (using weakly aligned subtitles). In: Proc. CVPR (2009)
Chai, X., Li, G., Lin, Y., Xu, Z., Tang, Y., Chen, X., Zhou, M.: Sign language recognition and translation with Kinect. In: Proc. Int. Conf. Autom. Face and Gesture Recog. (2013)
Charles, J., Pfister, T., Everingham, M., Zisserman, A.: Automatic and efficient human pose estimation for sign language videos. IJCV (2013)
Charles, J., Pfister, T., Magee, D., Hogg, D., Zisserman, A.: Domain adaptation for upper body pose tracking in signed TV broadcasts. In: Proc. BMVC (2013)
Cooper, H., Bowden, R.: Learning signs from subtitles: A weakly supervised approach to sign language recognition. In: Proc. CVPR (2009)
Cuturi, M.: Fast global alignment kernels. In: ICML (2011)
Cuturi, M., Vert, J., Birkenes, Ø., Matsui, T.: A kernel for time series based on global alignments. In: ICASSP (2007)
Duchenne, O., Laptev, I., Sivic, J., Bach, F., Ponce, J.: Automatic annotation of human actions in video. In: Proc. CVPR (2009)
Escalera, S., Gonzàlez, J., Baró, X., Reyes, M., Guyon, I., Athitsos, V., Escalante, H., Sigal, L., Argyros, A., Sminchisescu, C.: Chalearn multi-modal gesture recognition 2013: grand challenge and workshop summary. In: ACM MM (2013)
Fanello, S., Gori, I., Metta, G., Odone, F.: Keep it simple and sparse: real-time action recognition. J. Machine Learning Research 14(1), 2617–2640 (2013)
Farhadi, A., Forsyth, D., White, R.: Transfer learning in sign language. In: Proc. CVPR (2007)
Gaidon, A., Harchaoui, Z., Schmid, C.: A time series kernel for action recognition. In: Proc. BMVC (2011)
Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H., Hamner, B.: Results and analysis of the ChaLearn gesture challenge 2012. In: Proc. ICPR (2013)
Guyon, I., Athitsos, V., Jangyodsuk, P., Hamner, B., Escalante, H.: ChaLearn gesture challenge: Design and first results. In: CVPR Workshops (2012)
Hariharan, B., Malik, J., Ramanan, D.: Discriminative decorrelation for clustering and classification. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part IV. LNCS, vol. 7575, pp. 459–472. Springer, Heidelberg (2012)
Ke, Y., Sukthankar, R., Hebert, M.: Event detection in crowded videos. In: Proc. ICCV (2007)
Kelly, D., McDonald, J., Markham, C.: Weakly supervised training of a sign language recognition system using multiple instance learning density matrices. Trans. Systems, Man, and Cybernetics 41(2), 526–541 (2011)
Krishnan, R., Sarkar, S.: Similarity measure between two gestures using triplets. In: CVPR Workshops (2013)
Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object detection and beyond. In: Proc. ICCV (2011)
Nayak, S., Duncan, K., Sarkar, S., Loeding, B.: Finding recurrent patterns from continuous sign language sentences for automated extraction of signs. J. Machine Learning Research 13(1), 2589–2615 (2012)
Pfister, T., Charles, J., Everingham, M., Zisserman, A.: Automatic and efficient long term arm and hand tracking for continuous sign language TV broadcasts. In: Proc. BMVC (2012)
Pfister, T., Charles, J., Zisserman, A.: Large-scale learning of sign language by watching TV (using co-occurrences). In: Proc. BMVC (2013)
Rother, C., Kolmogorov, V., Blake, A.: Grabcut: interactive foreground extraction using iterated graph cuts. In: Proc. ACM SIGGRAPH (2004)
Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing (1978)
Sakoe, H., Chiba, S.: A similarity evaluation of speech patterns by dynamic programming. In: Nat. Meeting of Institute of Electronic Communications Engineers of Japan (1970)
Shimodaira, H., Noma, K., Nakai, M., Sagayama, S.: Dynamic time-alignment kernel in support vector machine. In: NIPS (2001)
Wan, J., Ruan, Q., Li, W., Deng, S.: One-shot learning gesture recognition from RGB-D data using bag of features. J. Machine Learning Research 14(1), 2549–2582 (2013)
Wu, J., Cheng, J., Zhao, C., Lu, H.: Fusing multi-modal features for gesture recognition. In: ICMI (2013)
Zhou, F., De la Torre, F.: Generalized time warping for multi-modal alignment of human motion. In: Proc. CVPR (2012)
© 2014 Springer International Publishing Switzerland
Cite this paper
Pfister, T., Charles, J., Zisserman, A. (2014). Domain-Adaptive Discriminative One-Shot Learning of Gestures. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8694. Springer, Cham. https://doi.org/10.1007/978-3-319-10599-4_52
DOI: https://doi.org/10.1007/978-3-319-10599-4_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10598-7
Online ISBN: 978-3-319-10599-4