Abstract
Sign language recognition (SLR) is a challenging problem involving complex manual features (i.e., hand gestures) and fine-grained non-manual features (NMFs) (i.e., facial expressions, mouth shapes, etc.). Although manual features are dominant, non-manual features also play an important role in expressing a sign word. In particular, many sign words convey different meanings through their non-manual features even though they share the same hand gestures. This ambiguity poses a great challenge for the recognition of sign words. To tackle this issue, we propose a simple yet effective architecture called the Global-Local Enhancement Network (GLE-Net), comprising two mutually promoted streams that address complementary aspects of SLR: one stream captures the global contextual relationship, while the other captures discriminative fine-grained cues. Moreover, because no existing dataset explicitly focuses on this kind of feature, we introduce the first non-manual-feature-aware isolated Chinese sign language dataset (NMFs-CSL), with a total vocabulary of 1,067 sign words from daily life. Extensive experiments on the NMFs-CSL and SLR500 datasets demonstrate the effectiveness of our method.
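The abstract does not specify how the two streams are combined, so the following is only a minimal, illustrative sketch of the general two-stream idea (one global-context stream, one fine-grained local stream) using late fusion of class scores; the function names, the weighted-softmax fusion rule, and the weight `alpha` are assumptions for illustration, not the authors' actual GLE-Net design.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_two_streams(global_logits, local_logits, alpha=0.5):
    """Late-fuse class scores from a global-context stream and a
    fine-grained local stream via a weighted sum of their softmax
    probabilities -- a common two-stream fusion scheme, used here
    only as a stand-in for the paper's mutual-enhancement design."""
    return alpha * softmax(global_logits) + (1 - alpha) * softmax(local_logits)

# Toy example: a 3-class problem where the streams disagree. The
# fine-grained local stream is more confident, so it can override
# the global stream -- mirroring how NMFs disambiguate sign words
# that share the same hand gestures.
g = np.array([2.0, 1.0, 0.1])   # global stream mildly favours class 0
l = np.array([0.1, 2.5, 0.3])   # local stream strongly favours class 1
probs = fuse_two_streams(g, l)
pred = int(np.argmax(probs))    # -> 1: the fine-grained cue wins
```

Averaging probabilities rather than raw logits keeps the two streams on a comparable scale regardless of how peaked each classifier's outputs are.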
Index Terms
- Global-Local Enhancement Network for NMF-Aware Sign Language Recognition