Abstract
In the era of Industry 4.0, video analysis plays a vital role in a variety of industrial applications, and video-based action detection has achieved promising performance in the computer vision community. However, in complex factory environments, detecting the workflow of both machines and workers during production remains an open problem. To address this issue, we propose generic proposal-based Graph Attention Networks for workflow detection. Specifically, an efficient and effective action proposal method is first employed to generate workflow proposals. These proposals and their relations are then exploited to construct a proposal graph. Two types of relationships are considered for identifying workflow phases: contextual relations, which capture context information, and surrounding relations, which characterize the correlations between different workflow instances. To improve recognition accuracy, within-category and between-category attention are incorporated to learn long-range and dynamic dependencies, respectively, greatly enhancing the feature representation for workflow detection. Experimental results show that the proposed approach considerably improves upon the state of the art on THUMOS'14 and a practical workflow dataset, achieving 6.7% and 3.9% absolute improvement over the advanced GTAN detector at a tIoU threshold of 0.4, respectively. Moreover, additional experiments on ActivityNet1.3 confirm the effectiveness of modeling workflow proposal relationships.
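To make the pipeline described above concrete, the following is a minimal NumPy sketch of the two core steps: building a graph over temporal action proposals and applying a graph-attention layer (in the style of Veličković et al., reference [49]) to refine their features. This is not the authors' implementation: the function names are hypothetical, edges here are formed only from temporal overlap (the paper distinguishes contextual and surrounding relations), and a single attention head with random parameters stands in for the trained within-/between-category attention.

```python
import numpy as np

def temporal_iou(a, b):
    """Temporal IoU between two proposals given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def build_proposal_graph(segments, iou_thresh=0.1):
    """Adjacency over proposals: connect pairs whose segments overlap.
    Self-loops are kept so each node attends to its own feature."""
    n = len(segments)
    adj = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            if temporal_iou(segments[i], segments[j]) > iou_thresh:
                adj[i, j] = adj[j, i] = 1.0
    return adj

def gat_layer(feats, adj, W, a, leaky=0.2):
    """One graph-attention layer: project features, score each edge with
    e_ij = LeakyReLU(a^T [h_i || h_j]), softmax over neighbours, aggregate."""
    h = feats @ W                                  # (n, d') projected features
    n = h.shape[0]
    e = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            z = np.concatenate([h[i], h[j]]) @ a
            e[i, j] = z if z > 0 else leaky * z    # LeakyReLU
    e = np.where(adj > 0, e, -1e9)                 # mask non-neighbours
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)      # row-wise softmax
    return alpha @ h                               # attention-weighted sum

# Usage: three proposals, the first two overlap, the third is isolated.
segs = [(0.0, 2.0), (1.0, 3.0), (10.0, 12.0)]
adj = build_proposal_graph(segs)
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
out = gat_layer(feats, adj, W=rng.normal(size=(4, 5)), a=rng.normal(size=10))
```

In the full model, `W` and `a` would be learned end-to-end and separate attention mechanisms would operate within and between workflow categories; the sketch only illustrates the message-passing structure over proposal nodes.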
References
Jalal A, Kamal S, Kim DS (2018) Detecting complex 3D human motions with body model low-rank representation for real-time smart activity monitoring system. KSII Trans Internet Inform Syst 12(3)
Chen Y, Zhao D, Lv L et al (2018) Multi-task learning for dangerous object detection in autonomous driving. Inform Sci 432:559–571
SalazarAutores DRC, Maldonado CBG, Alvarado HFG et al (2018) Patterns for semantic human behavior analysis. In: Iberian Conference on Information Systems and Technologies (CISTI), pp 1–5
Voulodimos A, Kosmopoulos D, Vasileiou G et al (2011) A dataset for workflow recognition in industrial scenes. In: IEEE International Conference on Image Processing, pp 3249–3252
Voulodimos A, Kosmopoulos D, Vasileiou G et al (2012) A threefold dataset for activity and workflow recognition in complex industrial environments. IEEE MultiMedia 19(3):42–52
Li Z, Hu H, Hu H et al (2018) Multi-objective scheduling for scientific workflow in multicloud environment. J Netw Comput Appl 114:108–122
Li Z, Ge J, Hu H et al (2015) Cost and energy aware scheduling algorithm for scientific workflows with deadline constraint in clouds. IEEE Trans Serv Comput 11(4):713–726
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1933–1941
Karpathy A, Toderici G, Shetty S et al (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp 1725–1732
Shou Z, Wang D, Chang SF (2016) Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1049–1058
Chao YW, Vijayanarasimhan S, Seybold B et al (2018) Rethinking the faster r-cnn architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1130–1139
Protopapadakis EE, Doulamis AD, Doulamis ND (2013) Tapped delay multiclass support vector machines for industrial workflow recognition. In: International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pp 1–4
Veres G, Grabner H, Middleton L et al (2010) Automatic workflow monitoring in industrial environments. In: Asian Conference on Computer Vision, pp 200–213
Zolfaghari M, Singh K, Brox T (2018) ECO: Efficient convolutional network for online video understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 695–712
Li C, Zhong Q, Xie D (2019) Collaborative Spatiotemporal Feature Learning for Video Action Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7872–7881
Pinhanez CS, Bobick AF (1997) Intelligent studios: modeling space and action to control TV cameras. Appl Artif Intell 11(4):285–305
Koile K, Tollmar K, Demirdjian D et al (2003) Activity zones for context-aware computing. In: International Conference on Ubiquitous Computing, pp 90–106
Xiang T, Gong S (2008) Optimising dynamic graphical models for video content analysis. Comput Vision Image Understand 112(3):310–323
Vu VT, Brémond F, Thonnat M (2003) Automatic video interpretation: a novel algorithm for temporal scenario recognition. In: International joint conference on artificial intelligence, pp 1295–1300
Shi Y, Bobick A, Essa I (2006) Learning temporal sequence model from partially labeled data. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp 1631–1638
Jin Y, Dou Q, Chen H et al (2017) SV-RCNet: workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans Med Imag 37(5):1114–1126
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. arXiv:1406.2199
Ji S, Xu W, Yang M (2012) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
Ma S, Sigal L, Sclaroff S (2016) Learning activity progression in lstms for activity detection and early detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1942–1950
Yu G, Yuan J (2015) Fast action proposals for human action detection and search. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1302–1311
Caba Heilbron F, Carlos Niebles J, Ghanem B (2016) Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1914–1923
Escorcia V, Heilbron FC, Niebles JC et al (2016) Daps: Deep action proposals for action understanding. In: European Conference on Computer Vision, pp 768–784
Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv:1609.02907
Tan M, Shi Q, van den Hengel A et al (2015) Learning graph structure for multi-label image classification via clique generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4100–4109
Wang X, Gupta A (2018) Videos as space-time region graphs. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 399–417
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence
Zhao L, Peng X, Tian Y et al (2019) Semantic Graph Convolutional Networks for 3D Human Pose Regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3425–3435
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you need. arXiv:1706.03762
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057
Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Tang X (2017) Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164
Wang L, Huang Y, Hou Y, Zhang S, Shan J (2019) Graph attention convolution for point cloud semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10296–10305
Yu J, Tao D, Wang M, Rui Y (2014) Learning to rank using user clicks and visual features for image retrieval. IEEE Trans Cybern 45(4):767–779
Hong C, Yu J, Tao D, Wang M (2014) Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval. IEEE Trans Indus Electron 62(6):3742–3751
Hong C, Yu J, Wan J, Tao D, Wang M (2015) Multimodal deep autoencoder for human pose recovery. IEEE Trans Image Process 24(12):5659–5670
Hong C, Yu J, Zhang J, Jin X, Lee KH (2018) Multimodal face-pose estimation with multitask manifold deep learning. IEEE Trans Indus Inform 15(7):3952–3961
Yu J, Tan M, Zhang H, Tao D, Rui Y (2019) Hierarchical deep click feature prediction for fine-grained image recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2019.2932058
Carreira J, Zisserman A (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6299–6308
Dai X, Singh B, Zhang G et al (2017) Temporal context network for activity localization in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp 5793–5802
Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y (2017) Graph attention networks. arXiv:1710.10903
Zhao Y, Xiong Y, Wang L et al (2017) Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2914–2923
Hamilton WL, Ying R, Leskovec J (2017) Inductive representation learning on large graphs. arXiv:1706.02216
Jiang YG, Liu J, Roshan Zamir A, Toderici G, Laptev I, Shah M, Sukthankar R (2014) THUMOS challenge: action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/
Zhang L, Wang QW (2018) XIOLIFT Database, https://pan.baidu.com/s/lySILNURWD-N40q5TpAvGKUA
Caba Heilbron F, Escorcia V, Ghanem B et al (2015) Activitynet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 961–970
Lin T, Zhao X, Su H et al (2018) Bsn: Boundary sensitive network for temporal action proposal generation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19
Li J, Liu X, Zong Z, Zhao W, Zhang M, Song J (2020) Graph Attention Based Proposal 3D ConvNets for Action Detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 4626–4633
Oneata D, Verbeek J, Schmid C (2014) The LEAR submission at THUMOS 2014. https://hal.inria.fr/hal-01074442/
Richard A, Gall J (2016) Temporal action detection using a statistical language model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3131–3140
Shou Z, Chan J, Zareian A et al (2017) Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5734–5743
Yuan Z, Stroud JC, Lu T et al (2017) Temporal action localization by structured maximal sums. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3684–3692
Long F, Yao T, Qiu Z, Tian X, Luo J, Mei T (2019) Gaussian temporal awareness networks for action localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 344–353
Wang L, Xiong Y, Lin D (2017) Untrimmednets for weakly supervised action recognition and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4325–4334
Oyedotun OK, Aouada D (2020) Why do deep neural networks with skip connections and concatenated hidden representations work? In: International Conference on Neural Information Processing, pp 380–392
Acknowledgements
This research is based upon work partially supported by the National Natural Science Foundation of China (Grants no. 61572251, 61572162, 61702144 and 61802095), the Natural Science Foundation of Zhejiang Province (LQ17F020003), and the Key Science and Technology Project Foundation of Zhejiang Province (2018C01012).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Zhang, M., Hu, H., Li, Z. et al. Proposal-Based Graph Attention Networks for Workflow Detection. Neural Process Lett 54, 101–123 (2022). https://doi.org/10.1007/s11063-021-10622-7