Proposal-Based Graph Attention Networks for Workflow Detection

Abstract

In the era of “Industry 4.0”, video analysis plays a vital role in a variety of industrial applications, and video-based action detection has achieved promising performance in the computer vision community. However, in complex factory environments, detecting the workflow of both machines and workers during the production process remains unresolved. To address this issue, we propose generic proposal-based Graph Attention Networks for workflow detection. Specifically, an efficient and effective action proposal method is first employed to generate workflow proposals. These proposals and their relations are then exploited to construct a proposal graph. Two types of relationships are considered for identifying workflow phases: contextual relations, which capture context information, and surrounding relations, which characterize the correlations between different workflow instances. To improve recognition accuracy, within-category and between-category attention are incorporated to learn long-range and dynamic dependencies, respectively, greatly enhancing the feature representations used for workflow detection. Experimental results show that the proposed approach considerably improves upon the state of the art on THUMOS’14 and a practical workflow dataset, achieving 6.7% and 3.9% absolute improvement, respectively, over the advanced GTAN detector at a tIoU threshold of 0.4. Moreover, further experiments on ActivityNet1.3 verify that modeling the relationships between workflow proposals is effective in improving performance.
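To make the pipeline concrete, the following is a minimal PyTorch sketch of the two steps the abstract describes: building a proposal graph from temporal relations and refining proposal features with graph attention. It is an illustrative reconstruction, not the authors' implementation; the tIoU-based edge rules, the surround_ratio parameter, and the ProposalGATLayer module are all assumptions introduced here for clarity.

import torch
import torch.nn as nn
import torch.nn.functional as F

def temporal_iou(a, b):
    # tIoU between intervals a = (start, end) and b = (start, end).
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def build_proposal_graph(proposals, surround_ratio=0.25):
    # Contextual edge: the two proposals overlap in time (tIoU > 0).
    # Surrounding edge: they are disjoint, but one falls inside the
    # other's temporally extended span (a hypothetical rule).
    n = len(proposals)
    adj = torch.eye(n)  # self-loops
    for i in range(n):
        s, e = proposals[i]
        ext = (s - surround_ratio * (e - s), e + surround_ratio * (e - s))
        for j in range(n):
            if i == j:
                continue
            if temporal_iou(proposals[i], proposals[j]) > 0:
                adj[i, j] = 1.0  # contextual relation
            elif temporal_iou(ext, proposals[j]) > 0:
                adj[i, j] = 1.0  # surrounding relation
    return adj

class ProposalGATLayer(nn.Module):
    # One graph-attention layer over proposal features (hypothetical).
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        h = self.W(x)                                   # (N, out_dim)
        n = h.size(0)
        # Attention logits for every ordered pair of proposals.
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))     # (N, N)
        e = e.masked_fill(adj == 0, float('-inf'))      # keep graph edges only
        alpha = torch.softmax(e, dim=1)                 # normalize over neighbors
        return F.elu(alpha @ h)                         # aggregated features

# Example: four proposals (start, end) with 16-d features.
proposals = [(0.0, 2.0), (1.5, 3.0), (5.0, 7.0), (6.5, 9.0)]
features = torch.randn(4, 16)
adj = build_proposal_graph(proposals)
refined = ProposalGATLayer(16, 16)(features, adj)       # shape (4, 16)

A complete model would use separate within-category and between-category attention heads as described in the abstract; the single layer above only illustrates the basic attention-weighted aggregation over the proposal graph.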

References

  1. Jalal A, Kamal S, Kim DS (2018) Detecting complex 3D human motions with body model low-rank representation for real-time smart activity monitoring system. KSII Trans Internet Inf Syst 12(3)

  2. Chen Y, Zhao D, Lv L et al (2018) Multi-task learning for dangerous object detection in autonomous driving. Inform Sci 432:559–571

  3. Salazar DRC, Maldonado CBG, Alvarado HFG et al (2018) Patterns for semantic human behavior analysis. In: Iberian Conference on Information Systems and Technologies (CISTI), pp 1–5

  4. Voulodimos A, Kosmopoulos D, Vasileiou G et al (2011) A dataset for workflow recognition in industrial scenes. In: IEEE International Conference on Image Processing, pp 3249–3252

  5. Voulodimos A, Kosmopoulos D, Vasileiou G et al (2012) A threefold dataset for activity and workflow recognition in complex industrial environments. IEEE MultiMedia 19(3):42–52

  6. Li Z, Hu H, Hu H et al (2018) Multi-objective scheduling for scientific workflow in multicloud environment. J Netw Comput Appl 114:108–122

  7. Li Z, Ge J, Hu H et al (2015) Cost and energy aware scheduling algorithm for scientific workflows with deadline constraint in clouds. IEEE Trans Serv Comput 11(4):713–726

  8. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1933–1941

  9. Karpathy A, Toderici G, Shetty S et al (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp 1725–1732

  10. Shou Z, Wang D, Chang SF (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1049–1058

  11. Chao YW, Vijayanarasimhan S, Seybold B et al (2018) Rethinking the faster r-cnn architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1130–1139

  12. Protopapadakis EE, Doulamis AD, Doulamis ND (2013) Tapped delay multiclass support vector machines for industrial workflow recognition. In: International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pp 1–4

  13. Veres G, Grabner H, Middleton L et al (2010) Automatic workflow monitoring in industrial environments. In: Asian Conference on Computer Vision, pp 200–213

  14. Zolfaghari M, Singh K, Brox T (2018) Eco: Efficient convolutional network for online video understanding. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 695–712

  15. Li C, Zhong Q, Xie D (2019) Collaborative Spatiotemporal Feature Learning for Video Action Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7872–7881

  16. Pinhanez CS, Bobick AF (1997) Intelligent studios modeling space and action to control tv cameras. Appl Art Intell 11(4):285–305

  17. Koile K, Tollmar K, Demirdjian D et al (2003) Activity zones for context-aware computing. In: International Conference on Ubiquitous Computing, pp 90–106

  18. Xiang T, Gong S (2008) Optimising dynamic graphical models for video content analysis. Comput Vision Image Understand 112(3):310–323

  19. Vu VT, Brémond F, Thonnat M (2003) Automatic video interpretation: a novel algorithm for temporal scenario recognition. In: International joint conference on artificial intelligence, pp 1295–1300

  20. Shi Y, Bobick A, Essa I (2006) Learning temporal sequence model from partially labeled data. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp 1631–1638

  21. Jin Y, Dou Q, Chen H et al (2017) SV-RCNet: workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans Med Imag 37(5):1114–1126

  22. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. arXiv:1406.2199

  23. Ji S, Xu W, Yang M (2012) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231

  24. Ma S, Sigal L, Sclaroff S (2016) Learning activity progression in lstms for activity detection and early detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1942–1950

  25. Yu G, Yuan J (2015) Fast action proposals for human action detection and search. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1302–1311

  26. Caba Heilbron F, Carlos Niebles J, Ghanem B (2016) Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1914–1923

  27. Escorcia V, Heilbron FC, Niebles JC et al (2016) Daps: Deep action proposals for action understanding. In: European Conference on Computer Vision, pp 768–784

  28. Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv:1609.02907

  29. Tan M, Shi Q, van den Hengel A et al (2015) Learning graph structure for multi-label image classification via clique generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4100–4109

  30. Wang X, Gupta A (2018) Videos as space-time region graphs. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 399–417

  31. Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence

  32. Zhao L, Peng X, Tian Y et al (2019) Semantic Graph Convolutional Networks for 3D Human Pose Regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3425–3435

  33. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv:1409.0473

  34. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. arXiv:1706.03762

  35. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057

  36. Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Tang X (2017) Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164

  37. Wang L, Huang Y, Hou Y, Zhang S, Shan J (2019) Graph attention convolution for point cloud semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10296–10305

  38. Yu J, Tao D, Wang M, Rui Y (2014) Learning to rank using user clicks and visual features for image retrieval. IEEE Trans Cybern 45(4):767–779

  39. Hong C, Yu J, Tao D, Wang M (2014) Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval. IEEE Trans Ind Electron 62(6):3742–3751

  40. Hong C, Yu J, Wan J, Tao D, Wang M (2015) Multimodal deep autoencoder for human pose recovery. IEEE Trans Image Process 24(12):5659–5670

  41. Hong C, Yu J, Zhang J, Jin X, Lee KH (2018) Multimodal face-pose estimation with multitask manifold deep learning. IEEE Trans Ind Inform 15(7):3952–3961

  42. Yu J, Tan M, Zhang H, Tao D, Rui Y (2019) Hierarchical deep click feature prediction for fine-grained image recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2019.2932058

  43. Carreira J, Zisserman A (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6299–6308

  44. Dai X, Singh B, Zhang G et al (2017) Temporal context network for activity localization in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp 5793–5802

  45. Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y (2017) Graph attention networks. arXiv:1710.10903

  46. Zhao Y, Xiong Y, Wang L, et al (2017) Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp 2914–2923

  47. Hamilton WL, Ying R, Leskovec J (2017) Inductive representation learning on large graphs. arXiv:1706.02216

  48. Jiang YG, Liu J, Roshan Zamir A, Toderici G, Laptev I, Shah M, Sukthankar R (2014) THUMOS challenge: action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/

  49. Zhang L, Wang QW (2018) XIOLIFT database. https://pan.baidu.com/s/lySILNURWD-N40q5TpAvGKUA

  50. Caba Heilbron F, Escorcia V, Ghanem B et al (2015) Activitynet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 961–970

  51. Lin T, Zhao X, Su H et al (2018) Bsn: Boundary sensitive network for temporal action proposal generation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 3–19

  52. Li J, Liu X, Zong Z, Zhao W, Zhang M, Song J (2020) Graph Attention Based Proposal 3D ConvNets for Action Detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 4626–4633

  53. Oneata D, Verbeek J, Schmid C (2014) The LEAR submission at THUMOS 2014. https://hal.inria.fr/hal-01074442/

  54. Richard A, Gall J (2016) Temporal action detection using a statistical language model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3131–3140

  55. Shou Z, Chan J, Zareian A et al (2017) Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5734–5743

  56. Yuan Z, Stroud JC, Lu T et al (2017) Temporal action localization by structured maximal sums. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3684–3692

  57. Long F, Yao T, Qiu Z, Tian X, Luo J, Mei T (2019) Gaussian temporal awareness networks for action localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 344–353

  58. Wang L, Xiong Y, Lin D (2017) Untrimmednets for weakly supervised action recognition and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4325–4334

  59. Oyedotun OK, Aouada D (2020) Why do deep neural networks with skip connections and concatenated hidden representations work? In: International Conference on Neural Information Processing, pp 380–392

Acknowledgements

This research is based upon work partially supported by the National Natural Science Foundation of China (Grant nos. 61572251, 61572162, 61702144 and 61802095), the Natural Science Foundation of Zhejiang Province (LQ17F020003), and the Key Science and Technology Project Foundation of Zhejiang Province (2018C01012).

Author information

Corresponding author

Correspondence to Haiyang Hu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, M., Hu, H., Li, Z. et al. Proposal-Based Graph Attention Networks for Workflow Detection. Neural Process Lett 54, 101–123 (2022). https://doi.org/10.1007/s11063-021-10622-7
