
From constrained to unconstrained datasets: an evaluation of local action descriptors and fusion strategies for interaction recognition

Published in: World Wide Web

Abstract

As an important task in computer vision, interaction recognition has attracted extensive attention due to its wide range of potential applications. Existing methods mainly address interaction recognition on constrained datasets, which contain few variations in scene, viewpoint, and background clutter and were built for experimental purposes. The performance of recently proposed methods on these constrained datasets is close to saturation, so they are no longer suitable for evaluating the robustness of new methods. In this paper, we introduce a new unconstrained dataset, called WEB-interaction, collected from the Internet. WEB-interaction better represents realistic scenes and is considerably more challenging than existing datasets. In addition, we evaluate the state-of-the-art interaction recognition pipeline on both the WEB-interaction and UT-interaction datasets. The evaluation results reveal that the MBHx and MBHy components of the Motion Boundary Histogram (MBH) are important feature descriptors for interaction recognition, with MBHx carrying relatively dominant information. Regarding fusion strategy, late fusion benefits performance more than early fusion. The effects of filming conditions are also evaluated on the WEB-interaction dataset. The best average precision (AP) achieved by the different features on WEB-interaction is 44.2 %, and the mean is around 38 %. Compared to the UT-interaction dataset, ours leaves much more room for improvement, which makes it better suited to driving the development of new methods.
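The early-versus-late fusion comparison in the abstract can be sketched as follows. This is an illustrative example only, not the authors' implementation; the feature values and classifier scores are hypothetical. Early fusion concatenates per-descriptor feature vectors (e.g. MBHx and MBHy encodings) before training a single classifier, whereas late fusion trains one classifier per descriptor and combines their scores afterwards.

```python
import numpy as np

def early_fusion(features):
    """Early fusion: concatenate per-descriptor feature vectors
    (e.g. MBHx, MBHy) into one vector fed to a single classifier."""
    return np.concatenate(features)

def late_fusion(scores, weights=None):
    """Late fusion: combine per-descriptor classifier scores
    (e.g. SVM decision values) by a weighted average."""
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        # Equal weights by default; in practice they could be tuned
        # on a validation set.
        weights = np.full(len(scores), 1.0 / len(scores))
    return float(np.dot(weights, scores))

# Hypothetical encodings of one video clip from two descriptors.
mbhx_vec = np.array([0.2, 0.7, 0.1])
mbhy_vec = np.array([0.5, 0.1, 0.4])

fused_vec = early_fusion([mbhx_vec, mbhy_vec])  # length-6 vector for one classifier
fused_score = late_fusion([0.8, 0.6])           # scalar score near 0.7
```

The abstract's finding that late fusion outperforms early fusion is consistent with the intuition that per-descriptor classifiers can each specialize before their evidence is combined.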


Figure 1


Notes

  1. https://sites.google.com/site/gaochenqiang/


Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61102131, 61275099), the Natural Science Foundation of Chongqing Science and Technology Commission (No. cstc2014jcyjA40048), the Cooperation of Industry, Education and Academy of Chongqing University of Posts and Telecommunications (No. WF201404), and the Chongqing Distinguished Youth Foundation (No. CSTC2011jjjq40002).

Author information

Corresponding author

Correspondence to Chenqiang Gao.

About this article

Cite this article

Gao, C., Yang, L., Du, Y. et al. From constrained to unconstrained datasets: an evaluation of local action descriptors and fusion strategies for interaction recognition. World Wide Web 19, 265–276 (2016). https://doi.org/10.1007/s11280-015-0348-y

