
From constrained to unconstrained datasets: an evaluation of local action descriptors and fusion strategies for interaction recognition

Published in: World Wide Web

Abstract

As an important task in computer vision, interaction recognition has attracted extensive attention due to its wide range of potential applications. Existing methods mainly address interaction recognition on constrained datasets, which contain few variations in scene, viewpoint, and background clutter and were built for experimental purposes. The performance of recently proposed methods on these constrained datasets is close to saturation, so they are no longer suitable for evaluating the robustness of new methods. In this paper, we introduce a new unconstrained dataset, called WEB-interaction, collected from the Internet. WEB-interaction better represents realistic scenes and is considerably more challenging than existing datasets. In addition, we evaluate the state-of-the-art interaction recognition pipeline on both the WEB-interaction and UT-interaction datasets. The evaluation results reveal that the MBHx and MBHy components of the Motion Boundary Histogram (MBH) are important feature descriptors for interaction recognition, with MBHx carrying relatively dominant information. Regarding fusion strategy, late fusion benefits performance more than early fusion. The effects of filming conditions are also evaluated on the WEB-interaction dataset. The best average precision (AP) achieved by the different features on WEB-interaction is 44.2 %, and the mean is around 38 %. Compared to the UT-interaction dataset, ours leaves much more room for improvement, which makes it better suited to driving the development of new methods.
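The early-versus-late fusion comparison in the abstract can be sketched as follows. This is an illustrative example only, not the authors' implementation; the feature values and classifier scores are hypothetical. Early fusion concatenates per-descriptor feature vectors (e.g. MBHx and MBHy encodings) before training a single classifier, whereas late fusion trains one classifier per descriptor and combines their scores afterwards.

```python
import numpy as np

def early_fusion(features):
    """Early fusion: concatenate per-descriptor feature vectors
    (e.g. MBHx, MBHy) into one vector fed to a single classifier."""
    return np.concatenate(features)

def late_fusion(scores, weights=None):
    """Late fusion: combine per-descriptor classifier scores
    (e.g. SVM decision values) by a weighted average."""
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        # Equal weights by default; in practice they could be tuned
        # on a validation set.
        weights = np.full(len(scores), 1.0 / len(scores))
    return float(np.dot(weights, scores))

# Hypothetical encodings of one video clip from two descriptors.
mbhx_vec = np.array([0.2, 0.7, 0.1])
mbhy_vec = np.array([0.5, 0.1, 0.4])

fused_vec = early_fusion([mbhx_vec, mbhy_vec])  # length-6 vector for one classifier
fused_score = late_fusion([0.8, 0.6])           # scalar score near 0.7
```

The abstract's finding that late fusion outperforms early fusion is consistent with the intuition that per-descriptor classifiers can each specialize before their evidence is combined.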


Figure 1


Notes

  1. https://sites.google.com/site/gaochenqiang/


Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61102131, 61275099), the Natural Science Foundation of Chongqing Science and Technology Commission (No. cstc2014jcyjA40048), the Cooperation of Industry, Education and Academy of Chongqing University of Posts and Telecommunications (No. WF201404), and the Chongqing Distinguished Youth Foundation (No. CSTC2011jjjq40002).

Author information

Corresponding author

Correspondence to Chenqiang Gao.

About this article

Cite this article

Gao, C., Yang, L., Du, Y. et al. From constrained to unconstrained datasets: an evaluation of local action descriptors and fusion strategies for interaction recognition. World Wide Web 19, 265–276 (2016). https://doi.org/10.1007/s11280-015-0348-y

