FAR: Fourier Aerial Video Recognition

Kothandaraman, Divya; Guan, Tianrui; Wang, Xijun; Hu, Shuowen; Lin, Ming; Manocha, Dinesh

doi:10.1007/978-3-031-19836-6_37

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13697)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We present an algorithm, Fourier Activity Recognition (FAR), for UAV video activity recognition. Our formulation uses a novel Fourier object disentanglement method to innately separate out the human agent (which is typically small) from the background. Our disentanglement technique operates in the frequency domain to characterize the extent of temporal change of spatial pixels, and exploits the convolution-multiplication properties of the Fourier transform to map this representation to the corresponding object-background entangled features obtained from the network. To encapsulate contextual information and long-range space-time dependencies, we present a novel Fourier Attention algorithm, which emulates the benefits of self-attention by modeling the weighted outer product in the frequency domain. Our Fourier attention formulation uses far fewer computations than self-attention. We have evaluated our approach on multiple UAV datasets, including UAV Human RGB, UAV Human Night, Drone Action, and NEC Drone. We demonstrate a relative improvement of 8.02%–38.69% in top-1 accuracy and a speedup of up to 3 times over prior works.
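The idea of characterizing "the extent of temporal change of spatial pixels" in the frequency domain can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name temporal_change_mask, the toy clip, and the choice to score each pixel by the share of its spectral energy outside the zero-frequency term are all assumptions made for illustration only.

```python
import numpy as np


def temporal_change_mask(video):
    """Score how much each pixel changes over time (hypothetical helper).

    video: array of shape (T, H, W) holding T grayscale frames.
    Returns an (H, W) array in [0, 1]: the fraction of each pixel's
    temporal spectral energy that lies outside the zero-frequency (DC)
    bin. Static pixels score near 0; pixels that vary over time score higher.
    """
    spectrum = np.fft.fft(video, axis=0)      # 1-D FFT along the time axis
    energy = np.abs(spectrum) ** 2            # per-frequency energy
    total = energy.sum(axis=0)                # all bins, including DC
    dynamic = energy[1:].sum(axis=0)          # every bin except DC
    return dynamic / np.maximum(total, 1e-12)  # guard against division by zero


# Toy clip (assumed data): a constant background with one oscillating pixel.
T, H, W = 16, 8, 8
video = np.full((T, H, W), 0.5)
t = np.arange(T)
video[:, 3, 4] = 0.5 + 0.5 * np.sin(2 * np.pi * 2 * t / T)

mask = temporal_change_mask(video)
row, col = np.unravel_index(mask.argmax(), mask.shape)
print("most dynamic pixel:", (int(row), int(col)))
```

In this toy setting the mask is close to zero everywhere except at the oscillating pixel, so multiplying it element-wise with a feature map of the same spatial size would suppress static background responses. How FAR actually maps such a representation onto network features is described in the paper itself.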

The second and third authors contributed equally.

Acknowledgements

We thank Rohan Chandra for reviewing the paper. This research has been supported by ARO Grants W911NF1910069, W911NF2110026 and Army Cooperative Agreement W911NF2120076.

Author information

Authors and Affiliations

University of Maryland, College Park, USA
Divya Kothandaraman, Tianrui Guan, Xijun Wang, Ming Lin & Dinesh Manocha
DEVCOM Army Research Laboratory, Adelphi, USA
Shuowen Hu

Corresponding author

Correspondence to Divya Kothandaraman.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 2 (mp4 13100 KB)

Supplementary material 1 (pdf 4215 KB)

About this paper

Cite this paper

Kothandaraman, D., Guan, T., Wang, X., Hu, S., Lin, M., Manocha, D. (2022). FAR: Fourier Aerial Video Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_37

DOI: https://doi.org/10.1007/978-3-031-19836-6_37
Published: 22 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19835-9
Online ISBN: 978-3-031-19836-6
eBook Packages: Computer Science, Computer Science (R0)
