DOI: 10.1145/3474085.3478329
Short Paper · MM '21 Conference Proceedings

PyTorchVideo: A Deep Learning Library for Video Understanding

Published: 17 October 2021

ABSTRACT

We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration that enables real-time inference on mobile devices. The library is based on PyTorch and can be used by any training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision. PyTorchVideo is available at https://pytorchvideo.org/.
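To make the abstract's API claims concrete, the following is a minimal sketch of loading a pretrained model from the PyTorchVideo model zoo through torch.hub and running inference on a dummy clip. The "slow_r50" entry point and its expected 8-frame, 224x224 input follow the public model-zoo documentation at the time of writing; treat both as assumptions that may change across library versions.

    # Minimal sketch: pretrained video classification via PyTorchVideo's
    # torch.hub model zoo. Assumes the "slow_r50" entry point (a Slow-pathway
    # ResNet-50 trained on Kinetics-400); names may differ across versions.
    import torch

    model = torch.hub.load("facebookresearch/pytorchvideo", "slow_r50", pretrained=True)
    model = model.eval()

    # Dummy clip shaped (batch, channels, frames, height, width);
    # slow_r50 expects 8 frames at 224x224 resolution.
    clip = torch.randn(1, 3, 8, 224, 224)

    with torch.no_grad():
        logits = model(clip)  # Kinetics-400 class scores, shape (1, 400)

    print(logits.argmax(dim=1))  # predicted class index for the clip

A real pipeline would replace the random tensor with decoded frames from the library's data loaders and transforms, but the same model object drops into any PyTorch training framework, as the abstract notes.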


Published in

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021, 5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Copyright © 2021 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
