ABSTRACT
We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools, including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration, enabling real-time inference on mobile devices. The library is based on PyTorch and can be used with any training framework, for example PyTorch Lightning, PySlowFast, or Classy Vision. PyTorchVideo is available at https://pytorchvideo.org/.