ABSTRACT
We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools, including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration, enabling real-time inference on mobile devices. The library is based on PyTorch and can be used with any training framework, for example PyTorch Lightning, PySlowFast, or Classy Vision. PyTorchVideo is available at https://pytorchvideo.org/.