Aggregating Spatio-temporal Context for Video Object Segmentation

Tao, Yu; Hu, Jian-Fang; Zheng, Wei-Shi

doi:10.1007/978-3-030-60633-6_45


Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12305)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)


Abstract

In this paper, we focus on aggregating spatio-temporal contextual information for video object segmentation. Our approach exploits the spatio-temporal relationships among image regions by modelling the dependencies among the corresponding visual features with a spatio-temporal RNN. The spatio-temporal RNN is placed on top of a pre-trained CNN to embed spatial and temporal information into the feature maps simultaneously. Following the spatio-temporal RNN, we construct an online adaptation module that adapts the learned model to segment specific objects in a given video. We show that this adaptation module can be optimized efficiently with closed-form solutions. Our experiments on two public datasets show that the proposed method performs favorably against state-of-the-art methods in terms of both efficiency and accuracy.
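The abstract states that the adaptation module admits closed-form solutions but does not give the objective on this page. As a minimal sketch only, assuming the module behaves like a ridge-regression classifier over per-pixel features (the function names, the regularizer `lam`, and the 0.5 threshold are all hypothetical and not taken from the paper), a closed-form fit could look like this:

```python
import numpy as np

def fit_closed_form(features, labels, lam=1e-2):
    """Solve ridge regression in closed form: w = (X^T X + lam*I)^{-1} X^T y.

    features: (n_pixels, n_channels) per-pixel feature vectors (assumed input).
    labels:   (n_pixels,) with 1 for object pixels and 0 for background.
    """
    X = np.asarray(features, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)       # regularized Gram matrix
    return np.linalg.solve(A, X.T @ y)  # no iterative optimization needed

def predict_mask(features, w, threshold=0.5):
    """Score every pixel with the fitted weights and threshold the result."""
    return (np.asarray(features, dtype=np.float64) @ w) > threshold

# Usage on synthetic data standing in for first-frame features and mask.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = (X_demo[:, 0] > 0).astype(float)
w_demo = fit_closed_form(X_demo, y_demo)
mask_demo = predict_mask(X_demo, w_demo)
```

Because the solution is a single linear solve, refitting on each new video is cheap, which is one plausible reading of the efficiency claim; the actual formulation in the paper may differ.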

This work is partially supported by the National Key Research and Development Program of China (2018YFB1004903), NSFC (61702567, 61628212), SF-China (61772570), Pearl River S&T Nova Program of Guangzhou (201806010056), Guangdong Natural Science Funds for Distinguished Young Scholar (2018B030306025), and FY19-Research-Sponsorship-185.


Notes

1. The Deeplab-v2 is pre-trained on COCO [10] with a ResNet-101 backbone.
2. Each vertex corresponds to a pixel on the feature map.
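
Note 2 treats each feature-map pixel as a graph vertex. The sketch below illustrates one way a recurrence could pass context between such vertices in space and time. It is an assumption-laden illustration, not the paper's model: it uses a single sweep direction (top-left to bottom-right), a plain tanh cell, and hypothetical weight matrices `U`, `W_s`, and `W_t`.

```python
import numpy as np

def spatio_temporal_rnn(x, U, W_s, W_t):
    """Propagate context over pixel vertices of a feature-map sequence.

    x:   (T, H, W, C) feature maps, one vertex per pixel.
    U:   (C, D) input projection.
    W_s: (D, D) weights for spatial predecessors (pixel above, pixel to the left).
    W_t: (D, D) weights for the temporal predecessor (same pixel, previous frame).
    Returns hidden states of shape (T, H, W, D).
    """
    T, H, W, _ = x.shape
    D = U.shape[1]
    h = np.zeros((T, H, W, D))
    for t in range(T):
        for i in range(H):
            for j in range(W):
                ctx = x[t, i, j] @ U                 # local evidence
                if i > 0:
                    ctx += h[t, i - 1, j] @ W_s      # context from the pixel above
                if j > 0:
                    ctx += h[t, i, j - 1] @ W_s      # context from the pixel to the left
                if t > 0:
                    ctx += h[t - 1, i, j] @ W_t      # context from the previous frame
                h[t, i, j] = np.tanh(ctx)
    return h

# Usage with small random tensors standing in for CNN feature maps.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(3, 4, 5, 6))
U_demo = rng.normal(size=(6, 2)) * 0.1
W_s_demo = rng.normal(size=(2, 2)) * 0.1
W_t_demo = rng.normal(size=(2, 2)) * 0.1
h_demo = spatio_temporal_rnn(x_demo, U_demo, W_s_demo, W_t_demo)
```

Each hidden state depends only on vertices visited earlier in the sweep, so the dependency graph is acyclic and a single pass suffices. A practical model would typically combine several sweep directions and vectorize the loops.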

References

1. Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 221–230 (2017)
2. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
3. Chen, Y., Pont-Tuset, J., Montes, A., Van Gool, L.: Blazingly fast video object segmentation with pixel-wise metric learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1189–1198 (2018)
4. Cheng, J., Tsai, Y.H., Hung, W.C., Wang, S., Yang, M.H.: Fast and accurate online video object segmentation via tracking parts. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7415–7424 (2018)
5. Gui, Y., Tian, Y., Zeng, D.J., Xie, Z.F., Cai, Y.Y.: Reliable and dynamic appearance modeling and label consistency enforcing for fast and coherent video object segmentation with the bilateral grid. IEEE Trans. Circ. Syst. Video Technol. (2019)
6. Hu, Y.T., Huang, J.B., Schwing, A.: MaskRNN: instance level video object segmentation. In: Advances in Neural Information Processing Systems, pp. 325–334 (2017)
7. Hu, Y.T., Huang, J.B., Schwing, A.G.: VideoMatch: matching based video object segmentation. In: Proceedings of the ECCV, pp. 54–70 (2018)
8. Khoreva, A., Benenson, R., Ilg, E., Brox, T., Schiele, B.: Lucid data dreaming for object tracking. In: The DAVIS Challenge on Video Object Segmentation (2017)
9. Liang, H., Tan, Y.: Visual attention guided video object segmentation. In: 2019 14th ICIEA, pp. 345–349. IEEE (2019)
10. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
11. Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2663–2672 (2017)
12. Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732 (2016)
13. Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 DAVIS challenge on video object segmentation (2017). arXiv preprint arXiv:1704.00675
14. Poole, B., Barron, J.T.: The fast bilateral solver. In: Proceedings of 14th European Conference on Computer Vision (ECCV), pp. 617–632 (2016)
15. Ren, X., Pan, H., Jing, Z., Gao, L.: Semi-supervised video object segmentation with recurrent neural network. In: 2019 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), pp. 1–6. IEEE (2019)
16. Shuai, B., Zuo, Z., Wang, B., Wang, G.: Scene segmentation with DAG-recurrent neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 40(6), 1480–1493 (2017)
17. Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.C.: FEELVOS: fast end-to-end embedding learning for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
18. Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. In: The British Machine Vision Conference (2017)
19. Wang, X., Hu, J.F., Lai, J.H., Zhang, J., Zheng, W.S.: Progressive teacher-student learning for early action prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
20. Wang, Z., Ji, S.: Smoothed dilated convolutions for improved dense prediction. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2486–2495 (2018)
21. Wug Oh, S., Lee, J.Y., Sunkavalli, K., Joo Kim, S.: Fast video object segmentation by reference-guided mask propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7376–7385 (2018)
22. Xia, F., Wang, P., Chen, X., Yuille, A.L.: Joint multi-person pose estimation and semantic part segmentation. In: CVPR (2017)
23. Yang, L., Wang, Y., Xiong, X., Yang, J., Katsaggelos, A.K.: Efficient video object segmentation via network modulation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6499–6507 (2018)


Author information

Authors and Affiliations

Sun Yat-sen University, Guangzhou, China
Yu Tao, Jian-Fang Hu & Wei-Shi Zheng
GuangDong Province Key Laboratory of Information Security Technology, Guangzhou, China
Jian-Fang Hu

Authors

Yu Tao
Jian-Fang Hu
Wei-Shi Zheng

Corresponding author

Correspondence to Jian-Fang Hu.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Yuxin Peng
Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Dalian University of Technology, Dalian, China
Huchuan Lu
Chinese Academy of Sciences, Beijing, China
Zhenan Sun
Chinese Academy of Sciences, Beijing, China
Chenglin Liu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xilin Chen
Peking University, Beijing, China
Hongbin Zha
Nanjing University of Science and Technology, Nanjing, China
Jian Yang

About this paper

Cite this paper

Tao, Y., Hu, J.-F., Zheng, W.-S. (2020). Aggregating Spatio-temporal Context for Video Object Segmentation. In: Peng, Y., et al. Pattern Recognition and Computer Vision. PRCV 2020. Lecture Notes in Computer Science, vol 12305. Springer, Cham. https://doi.org/10.1007/978-3-030-60633-6_45


DOI: https://doi.org/10.1007/978-3-030-60633-6_45
Published: 11 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60632-9
Online ISBN: 978-3-030-60633-6
eBook Packages: Computer Science, Computer Science (R0)
