DOI: 10.1145/3476098.3485056
research-article
Open Access

Contextual Image Parsing via Panoptic Segment Sorting

Published: 20 October 2021

ABSTRACT

Real-world visual recognition is far more complex than object recognition: there is stuff without distinctive shape or appearance, and the same object appearing in different contexts calls for different actions. While we need context-aware visual recognition, visual context is hard to describe and impossible to label manually. We consider visual context as semantic correlations between objects and their surroundings, which include both object instances and stuff categories. We approach contextual object recognition as a pixel-wise feature representation learning problem that accomplishes supervised panoptic segmentation while discovering and encoding visual context automatically. Panoptic segmentation is a dense image parsing task that segments an image into regions with both semantic category and object instance labels. These two aspects can conflict with each other: two adjacent cars have the same semantic label but different instance labels. Whereas most existing approaches handle the two labeling tasks separately and then fuse the results, we propose a single pixel-wise feature learning approach that unifies semantic segmentation and instance segmentation. Our work takes the metric learning perspective of SegSort but extends it non-trivially to panoptic segmentation, as we must merge segments into proper instances and handle instances of various scales. Our most exciting result is the emergence of visual context in the feature space through contrastive learning between pixels and segments, such that we can retrieve a person crossing a somewhat empty street without any such context labeling. Our experimental results on Cityscapes and PASCAL VOC demonstrate that, in terms of the distribution of surrounding semantics, our retrievals are much more consistent with the query than those of the state-of-the-art segmentation method, validating our pixel-wise representation learning approach for the unsupervised discovery and learning of visual context.
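
One possible reading of "contrastive learning between pixels and segments" is sketched below in NumPy. It is a simplified, SegSort-style illustration under our own assumptions, not the authors' implementation: the function name `pixel_to_segment_loss`, the concentration parameter `kappa`, and the use of normalized mean-direction prototypes are all hypothetical choices made for the example.

```python
import numpy as np

def pixel_to_segment_loss(embeddings, segment_ids, segment_labels, kappa=10.0):
    """Simplified pixel-to-segment contrastive loss (illustrative only).

    embeddings:     (N, D) float array, one embedding per pixel.
    segment_ids:    (N,) int array, index of the segment each pixel belongs to.
    segment_labels: (S,) int array, label id of each segment.
    kappa:          assumed concentration (inverse temperature) on cosine similarity.

    Assumes every segment index in [0, S) owns at least one pixel.
    """
    # Project pixel embeddings onto the unit hypersphere.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Segment prototype = normalized sum of its pixels' unit embeddings.
    num_segments = segment_labels.shape[0]
    protos = np.zeros((num_segments, emb.shape[1]))
    np.add.at(protos, segment_ids, emb)
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)

    # Scaled cosine similarity between every pixel and every segment.
    logits = kappa * emb @ protos.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)

    # Positives: segments that share the pixel's label (its own segment included).
    pixel_labels = segment_labels[segment_ids]
    positive = segment_labels[None, :] == pixel_labels[:, None]

    pos_mass = (weights * positive).sum(axis=1)
    total_mass = weights.sum(axis=1)
    return float(np.mean(-np.log(pos_mass / total_mass)))
```

To mimic panoptic labels in this sketch, one could give each (category, instance) pair its own integer id, so that two adjacent cars act as negatives for each other; that encoding is likewise an assumption for illustration.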

Published in

MULL'21: Multimedia Understanding with Less Labeling
October 2021
64 pages
ISBN: 9781450386814
DOI: 10.1145/3476098
Program Chairs: Xiu-Shen Wei, Han-Jia Ye, Jufeng Yang, Jian Yang

Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States