Contextual Image Parsing via Panoptic Segment Sorting

Authors:
Jyh-Jing Hwang, University of California, Berkeley, Berkeley, CA, USA
Tsung-Wei Ke, University of California, Berkeley, Berkeley, CA, USA
Stella X. Yu, University of California, Berkeley, Berkeley, CA, USA

MULL'21: Multimedia Understanding with Less Labeling on Multimedia Understanding with Less Labeling, October 2021, Pages 27–36
https://doi.org/10.1145/3476098.3485056

Published: 20 October 2021

ABSTRACT

Real-world visual recognition is far more complex than object recognition: there is stuff without distinctive shape or appearance, and the same object appearing in different contexts calls for different actions. Although we need context-aware visual recognition, visual context is hard to describe and impossible to label manually. We consider visual context as semantic correlations between objects and their surroundings, which include both object instances and stuff categories. We approach contextual object recognition as a pixel-wise feature representation learning problem that accomplishes supervised panoptic segmentation while discovering and encoding visual context automatically. Panoptic segmentation is a dense image parsing task that segments an image into regions with both semantic category and object instance labels. These two aspects can conflict with each other: two adjacent cars have the same semantic label but different instance labels. Whereas most existing approaches handle the two labeling tasks separately and then fuse the results, we propose a single pixel-wise feature learning approach that unifies semantic segmentation and instance segmentation. Our work takes the metric learning perspective of SegSort but extends it non-trivially to panoptic segmentation, as we must merge segments into proper instances and handle instances of various scales. Our most exciting result is the emergence of visual context in the feature space through contrastive learning between pixels and segments, such that we can retrieve a person crossing a somewhat empty street without any such context labeling. Our experimental results on Cityscapes and PASCAL VOC demonstrate that, in terms of the distribution of surrounding semantics, our retrievals are much more consistent with the query than those of the state-of-the-art segmentation method, validating our pixel-wise representation learning approach for the unsupervised discovery and learning of visual context.
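
To make the idea of contrastive learning between pixels and segments more concrete, the following is a minimal sketch, not the authors' implementation. It assumes unit-length pixel embeddings, segment prototypes computed as normalized mean embeddings, and a softmax-style loss with a hypothetical concentration parameter `kappa`; the function names and the toy 2-D embeddings are illustrative only. It also shows the conflict noted in the abstract: two adjacent cars are positives for each other under semantic labels but negatives under panoptic (category, instance) labels.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def prototype(pixels):
    """Mean of unit-length pixel embeddings, projected back onto the unit sphere."""
    dim = len(pixels[0])
    mean = [sum(p[d] for p in pixels) / len(pixels) for d in range(dim)]
    return normalize(mean)

def pixel_to_segment_loss(pixel, pixel_label, prototypes, labels, kappa=10.0):
    """Negative log of the softmax mass that falls on segments sharing the pixel's label.

    kappa is an assumed concentration (inverse temperature) parameter.
    """
    scores = [
        math.exp(kappa * sum(a * b for a, b in zip(pixel, proto)))
        for proto in prototypes
    ]
    positive = sum(s for s, lab in zip(scores, labels) if lab == pixel_label)
    return -math.log(positive / sum(scores))

# Toy 2-D embeddings (made up for illustration): two adjacent cars and a road.
car_a = [normalize([1.0, 0.1]), normalize([1.0, 0.2])]
car_b = [normalize([0.9, 0.5]), normalize([0.8, 0.6])]
road = [normalize([0.1, 1.0]), normalize([0.0, 1.0])]
prototypes = [prototype(car_a), prototype(car_b), prototype(road)]

semantic_labels = ["car", "car", "road"]
panoptic_labels = [("car", 1), ("car", 2), ("road", 0)]

query = car_a[0]  # a pixel that belongs to the first car
semantic_loss = pixel_to_segment_loss(query, "car", prototypes, semantic_labels)
panoptic_loss = pixel_to_segment_loss(query, ("car", 1), prototypes, panoptic_labels)
print(semantic_loss, panoptic_loss)
```

Under semantic labels the second car counts as a positive, so the loss for the query pixel is lower; under panoptic labels the same segment becomes a negative that the embedding must push away, which is why a single feature space has to reconcile both kinds of labels.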

Supplemental Material

mull07aux.zip (11.8 MB)

The supplementary material includes more qualitative and quantitative results of our method Panoptic Segment Sorting.

Index Terms

Contextual Image Parsing via Panoptic Segment Sorting
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Structured outputs

Published in
MULL'21: Multimedia Understanding with Less Labeling on Multimedia Understanding with Less Labeling
October 2021
64 pages
ISBN:9781450386814
DOI:10.1145/3476098
Program Chairs:
Xiu-Shen Wei, Nanjing University of Science and Technology, China
Han-Jia Ye, Nanjing University, China
Jufeng Yang, Nankai University, China
Jian Yang, Nanjing University of Science and Technology, China
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
context discovery
context encoding
contrastive learning
image parsing
panoptic segmentation
Qualifiers
- research-article