ABSTRACT
Real-world visual recognition is far more complex than object recognition: there is stuff without distinctive shape or appearance, and the same object appearing in different contexts calls for different actions. While we need context-aware visual recognition, visual context is hard to describe and impossible to label manually. We consider visual context as semantic correlations between objects and their surroundings, which include both object instances and stuff categories. We approach contextual object recognition as a pixel-wise feature representation learning problem that accomplishes supervised panoptic segmentation while discovering and encoding visual context automatically. Panoptic segmentation is a dense image parsing task that segments an image into regions with both semantic category and object instance labels. These two aspects can conflict with each other, since two adjacent cars have the same semantic label but different instance labels. Whereas most existing approaches handle the two labeling tasks separately and then fuse the results, we propose a single pixel-wise feature learning approach that unifies semantic segmentation and instance segmentation. Our work takes the metric learning perspective of SegSort but extends it non-trivially to panoptic segmentation, as we must merge segments into proper instances and handle instances of various scales. Our most exciting result is the emergence of visual context in the feature space through contrastive learning between pixels and segments, such that we can retrieve a person crossing a somewhat empty street without any such context labeling. Our experimental results on Cityscapes and PASCAL VOC demonstrate that, in terms of the distribution of surrounding semantics, our retrievals are much more consistent with the query than those of the state-of-the-art segmentation method, validating our pixel-wise representation learning approach for the unsupervised discovery and learning of visual context.
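To make the idea of contrastive learning between pixels and segments concrete, the following is a minimal sketch in plain Python. It is written in the spirit of SegSort-style metric learning and is not the authors' implementation: the function names, the concentration parameter `kappa`, and the toy 2-D embeddings are all assumptions made for illustration. Each pixel embedding is pulled toward the prototypes of segments that share its label and pushed away from all others. Using a (semantic, instance) pair as the label makes two adjacent cars negatives of each other, whereas a semantic-only label would make them positives.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def segment_prototypes(embeddings, segment_ids):
    """Average the unit-length pixel embeddings of each segment, then re-normalize."""
    sums = {}
    for emb, seg in zip(embeddings, segment_ids):
        unit = normalize(emb)
        acc = sums.setdefault(seg, [0.0] * len(unit))
        for i, x in enumerate(unit):
            acc[i] += x
    return {seg: normalize(acc) for seg, acc in sums.items()}


def pixel_segment_contrastive_loss(embeddings, segment_ids, segment_labels, kappa=10.0):
    """Mean over pixels of -log(positive segment mass / total segment mass).

    A segment counts as positive for a pixel when it carries the same label
    as the pixel's own segment. `kappa` (hypothetical default) sharpens the
    similarity scores before exponentiation.
    """
    prototypes = segment_prototypes(embeddings, segment_ids)
    total = 0.0
    for emb, seg in zip(embeddings, segment_ids):
        unit = normalize(emb)
        pos, denom = 0.0, 0.0
        for other, proto in prototypes.items():
            score = math.exp(kappa * sum(a * b for a, b in zip(unit, proto)))
            denom += score
            if segment_labels[other] == segment_labels[seg]:
                pos += score
        total += -math.log(pos / denom)
    return total / len(embeddings)


# Toy example: four pixels, two segments, each segment a different car.
segment_ids = [0, 0, 1, 1]
panoptic_labels = {0: ("car", 1), 1: ("car", 2)}   # semantic + instance
semantic_labels = {0: "car", 1: "car"}             # semantic only

separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
entangled = [[1.0, 1.0], [1.0, 0.9], [1.0, 1.0], [0.9, 1.0]]

print(pixel_segment_contrastive_loss(separated, segment_ids, panoptic_labels))
print(pixel_segment_contrastive_loss(entangled, segment_ids, panoptic_labels))
print(pixel_segment_contrastive_loss(entangled, segment_ids, semantic_labels))
```

Under panoptic labels, the loss is small only when the two cars occupy distinct directions in the embedding space. Under semantic-only labels, every segment is a positive and the loss vanishes regardless of the embeddings, which illustrates why instance information has to enter the label if a single embedding is to serve both tasks.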
Supplemental Material
The supplementary material includes more qualitative and quantitative results of our method Panoptic Segment Sorting.