Abstract
We introduce a novel framework for using natural language to generate and edit 3D indoor scenes, harnessing scene semantics and text-scene grounding knowledge learned from large annotated 3D scene databases. The advantage of natural language editing interfaces is strongest when performing semantic operations at the sub-scene level, acting on groups of objects. We learn how to manipulate these sub-scenes by analyzing existing 3D scenes. We perform edits by first parsing a natural language command from the user and transforming it into a semantic scene graph, which is then used to retrieve sub-scenes matching the command from the databases. We then augment each retrieved sub-scene by incorporating other objects that may be implied by the scene context. Finally, a new 3D scene is synthesized by aligning the augmented sub-scene with the user's current scene: new objects are spliced into the environment, possibly triggering appropriate adjustments to the existing scene arrangement. A suggestive modeling interface that presents multiple interpretations of each user command is used to alleviate ambiguities in natural language. We conduct studies comparing our approach against both prior text-to-scene work and artist-made scenes, and find that our method significantly outperforms prior work and is comparable to handmade scenes even when complex and varied natural sentences are used.
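The parse, retrieve, and augment steps described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the triple-based graph representation, the regex parser, the relation vocabulary, the overlap score, and the names `parse_command`, `retrieve`, and `DATABASE` are all assumptions made for illustration, and the final alignment with the user's current scene is omitted.

```python
import re
from typing import Dict, FrozenSet, List, Tuple

# A scene graph is modeled here (an assumption) as a set of
# (object, relation, anchor) triples.
Triple = Tuple[str, str, str]
SceneGraph = FrozenSet[Triple]


def parse_command(command: str) -> SceneGraph:
    """Toy stand-in for a real language parser.

    Extracts '<noun> <relation> [a|the] <noun>' patterns only.
    """
    pattern = r"(\w+) (next to|on|under) (?:a |the )?(\w+)"
    return frozenset(re.findall(pattern, command.lower()))


# Hypothetical database of annotated sub-scenes.
DATABASE: Dict[str, SceneGraph] = {
    "office_corner": frozenset({
        ("lamp", "on", "desk"),
        ("chair", "next to", "desk"),
        ("monitor", "on", "desk"),
    }),
    "bedside": frozenset({
        ("lamp", "on", "nightstand"),
        ("nightstand", "next to", "bed"),
    }),
    "reading_nook": frozenset({("chair", "next to", "desk")}),
}


def retrieve(
    query: SceneGraph, database: Dict[str, SceneGraph], k: int = 3
) -> List[Tuple[str, int, SceneGraph]]:
    """Rank sub-scenes by how many query triples they contain.

    Returns up to k (name, score, implied) tuples, where `implied` holds
    the triples present in the sub-scene but absent from the query, i.e.
    candidate context objects for the augmentation step.
    """
    scored = []
    for name, graph in database.items():
        score = len(query & graph)
        if score > 0:
            scored.append((name, score, graph - query))
    # Highest score first; ties broken alphabetically by name.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


query = parse_command("Put a lamp on the desk and a chair next to the desk")
results = retrieve(query, DATABASE, k=2)
for name, score, implied in results:
    print(name, score, sorted(implied))
```

Returning the top `k` candidates rather than a single best match loosely mirrors the idea of offering several interpretations of an ambiguous command, although the ranking used here is only a simple triple-overlap count.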
Index Terms
- Language-driven synthesis of 3D scenes from scene databases