research-article

Language-driven synthesis of 3D scenes from scene databases

Published: 04 December 2018

Abstract

We introduce a novel framework for using natural language to generate and edit 3D indoor scenes, harnessing scene semantics and text-scene grounding knowledge learned from large annotated 3D scene databases. The advantage of natural language editing interfaces is strongest when performing semantic operations at the sub-scene level, acting on groups of objects. We learn how to manipulate these sub-scenes by analyzing existing 3D scenes. We perform edits by first parsing a natural language command from the user and transforming it into a semantic scene graph that is used to retrieve corresponding sub-scenes from the databases that match the command. We then augment this retrieved sub-scene by incorporating other objects that may be implied by the scene context. Finally, a new 3D scene is synthesized by aligning the augmented sub-scene with the user's current scene, where new objects are spliced into the environment, possibly triggering appropriate adjustments to the existing scene arrangement. A suggestive modeling interface with multiple interpretations of user commands is used to alleviate ambiguities in natural language. We conduct studies comparing our approach against both prior text-to-scene work and artist-made scenes and find that our method significantly outperforms prior work and is comparable to handmade scenes even when complex and varied natural sentences are used.
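The first stages described above (parse a command into a semantic scene graph, retrieve a matching sub-scene, then add objects implied by context) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `SceneGraph` and `Relation` types, the regular-expression parser, the overlap-based `match_score`, and the two-entry `DATABASE` are hypothetical stand-ins, not the paper's parser, retrieval method, or data. The final alignment step, which needs 3D geometry, is omitted.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A directed spatial relation, e.g. lamp --on--> desk."""
    subject: str
    predicate: str
    target: str


@dataclass
class SceneGraph:
    """Hypothetical semantic scene graph: object labels plus relations."""
    objects: set = field(default_factory=set)
    relations: set = field(default_factory=set)


# Toy grammar: "<det> <noun> <preposition> <det> <noun>".
PATTERN = re.compile(
    r"\b(?:a|an|the)\s+(\w+)\s+(on|under|next to|in front of)\s+(?:a|an|the)\s+(\w+)"
)


def parse_command(command):
    """Turn a simple command into a scene graph (illustrative parser only)."""
    graph = SceneGraph()
    for subject, predicate, target in PATTERN.findall(command.lower()):
        graph.objects.update({subject, target})
        graph.relations.add(Relation(subject, predicate, target))
    return graph


def match_score(query, candidate):
    """Count shared objects, weighting shared relations twice (arbitrary choice)."""
    shared_objects = len(query.objects & candidate.objects)
    shared_relations = len(query.relations & candidate.relations)
    return shared_objects + 2 * shared_relations


def retrieve(query, database):
    """Return the database sub-scene that best matches the query graph."""
    return max(database, key=lambda candidate: match_score(query, candidate))


def implied_objects(query, retrieved):
    """Objects present in the retrieved sub-scene but not named in the command."""
    return retrieved.objects - query.objects


# Made-up sub-scene database with two entries.
DATABASE = [
    SceneGraph(
        {"desk", "lamp", "monitor", "keyboard"},
        {
            Relation("lamp", "on", "desk"),
            Relation("monitor", "on", "desk"),
            Relation("keyboard", "on", "desk"),
        },
    ),
    SceneGraph(
        {"bed", "lamp", "nightstand"},
        {Relation("lamp", "on", "nightstand")},
    ),
]

if __name__ == "__main__":
    query = parse_command("Put a lamp on the desk")
    best = retrieve(query, DATABASE)
    print(sorted(implied_objects(query, best)))
```

In this sketch the desk sub-scene wins because it shares both objects and the relation with the query, and its remaining objects are the candidates for augmentation. A real system would also need synonym handling, ranked alternatives for ambiguous commands, and geometric alignment with the user's current scene.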


Supplemental Material

a212-ma.mp4 (mp4, 45.5 MB)



Published in

ACM Transactions on Graphics, Volume 37, Issue 6
December 2018
1401 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3272127

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
