Language-driven synthesis of 3D scenes from scene databases

Authors:
Rui Ma (Simon Fraser University and AltumView Systems Inc.)
Akshay Gadi Patil (Simon Fraser University)
Matthew Fisher (Adobe Research)
Manyi Li (Shandong University and Simon Fraser University)
Sören Pirk (Stanford University)
Binh-Son Hua (University of Tokyo)
Sai-Kit Yeung (Hong Kong University of Science and Technology)
Xin Tong (Microsoft Research Asia)
Leonidas Guibas (Stanford University)
Hao Zhang (Simon Fraser University)

ACM Transactions on Graphics, Volume 37, Issue 6, Article No.: 212, pp 1–16
https://doi.org/10.1145/3272127.3275035

Published: 04 December 2018

Abstract

We introduce a novel framework for using natural language to generate and edit 3D indoor scenes, harnessing scene semantics and text-scene grounding knowledge learned from large annotated 3D scene databases. The advantage of natural language editing interfaces is strongest when performing semantic operations at the sub-scene level, acting on groups of objects. We learn how to manipulate these sub-scenes by analyzing existing 3D scenes. We perform edits by first parsing a natural language command from the user and transforming it into a semantic scene graph that is used to retrieve corresponding sub-scenes from the databases that match the command. We then augment this retrieved sub-scene by incorporating other objects that may be implied by the scene context. Finally, a new 3D scene is synthesized by aligning the augmented sub-scene with the user's current scene, where new objects are spliced into the environment, possibly triggering appropriate adjustments to the existing scene arrangement. A suggestive modeling interface with multiple interpretations of user commands is used to alleviate ambiguities in natural language. We conduct studies comparing our approach against both prior text-to-scene work and artist-made scenes and find that our method significantly outperforms prior work and is comparable to handmade scenes even when complex and varied natural sentences are used.
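The retrieval and augmentation steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `SceneGraph` class, the overlap-based `match_score`, and the helper names are all hypothetical, and the query graph is written by hand rather than parsed from a sentence. The sketch assumes a semantic scene graph can be reduced to a set of object labels plus (subject, relation, object) triples, and that the best sub-scene is the one sharing the most labels and triples with the query.

```python
from dataclasses import dataclass

# All names here are hypothetical, chosen for illustration only.


@dataclass(frozen=True)
class SceneGraph:
    """Objects are category labels; relations are (subject, relation, object) triples."""
    objects: frozenset
    relations: frozenset = frozenset()


def match_score(query: SceneGraph, candidate: SceneGraph) -> int:
    """Count query objects and relations that also appear in the candidate."""
    shared_objects = query.objects & candidate.objects
    shared_relations = query.relations & candidate.relations
    return len(shared_objects) + len(shared_relations)


def retrieve(query: SceneGraph, database: list) -> SceneGraph:
    """Return the database sub-scene with the highest overlap score."""
    return max(database, key=lambda sub: match_score(query, sub))


def implied_objects(query: SceneGraph, retrieved: SceneGraph) -> frozenset:
    """Objects present in the retrieved sub-scene but not named in the command."""
    return retrieved.objects - query.objects


# Hand-written graph standing in for a parsed command such as
# "put a monitor on the desk" (the parsing step is assumed, not shown).
query = SceneGraph(
    objects=frozenset({"desk", "monitor"}),
    relations=frozenset({("monitor", "on", "desk")}),
)

# Toy stand-in for sub-scenes extracted from an annotated scene database.
database = [
    SceneGraph(
        objects=frozenset({"desk", "monitor", "keyboard", "chair"}),
        relations=frozenset({
            ("monitor", "on", "desk"),
            ("keyboard", "on", "desk"),
            ("chair", "in_front_of", "desk"),
        }),
    ),
    SceneGraph(
        objects=frozenset({"bed", "nightstand", "lamp"}),
        relations=frozenset({("lamp", "on", "nightstand")}),
    ),
]

best = retrieve(query, database)
extras = implied_objects(query, best)
print(sorted(extras))
```

In this toy setting, the extra objects carried along by the retrieved sub-scene play the role of the context-implied objects mentioned in the abstract. Aligning the result with the user's current scene is a geometric step that the sketch leaves out.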

Supplemental Material

a212-ma.mp4 (mp4, 45.5 MB)

a212-ma.zip (zip, 30.8 MB): Supplemental files.

Index Terms

Computing methodologies > Computer graphics > Shape modeling > Shape analysis

Published in
ACM Transactions on Graphics Volume 37, Issue 6
December 2018
1401 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3272127
Editor: Takeo Igarashi, The University of Tokyo, Japan
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data-driven 3D scene generation and editing
natural language interface
relational model
semantic scene graph
Qualifiers
- research-article