
Computers & Graphics

Volume 53, Part A, December 2015, Pages 44-53

40 years of Computer Graphics in Darmstadt
MVE—An image-based reconstruction environment

https://doi.org/10.1016/j.cag.2015.09.003

Highlights

  • End-to-end multi-view geometry reconstruction and texturing pipeline.

  • Multi-scale reconstruction approach.

Abstract

We present an image-based reconstruction system, the Multi-View Environment. MVE is an end-to-end multi-view geometry reconstruction system which takes photos of a scene as input and produces a textured surface mesh as output. The system covers a structure-from-motion algorithm, multi-view stereo reconstruction, generation of extremely dense point clouds, reconstruction of surfaces from point clouds, and surface texturing. In contrast to most image-based geometry reconstruction approaches, our system focuses on the reconstruction of multi-scale scenes, an important aspect in many areas such as cultural heritage. It allows the reconstruction of large datasets in which some detailed regions are captured at much higher resolution than the rest of the scene. Our system provides a graphical user interface for visual inspection of the individual steps of the pipeline, i.e., the structure-from-motion result, the multi-view stereo depth maps, and renderings of scenes and meshes.

Introduction

Acquiring geometric data from natural and man-made objects or scenes is a fundamental field of research in computer vision and graphics. 3D digitization is relevant for designers, the entertainment industry, and for the preservation as well as digital distribution of cultural heritage objects and sites. In this paper, we introduce MVE, the Multi-View Environment, a free software solution for low-cost geometry acquisition from images. The system takes as input a set of photos and provides the algorithmic steps necessary to obtain a high-quality surface mesh of the captured object as final output. This includes structure-from-motion, multi-view stereo, surface reconstruction and texturing.

Geometric acquisition approaches are broadly classified into active and passive scanning. Active scanning technologies for 3D data acquisition exist in various flavors. Time-of-flight and structured-light scanners are known to produce geometry with remarkable detail and accuracy, but these systems require expensive hardware as well as elaborate capture planning and execution. Real-time stereo systems such as the Kinect primarily exist for gaming, but are often used for real-time geometry acquisition. These systems are based on structured infrared light which is emitted into the scene. They are often of moderate quality and limited to indoor settings because of interference with the infrared component of sunlight. Finally, there is some concern that active systems may damage objects of cultural value due to intense light emission.

Passive scanning systems do not emit light, rely purely on the existing illumination, and will not physically affect the subject matter. The main advantage of these systems is the cheap capture setup which does not require special hardware: a consumer-grade camera (or just a smartphone) is enough to capture datasets. The reconstruction process is based on finding visual correspondences in the input images, which, compared to active systems, usually leads to less complete geometry and limits the scenes to static, well-textured surfaces. The modest demands on the capture setup, however, come at the cost of considerably more elaborate software to process the unstructured input. The standard pipeline for geometry reconstruction from images involves four major algorithmic steps (see Fig. 1):

  • Structure-from-Motion (SfM) infers the extrinsic camera parameters (position and orientation) and the camera calibration (focal length and radial distortion) by finding sparse but stable correspondences between images. A sparse point-based 3D representation of the subject is created as a by-product of camera reconstruction.

  • Multi-View Stereo (MVS) reconstructs dense 3D geometry by finding visual correspondences in the images using the estimated camera parameters. These correspondences are triangulated yielding dense 3D information.

  • Surface Reconstruction takes as input a dense point cloud or individual depth maps and produces a globally consistent surface mesh.

  • Surface Texturing computes a consistent texture for the surface mesh using the input images.
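To make the data flow between these four steps concrete, the sketch below drives an MVE-style command-line pipeline from a short Python script. It is a minimal illustration under assumptions: the tool names (makescene, sfmrecon, dmrecon, scene2pset, fssrecon, meshclean, texrecon), their options, and the file layout follow common MVE usage but may vary between releases, so the distribution's own documentation remains authoritative.

    import subprocess

    # Sketch of an MVE-style pipeline; tool names and flags are assumptions
    # for illustration and should be checked against the installed release.
    SCENE = "scene"      # MVE scene directory (hypothetical path)
    IMAGES = "photos"    # directory containing the input photographs
    SCALE = "2"          # depth-map scale level: larger values are faster but coarser

    def run(*cmd):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run("makescene", "-i", IMAGES, SCENE)                      # import images into a scene
    run("sfmrecon", SCENE)                                     # structure-from-motion
    run("dmrecon", "-s" + SCALE, SCENE)                        # multi-view stereo depth maps
    run("scene2pset", "-F" + SCALE, SCENE,                     # fuse depth maps into a point set
        f"{SCENE}/pset-L{SCALE}.ply")
    run("fssrecon", f"{SCENE}/pset-L{SCALE}.ply",              # surface reconstruction
        f"{SCENE}/surface-L{SCALE}.ply")
    run("meshclean", "-t10", f"{SCENE}/surface-L{SCALE}.ply",  # remove low-confidence geometry
        f"{SCENE}/surface-L{SCALE}-clean.ply")
    run("texrecon", f"{SCENE}::undistorted",                   # texture the cleaned mesh
        f"{SCENE}/surface-L{SCALE}-clean.ply",
        f"{SCENE}/textured")

Each stage reads the output of the previous one from the scene directory, so individual steps can typically be re-run (for instance at a different scale) without repeating the whole pipeline.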

It is not surprising that software solutions for end-to-end passive geometry reconstruction are rare. The reason lies in the technical complexity and the effort required to create such tools. Many projects cover parts of the pipeline, such as Bundler [1], VisualSfM [2], or OpenMVG [3] for structure-from-motion reconstruction, PMVS [4] for multi-view stereo, and Poisson Surface Reconstruction [5] for mesh reconstruction. A few commercial software projects offer complete end-to-end pipelines covering SfM, MVS, surface reconstruction, and texturing; these include Arc3D, Agisoft Photoscan and Acute3D Smart3DCapture. All of them are, however, closed source and do not facilitate research. In contrast, we offer a complete pipeline as a free, open source software system, which was introduced in an earlier version of this paper [6].

Our system handles many kinds of scenes, such as compact objects, open outdoor scenes, and controlled studio datasets. It avoids filling holes in regions with insufficient data for a reliable reconstruction. This may leave holes in the surface but does not introduce the artificial geometry common to many global reconstruction approaches. Our software puts a special emphasis on multi-resolution datasets, which can contain very detailed regions within an otherwise less detailed scene. It has been shown that inferior results are produced if the multi-resolution nature of the input data is not considered properly [7], [8], [9].
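The notion of scale in this context can be made concrete: every depth sample covers a lateral footprint in the scene that grows with the sample's depth and shrinks with the camera's focal length, so close-up photos contribute much finer samples than distant overview shots. The small calculation below is an illustrative sketch of such a per-sample footprint (it is not code taken from MVE); multi-scale approaches in the spirit of [8], [9] attach a scale value of this kind to every sample so that coarse samples do not wash out fine detail.

    def sample_footprint(depth, focal_length_px, pixel_spacing=1.0):
        """Approximate lateral size (in scene units) covered by one depth sample.
        focal_length_px is the focal length in pixels; pixel_spacing accounts for
        depth maps computed at reduced resolution (e.g. 2 for half-size maps)."""
        return pixel_spacing * depth / focal_length_px

    # A close-up photo (depth 0.5 m) versus an overview photo (depth 10 m),
    # both taken with a focal length of 2000 pixels:
    print(sample_footprint(0.5, 2000.0))   # 0.00025 m per sample: fine detail
    print(sample_footprint(10.0, 2000.0))  # 0.005 m per sample: 20x coarser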

In the remainder of the paper we first give a technical overview of our system and introduce its individual components in Section 2. A few practical aspects and limitations of our system are discussed in Section 3. We then show reconstruction results on several datasets with different characteristics and demonstrate the versatility of our pipeline in Section 4. We briefly describe our software framework and conclude in Section 5.


System overview

Our system consists of four steps: structure-from-motion (SfM), which reconstructs the camera parameters; multi-view stereo (MVS), which establishes dense visual correspondences; a meshing step, which merges the MVS geometry into a globally consistent mesh; and finally a texturing step, which creates seamless textures from the input images. In the following, we give a concise overview of the process, using the Bronze Statue dataset as an example of a cultural heritage artifact, see Fig. 1. For a
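At the core of both the SfM and MVS steps, 3D structure is obtained by triangulating corresponding image observations from views with known camera parameters. The following numpy sketch shows the standard linear (DLT) two-view triangulation of a single point; it illustrates the general principle only and is not the specific implementation used in MVE.

    import numpy as np

    def triangulate_dlt(P1, P2, x1, x2):
        """Triangulate one 3D point from two pixel observations.
        P1, P2: 3x4 projection matrices (K [R|t]); x1, x2: 2D image points."""
        A = np.array([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        # The homogeneous 3D point is the right null vector of A (via SVD).
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        return X[:3] / X[3]

    # Example: two cameras one unit apart along the x-axis, both looking along +z.
    K = np.array([[1000.0, 0, 320], [0, 1000.0, 240], [0, 0, 1]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
    X_true = np.array([0.2, 0.1, 4.0, 1.0])
    h1, h2 = P1 @ X_true, P2 @ X_true
    print(triangulate_dlt(P1, P2, h1[:2] / h1[2], h2[:2] / h2[2]))  # ~ [0.2 0.1 4.0]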

Practical aspects

In this section we discuss some aspects that should be considered when using our image-based reconstruction system. We present some guidelines that can help users capture better input data in order to facilitate high-quality results. We also discuss some limitations of the presented approaches, which apply not only to our reconstruction system but more generally to this class of algorithms.

Reconstruction results

In the following, we show results on a few datasets we acquired over time. We selected a variety of scenarios to show the broad applicability of our system.

Duck: The first dataset, called Duck, was captured in a controlled studio environment and contains 160 images of a small, diffuse ceramic duck figurine, see Fig. 11. This is a relatively compact dataset with uniform scale as the images have the same resolution and are evenly spaced around the object. Notice that, although the individual

Conclusion

In this paper we presented MVE, the Multi-View Environment, a free and open 3D reconstruction application, relevant to the cultural heritage community. It is versatile, operates on a broad range of datasets, and is able to handle quite uncontrolled photos. It is thus suitable for reconstruction amateurs. Our focus on multi-scale data allows emphasis to be placed on interesting parts of larger scenes using close-up photos. We believe that the effort and expert knowledge that went into

Acknowledgments

Part of the research leading to these results has received funding from the European Commission's FP7 Framework Programme under grant agreements ICT-323567 (HARVEST4D) and ICT-611089 (CR-PLAY), the DFG Emmy Noether fellowship GO 1752/3-1 as well as the Intel Visual Computing Institute (Project RealityScan).

References (31)

  • H. Bay et al. Speeded-up robust features (SURF). Comput Vis Image Understand (CVIU) (2008)
  • M. Callieri et al. Masked photo blending: Mapping dense photographic dataset on high-resolution sampled 3D models. Comput Graph (2008)
  • N. Snavely et al. Photo tourism: Exploring photo collections in 3D. Trans Graph (2006)
  • Wu C. Towards linear-time incremental structure from motion. In: International conference on 3D vision (3DV). 2013, p....
  • Moulon P, Monasse P, Marlet R, et al. OpenMVG,...
  • Y. Furukawa et al. Accurate, dense, and robust multi-view stereopsis. Trans Pattern Anal Mach Intell (PAMI) (2010)
  • M. Kazhdan et al. Screened Poisson surface reconstruction. Trans Graph (2013)
  • Fuhrmann S, Langguth F, Goesele M. MVE – a multi-view reconstruction environment. In: Eurographics workshop on graphics...
  • Mücke P, Klowsky R, Goesele M. Surface reconstruction from multi-resolution sample points. In: Vision, Modeling and...
  • Fuhrmann S, Goesele M. Fusion of depth maps with multiple scales. In: SIGGRAPH Asia; 2011. p....
  • Fuhrmann S, Goesele M. Floating Scale Surface Reconstruction. In: SIGGRAPH,...
  • R. Szeliski. Computer Vision: Algorithms and Applications (2010)
  • Armstrong M, Zisserman A, Beardsley PA. Euclidean reconstruction from uncalibrated images. In: British Machine Vision...
  • Pollefeys M, Koch R, Gool LV. Self-calibration and metric reconstruction in spite of varying and unknown internal...
  • M. Pollefeys et al. 3D recording for archaeological field work. Comput Graph Appl (CGA) (2003)