Special Section on Cyberworlds 2017
Distributed monocular visual SLAM as a basis for a collaborative augmented reality framework
Introduction
Markerless tracking has been a goal of many augmented reality applications, and Simultaneous Localization and Mapping (SLAM) provides a robust framework to accomplish it. The robotics community defines the SLAM problem as an agent building a map of an unknown environment from sensor data while localizing itself within that map. Accurate localization requires an accurate map, and building an accurate map requires accurate localization; the two tasks must therefore be performed simultaneously so that each benefits the other.
Inexpensive, ubiquitous mobile devices with cameras and image processing capabilities have made the camera a popular sensor choice for SLAM. Most visual SLAM approaches rely on detecting features and use them to generate sparse maps. More recent direct, featureless methods [1] generate semi-dense maps of the environment. Denser maps provide many benefits over sparse maps, including better agent interaction with the environment and its objects, better scene interaction for augmented reality applications, and better object recognition from the richer data. In practice, however, direct featureless methods require significant overlap between key frames, i.e., narrower baselines, which limits the movement of the camera. Furthermore, the direct method alone cannot handle large loop closures.
Many researchers have investigated using multiple agents to perform SLAM, known as collaborative or distributed SLAM. Distributed SLAM increases the robustness of the SLAM process and makes it less vulnerable to catastrophic failures. The main challenges in distributed SLAM are computing map overlaps and sharing information between agents over limited communication bandwidth.
We developed a collaborative augmented reality framework based on distributed SLAM. Agents in our framework do not have any prior knowledge of their relative positions. Each agent generates a local semi-dense map using a direct, featureless SLAM approach, and the framework uses image features in key frames to determine map overlaps between agents. To improve the performance of our system reported in [2], we performed a comprehensive analysis of state-of-the-art keypoint detector/descriptor combinations, defining a quality measure to find the optimal combination. We created the publicly available DIST-Mono distributed monocular visual SLAM dataset to evaluate our system. Furthermore, we developed a proof-of-concept augmented reality application to demonstrate the potential of our framework.
Related work
In a seminal paper, Smith et al. [3] introduced an Extended Kalman Filter (EKF) based solution to the SLAM problem (EKF-SLAM). The EKF incrementally estimates the posterior distribution over the agent pose and landmark positions. The covariance matrix grows with the number of landmarks, and even a single landmark observation triggers an update of the full matrix, limiting the number of landmarks EKF-SLAM can handle due to the excessive computational cost. Furthermore, EKF-SLAM relies on Gaussian noise assumptions, which limits its ability to represent the multi-modal distributions that arise in practice.
System overview
Our framework consists of two types of distributed nodes: exploring nodes and a monitoring node. These nodes are deployed on different physical machines, and each is given a globally unique identifier. At any given time, the framework has one monitoring node and multiple exploring nodes. The nodes pass messages to each other over communication channels.
We use the Robot Operating System (ROS) [21] infrastructure for our framework. In ROS, nodes are the processes responsible for performing computations.
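The node/topic pattern described above can be sketched in plain Python. This is an illustrative stand-in, not the actual ROS API: a real deployment would use rospy or roscpp publishers and subscribers over the network, and the `Broker` class here is a hypothetical in-process substitute for the ROS topic transport.

```python
import itertools
from collections import defaultdict

_id_counter = itertools.count()  # source of globally unique node identifiers

class Broker:
    """Hypothetical in-process stand-in for the ROS topic transport."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every callback registered on this topic.
        for callback in self.subscribers[topic]:
            callback(message)

class Node:
    """A framework node: one monitoring node, several exploring nodes."""
    def __init__(self, broker, role):
        self.node_id = next(_id_counter)   # globally unique identifier
        self.role = role                   # "exploring" or "monitoring"
        self.broker = broker

broker = Broker()
monitor = Node(broker, "monitoring")
explorers = [Node(broker, "exploring") for _ in range(3)]

# The monitoring node listens for key frames published by exploring nodes.
received = []
broker.subscribe("/keyframes", received.append)
broker.publish("/keyframes", {"from": explorers[0].node_id, "frame": "kf0"})
```

The one-monitor/many-explorer topology mirrors the framework description; the topic name `/keyframes` is made up for the example.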
Exploring node
Each exploring node performs semi-dense visual SLAM based on the work in [22]. It uses a single camera as its only input device, and maintains a list of key frames and a pose graph to represent its local map.
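A minimal sketch of that local map representation, assuming the key frame fields introduced later in the paper (pose, image, inverse depth map, inverse depth variance map); the field and class names are illustrative, and LSD-SLAM's actual data structures differ.

```python
from dataclasses import dataclass, field

@dataclass
class Keyframe:
    kf_id: int
    pose: tuple                   # absolute pose xi_Wi (e.g. se(3) as a 6-tuple)
    image: object = None          # grayscale image I_i
    inv_depth: object = None      # inverse depth map D_i
    inv_depth_var: object = None  # inverse depth variance map V_i

@dataclass
class LocalMap:
    keyframes: list = field(default_factory=list)
    # Pose-graph constraints stored as (from_id, to_id, relative pose) edges.
    edges: list = field(default_factory=list)

    def add_keyframe(self, kf, constraint_to=None, rel_pose=None):
        self.keyframes.append(kf)
        if constraint_to is not None:
            self.edges.append((constraint_to, kf.kf_id, rel_pose))

m = LocalMap()
m.add_keyframe(Keyframe(0, (0, 0, 0, 0, 0, 0)))
m.add_keyframe(Keyframe(1, (0.1, 0, 0, 0, 0, 0)),
               constraint_to=0, rel_pose=(0.1, 0, 0, 0, 0, 0))
```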
Monitoring node
Exploring nodes in our distributed framework do not know their relative poses at the beginning. The monitoring node's map overlap detection module is responsible for detecting overlaps and computing the corresponding relative pose between nodes. It also detects loop closures for each exploring node.
The monitoring node maintains N key frame databases DBi, where N equals the number of exploring nodes in the framework. All incoming key frames are matched against all of these databases.
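The matching loop can be sketched as follows. This is a simplified stand-in: the set-based similarity score is a placeholder for the paper's actual detector/descriptor matching, and the threshold value is invented for the example.

```python
def similarity(desc_a, desc_b):
    """Placeholder Jaccard-style score over two descriptor collections."""
    a, b = set(desc_a), set(desc_b)
    return len(a & b) / len(a | b) if a | b else 0.0

class MonitoringNode:
    def __init__(self, node_ids):
        # One key frame database DB_i per exploring node.
        self.databases = {nid: [] for nid in node_ids}

    def on_keyframe(self, sender_id, descriptors, threshold=0.5):
        """Match an incoming key frame against every database, then store it."""
        matches = []
        for nid, db in self.databases.items():
            if nid == sender_id:
                continue  # overlap is sought between *different* nodes
            for kf_desc in db:
                score = similarity(descriptors, kf_desc)
                if score >= threshold:
                    matches.append((nid, score))
        self.databases[sender_id].append(descriptors)
        return matches

mon = MonitoringNode([1, 2])
mon.on_keyframe(1, ["a", "b", "c"])       # first key frame: nothing to match yet
hits = mon.on_keyframe(2, ["a", "b", "d"])  # overlaps with node 1's key frame
```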
Determining overlap between two maps
Fig. 8 is a flowchart describing how the overlap between two maps is determined. As discussed earlier, a map in our framework is represented by a set of key frames and a pose graph. The ith key frame consists of an absolute pose ξWi, an image Ii, an inverse depth map Di, an inverse depth variance map Vi, and a list of features Fi. Each feature in Fi is filtered on its variance Vi(xp) to determine its saliency, where xp is the location of the feature.
The pth feature in Fi should satisfy a threshold condition on its inverse depth variance Vi(xp).
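The saliency filter can be sketched as follows: keep the pth feature at pixel xp only if its inverse depth variance Vi(xp) is below a threshold, so that only features with reliable depth estimates are used for matching. The concrete threshold value here is illustrative, not the paper's.

```python
def filter_features(features, inv_depth_var, max_var=0.01):
    """Keep features whose inverse depth variance V_i(x_p) is below max_var.

    features      -- list of (x, y) pixel locations x_p
    inv_depth_var -- mapping from pixel location to V_i(x_p); missing
                     pixels are treated as having unknown (infinite) variance
    """
    return [xp for xp in features
            if inv_depth_var.get(xp, float("inf")) < max_var]

V = {(10, 20): 0.002, (30, 40): 0.5, (50, 60): 0.009}
kept = filter_features([(10, 20), (30, 40), (50, 60)], V)
# The feature at (30, 40) is rejected: its depth estimate is too uncertain.
```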
Public datasets
To evaluate our system, we need a monocular visual SLAM dataset with multiple trajectories covering a single scene. We considered publicly available datasets, but they did not satisfy our requirements. For example, the EuRoC dataset [34] contains pure rotations, which did not work well with the monocular SLAM approach we used. KITTI [35] is mainly a stereo dataset; even when we considered a single camera, the direct monocular SLAM process failed since the camera motion is mostly along the optical axis.
System implementation
We developed the exploring and monitoring nodes as ROS nodes, using the ROS Indigo Igloo infrastructure on the Ubuntu 14.04 LTS (Trusty) operating system. For both the framework implementation and the comprehensive analysis of state-of-the-art feature detector and descriptor combinations, we used version 2.4.8 of the OpenCV library.
Nodes in the framework communicate with each other using ROS topics. We used ROS statistics to measure bandwidth utilization on those communication channels.
AR application
We added an AR window to each exploring node to test our framework. The AR window allows users to add a virtual object (a simple cube, in our example) into the node's map, which lets us demonstrate the collaborative AR potential of the distributed SLAM framework. Each exploring node has its own local map, so it can render the augmented scene from its viewpoint. It also knows its pose in the global map, which allows it to render objects added by the other exploring nodes as well.
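The step that lets one node render another node's object can be sketched as a pose composition: the object's pose in the other node's map is composed with the relative transform between the two maps (the transform the monitoring node's overlap detection estimates). The 4x4 homogeneous matrices and concrete values below are illustrative.

```python
def matmul(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(tx, ty, tz):
    """Homogeneous 4x4 transform with a pure translation."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

# A cube placed at (1, 0, 0) in node B's local map.
T_obj_in_B = translation(1, 0, 0)
# Node B's map origin sits at (0, 2, 0) in node A's map (hypothetical
# relative pose, as estimated by the monitoring node).
T_B_in_A = translation(0, 2, 0)

# Node A renders the cube at its pose expressed in A's own map.
T_obj_in_A = matmul(T_B_in_A, T_obj_in_B)
```

In the full system the relative pose would carry rotation as well; a pure translation keeps the composition easy to check by hand.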
Conclusion
In this paper, we introduced a distributed SLAM framework that identifies map overlaps using an appearance-based method. For that method, we performed a comprehensive analysis of state-of-the-art keypoint detectors and descriptors and introduced a quality measure to select the best combination for a distributed visual SLAM framework. The framework operates with no prior knowledge of the relative starting poses of its nodes. Using an AR application, we have shown that our framework can serve as a basis for collaborative augmented reality.
References (38)
Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vision (2004).
Alahi A, et al. FREAK: Fast retina keypoint. Proceedings of the 2012 IEEE conference on computer vision and pattern recognition (CVPR) (2012).
Engel J, et al. LSD-SLAM: Large-scale direct monocular SLAM. Computer vision, ECCV 2014. Lecture Notes in Computer Science (2014).
Egodagamage R, Tuceryan M. A collaborative augmented reality framework based on distributed visual SLAM. Proceedings of the international conference on cyberworlds (CW) (2017).
Smith R, et al. Estimating uncertain spatial relationships in robotics.
Montemerlo M, et al. FastSLAM: A factored solution to the simultaneous localization and mapping problem. Proceedings of the AAAI national conference on artificial intelligence (2002).
Davison AJ, et al. MonoSLAM: Real-time single camera SLAM. IEEE Trans Pattern Anal Mach Intell (2007).
Klein G, Murray D. Parallel tracking and mapping for small AR workspaces. Proceedings of the 2007 6th IEEE and ACM international symposium on mixed and augmented reality (2007).
Fischler MA, Bolles RC. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM (1981).
Hartley R, Zisserman A. Multiple View Geometry in Computer Vision (2004).