Research at the intersection of robotics and computer vision currently targets the development of a handful of important applications in which some form of artificial, intelligent device is employed to assist humans in the execution of everyday tasks. A few examples include the following:
Intelligent transportation: whether talking about autonomous vehicles on open roads or Automatic Guided Vehicles (AGVs) within campuses, office buildings, or factories, transportation is a time and safety-critical problem that would benefit strongly from either partial or full automation.
Domestic robotics: demographic changes increasingly limit our capacity to manage households or to care for the elderly. Service robots able to execute complex tasks such as moving and manipulating arbitrary objects could provide an answer to this problem.
Intelligence augmentation: eye-wear such as the Microsoft HoloLens already points towards the future form of highly portable, smart devices capable of assisting humans during the execution of everyday tasks. The potential advancement over current devices such as smartphones originates from the superior spatial awareness provided by onboard Simultaneous Localisation And Mapping (SLAM) and Artificial Intelligence (AI) capabilities.
Common to these applications are their mobility and the requirement of real-time operation and interaction with the real world, whether active or passive. The last major road-block towards the achievement of such functionality is believed to be the perception problem. More specifically, the perception problem asks for a real-time solution enabling an intelligent, embodied device to localise itself relative to its immediate environment as well as to perceive that environment, in the sense of generating a virtual scene representation that is useful for the execution of a specific task.
In robotics, this problem is traditionally called Simultaneous Localisation And Mapping (SLAM). A typical realisation of a SLAM system uses either one or several exteroceptive sensors (e.g., camera, depth camera, Lidar) attached to a moving body (e.g., a human hand, a helmet, a robot) in combination with proprioceptive sensors such as an Inertial Measurement Unit (IMU). The incoming sensor stream is processed in real-time to incrementally recover the relative displacement of the sensor as well as a geometric representation of the environment. While many successful systems have been presented in the past, the generated model of the environment traditionally only covers physical boundaries in the form of low-level primitives such as a point cloud, a mesh, or a binary occupancy grid. Such representations may be sufficient to solve the problems of path planning and obstacle-free navigation for robots but are not yet amenable to the solution of complex tasks that require understanding the object-level composition of the environment as well as the semantic meaning of its elements.
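In probabilistic terms, and independent of any specific system reviewed below, this incremental estimation problem is commonly formalised as the posterior (notation introduced here purely for illustration: sensor poses $x_{1:t}$, map $m$, exteroceptive measurements $z_{1:t}$, and inertial or odometric readings $u_{1:t}$)
\[
p(x_{1:t}, m \mid z_{1:t}, u_{1:t}) \;\propto\; p(x_0) \prod_{k=1}^{t} p(x_k \mid x_{k-1}, u_k)\, p(z_k \mid x_k, m),
\]
which practical systems approximate either by filtering or by keyframe-based smoothing.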
Over the past decade, the identification of scene components (i.e., objects) and their semantic meaning has seen substantial progress through the rise of deep-learning-based image processing techniques. Aiming to increase the semantic, object-level understanding reflected by 3D environment representations, the SLAM community has therefore recently begun to include such algorithms for object detection and semantic segmentation in the front-end of SLAM systems and to reflect the added information in an object-level partitioning and a semantic annotation of the generated 3D models. Real-time systems generating such joint geometric-semantic representations from image sequences represent an evolution of more traditional visual SLAM into something referred to as Spatial AI (the term Spatial AI was coined by Andrew Davison in his study FutureMapping: The Computational Structure of Spatial AI Systems [1]).
In recent years, we have seen a number of exciting advancements in the development of Spatial AI systems. However, their performance has thus far mostly been evaluated in qualitative terms, and we currently lack a clear definition of quantitative performance measures as well as easy ways to compare relevant frameworks against each other. An automated benchmark, as has already been introduced for traditional visual-inertial SLAM frameworks, appears to be a prominent gap that must be filled in order to sustain research progress on Spatial AI systems. Our contributions are:
We provide a concise review of currently existing Spatial AI systems along with an exposition of their most important characteristics.
We discuss the structure of current and possible future environment representations and its implications for the structure of the underlying optimisation problems.
We introduce novel synthetic datasets for Spatial AI with all required ground-truth information including camera calibration, sensor trajectories, object-level scene composition, as well as all object attributes such as poses, classes, and shapes readily available.
We propose a set of clear evaluation metrics that analyse all aspects of the output of a Spatial AI system, including localisation accuracy, scene label distributions, the accuracy of the geometry and positioning of all scene elements, as well as the completeness and compactness of the entire scene representation.
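As a purely illustrative sketch of the kind of localisation metric we have in mind, the following snippet computes the root-mean-square absolute trajectory error between corresponding ground-truth and estimated positions after a rigid alignment; the function name and the array layout are assumptions made for this example and do not reflect the actual benchmark implementation.

import numpy as np

def ate_rmse(gt_positions, est_positions):
    # gt_positions, est_positions: (N, 3) arrays of corresponding positions.
    gt = np.asarray(gt_positions, dtype=float)
    est = np.asarray(est_positions, dtype=float)
    # Centre both trajectories before estimating the aligning rotation.
    gt_c = gt - gt.mean(axis=0)
    est_c = est - est.mean(axis=0)
    # Optimal rotation via the Kabsch method (SVD of the cross-covariance).
    U, _, Vt = np.linalg.svd(est_c.T @ gt_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Align the estimate to the ground-truth frame and evaluate residuals.
    est_aligned = (R @ est_c.T).T + gt.mean(axis=0)
    errors = np.linalg.norm(est_aligned - gt, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))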
Our paper is organised as follows. In Section 2
, we provide a review of existing Spatial AI systems. In Section 3
, we introduce possible scene representations produced by a Spatial AI system that would aim at a general satisfaction of the requirements given by its applications. We also discuss how their form is reflected and used in the corresponding graphical optimisation problems. Section 4
then introduces our novel benchmark datasets including their creation, noise addition, and evaluation metrics. To conclude, Section 5
illustrates an application of the benchmark to an example Spatial AI system and discusses its performance as a representative of the current state-of-the-art.
2. Review of Current SLAM Systems and Their Evolution into Spatial AI
One of the first occurrences of the SLAM paradigm dates back to 1991, when Leonard and Durrant-Whyte [2
] exploited a sonar sensor tracking geometric beacons to perform simultaneous localisation and mapping with a small ground vehicle robot. Nowadays, depth measurements in robotics are generated using much more powerful laser measurement devices. A review of all SLAM solutions for all kinds of sensors would however go beyond the scope of this paper, and we focus our discussion on visual-inertial SLAM implementations, which solve the problem for regular cameras complemented by an IMU. Our main interest lies in cameras, as they deliver appearance information, a crucial ingredient for the semantic perception of environments. For a more comprehensive, general review of SLAM, the reader is kindly referred to Cadena et al. [3].
Two landmark contributions on monocular visual SLAM date back to the same year and are given by Davison's [4] Extended Kalman Filter-based MonoSLAM algorithm and Klein and Murray's [5] keyframe-based PTAM algorithm. The latter work in particular utilises two parallel threads to offload large-scale batch optimisation into a less time-critical background thread, a strategy that has been reused in subsequent SLAM implementations such as the state-of-the-art open-source framework ORB-SLAM [6
]. The batch optimisation thread essentially performs bundle adjustment [7
], an optimisation problem over camera poses and 3D point coordinates similar to the one solved in large-scale structure-from-motion pipelines [8
]. Visual SLAM is often complemented by an Inertial Measurement Unit to robustify localisation (e.g., Qin et al. [11]).
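For reference, the bundle adjustment objective mentioned above can be sketched as follows, with notation introduced here for illustration: camera poses $T_i \in SE(3)$, landmarks $X_j \in \mathbb{R}^3$, image observations $u_{ij}$, the camera projection function $\pi$, and an optional robust loss $\rho$:
\[
\min_{\{T_i\},\{X_j\}} \; \sum_{(i,j)} \rho\!\left( \left\| u_{ij} - \pi\!\left( T_i^{-1} X_j \right) \right\|^{2} \right).
\]
Visual-inertial systems typically augment this objective with IMU preintegration factors between consecutive keyframe poses.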
While the most established representation for real-time SLAM systems is a 3D point cloud, the community has explored various alternatives. Delaunoy and Pollefeys [12
] use a triangular mesh to directly model the surface of the environment and impose photometric consistency terms between different views. Newcombe et al. [13
] introduced DTAM, a tracking and mapping approach in which dense depth information is found as a minimal cost surface through a voxel space in which each cell aggregates photometric errors between a point in a reference view and corresponding points in neighbouring frames. As demonstrated in works such as Hornung et al. [14
] and Vespa et al. [15
], the RGBD-SLAM community has also explored the use of binary occupancy or continuous occupancy likelihood fields. Voxel-based volumetric discretisation of space has found another application in the form of implicit, distance-field-based mapping. Building on RGBD sensors and the early volumetric fusion work of Curless and Levoy [16
], Newcombe et al. [17
] presented the seminal KinectFusion framework for real-time dense tracking and mapping. The technique has inspired many follow-up contributions, including recent ones that bridge the gap to Euclidean distance fields [18
] and fuse multiple subvolumes for global consistency [19
]. Further popular alternatives for structure representation are given by lines [20
] or surfels, the latter being utilised in Whelan et al.'s ElasticFusion algorithm [22].
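As a brief illustration of the distance-field fusion underlying KinectFusion-style systems, the weighted running-average update of Curless and Levoy can be written per voxel $v$ as
\[
D(v) \leftarrow \frac{W(v)\,D(v) + w(v)\,d(v)}{W(v) + w(v)}, \qquad W(v) \leftarrow W(v) + w(v),
\]
where $D$ and $W$ denote the stored truncated signed distance and weight, and $d$ and $w$ denote the distance and weight computed from the current depth frame.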
A commonality of all the above-mentioned algorithms is that they employ either explicit or implicit low-level primitives for representing the environment (i.e., points, lines, triangles, or grid cells). Each element is constrained directly by the measurements, and resilience with respect to noise and missing data may be increased by enforcing local smoothness constraints. However, the mentioned approaches do not impose less localised constraints given by semantic or other higher-level information about certain segments of the environment. As a simple example, consider a situation in which we know that an entire set of 3D points lies on a plane. We would prefer to model this part of the environment using a single instance of plane parameters rather than many individual 3D points. However, such a strategy assumes the existence of a front-end measurement segmentation module able to detect meaningful parts of the environment (e.g., planes, objects of a certain class), establish correspondences between such measurement segments and the corresponding scene segments, or even understand certain geometric properties about them (e.g., 3D object pose). Examples of such front-end modules are given by the Yolo9000 object detection framework [23
], the Mask-RCNN framework for instance-level segmentation [24
], Chen et al.’s hybrid task cascade framework for instance-level segmentation [25
], or PointRCNN for 3D object pose estimation in 3D point clouds [26
], to name just a few.
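To make the plane example above concrete, a set of $N$ coplanar points $\{X_k\}$ requires $3N$ parameters when modelled individually, whereas the plane itself is described by only a unit normal $n$ and an offset $d$, with each associated point or depth measurement contributing a residual of the form
\[
r_k = n^{\top} X_k + d, \qquad \|n\| = 1,
\]
so that the optimisation acts on a handful of plane parameters instead of thousands of point coordinates, provided that the front-end segmentation correctly associates the measurements with the plane.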
A real-time mobile visual perception system that is then able to incorporate higher-level knowledge, such as learned semantic priors, into its 3D environment representation is defined in Davison's studies FutureMapping 1 & 2 [1
] as a Spatial AI system. In other words, a Spatial AI system is characterised by generating a joint geometric-semantic 3D understanding about an environment. In its most basic form, the extracted semantic front-end information is simply used to perform a segmentation and labelling of the 3D representation of the environment. Choudhary et al. [28
] and Gálvez-López et al. [29
] are among the first to propose real-time SLAM systems that actively discover and model objects in the environment. Later, similar object-detector-based systems were introduced by Sünderhauf et al. [30
] and Nakajima and Saito [31
]. Stückler et al. [32
], McCormac et al. [33
] and Pham et al. [35
] investigated, in parallel, the semantic 3D segmentation and labelling of dense representations. While such labelling was initially only performed at the level of object classes, Grinvald et al. [36
] and Rosinol et al. [37
] have most recently investigated object-instance level 3D segmentations of the environment.
As also outlined in Davison’s studies [1
], there are further desired properties of a Spatial AI system, namely memory efficiency and the imposition of higher-level prior knowledge onto the geometric representation of the environment. While the above-mentioned systems already maintain a graphical model describing the object-level structure of the scene as well as the observations of objects in individual frames, complete surface representations still require dense, low-level representations to model the geometry of each individual part of the environment [33
]. The most straightforward way to reduce the dimensionality and employ higher-level priors is given by omitting shape optimisation altogether and employing complete object geometries such as CAD models as known priors. This strategy is pursued by the seminal work SLAM++ by Salas-Moreno et al. [38
], which employs a graphical model that simply parametrises camera and object poses. A more flexible approach was later proposed by Ulusoy et al. [39
], where the assumption of a known CAD model is relaxed to a fixed set of possible 3D shapes. Their work focuses on photometric multiview stereo, and the optimisation of the pose of each object is complemented by a probabilistic, discrete selection of the best element within the shape set. Gunther et al. [40
] propose a similar framework for indoor modelling with a given set of CAD models of pieces of furniture. Another interesting way of including semantic knowledge is proposed by Häne et al. [41
], who use the semantic information to adapt the smoothness constraint between neighbouring voxels in a dense representation.
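To sketch the structure of the graphical model in object-level frameworks such as SLAM++, the problem roughly reduces to a pose graph over camera poses $T_i \in SE(3)$ and object poses $O_j \in SE(3)$, in which each detection of object $j$ in frame $i$ with measured relative pose $Z_{ij}$ contributes an error term of the form
\[
e_{ij} = \left\| \log\!\left( Z_{ij}^{-1}\, T_i^{-1}\, O_j \right)^{\vee} \right\|_{\Sigma_{ij}}^{2},
\]
which is summed over all detections and minimised over all camera and object poses, while the object geometry itself remains fixed to the matched CAD model.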
One obvious disadvantage of frameworks such as SLAM++ [38
] is that they are limited to a small set of possible 3D shapes which are given upfront. In other words, they do not employ a generic class-specific model that would permit the continuous optimisation of a given object shape. As introduced in [42
], low-dimensional shape representations can be obtained using one of several manifold learning approaches (e.g., PCA, kernel-PCA, Isomap, LLE, auto-encoders). However, unsupervised learning and the resulting optimisability of such representations are far from trivial, which is why the first lower-dimensional models employed in the literature are explicit. Güney and Geiger [43
] and Chhaya et al. [44
] employ class-specific, optimisable meshes or wireframe structures, respectively. Hosseinzadeh et al. [45
] and Gay et al. [46
] employ quadrics as shape primitives to approximate the space occupied by certain objects.
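As a sketch of the quadric-based representation, each object can be modelled by a dual quadric $Q^{*}$, a symmetric $4 \times 4$ matrix with nine degrees of freedom, whose projection into a view with camera matrix $P$ yields a dual conic
\[
C^{*} \simeq P\, Q^{*} P^{\top}
\]
that can be compared against the ellipse inscribed in a 2D object detection, thereby constraining the object's position, orientation, and extent.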
Approaches that finally achieve dense, optimisable environment representations by low-dimensional graphical models and implicit, class-specific shape representations are given by Dame et al. [47
] and Engelmann et al. [48
], who both rely on PCA to learn a manifold for an admittedly simple class of shapes: cars. Alismail et al. [49
] and Zhu and Lucey [50
] later on propose more advanced frameworks that rely on deep neural networks to generate point clouds or occupancy grids from the latent low-dimensional representation. Hu et al. [51
] finally extend this approach to a complete SLAM framework that optimises a larger scale graph over many frames and multiple complex-shaped objects of different classes (e.g., chairs, tables) as well as latent shape representations for each object instance in parallel. Note that low-dimensional latent representations for modelling 3D geometry have also been utilised by Bloesch et al. [52
] and Zhi et al. [53
]. More specifically, they rely on photometric consistency to optimise per-keyframe codes that generate depth maps through a deconvolutional architecture. The representation, however, seems suboptimal, as it does not respect object-level partitioning and furthermore leads to redundancy (i.e., one depth map per keyframe despite potentially large overlap between neighbouring keyframes).
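A rough common denominator of these latent-representation approaches can be sketched as follows: each object instance $j$ is assigned a low-dimensional code $z_j$ that a learned decoder $G$ maps to a shape (e.g., a signed distance function or an occupancy grid), and the code is optimised jointly with the object pose $O_j$ by minimising measurement residuals plus a regulariser keeping the code close to the learned manifold,
\[
\min_{O_j,\, z_j} \; \sum_{k} r_k\!\left( O_j, G(z_j) \right)^{2} + \lambda \left\| z_j \right\|^{2},
\]
where the residuals $r_k$ compare, for instance, rendered depth or occupancy of the decoded shape against the corresponding measurements.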
Spatial AI systems that aim at a real-time understanding of both the geometry and the semantics of an environment generally contrast with more time-consuming offline approaches, which start from known camera poses and known low-level geometry and only infer an object-level representation of the scene. For example, Gupta et al. [54
] and Li et al. [55
] make use of a large-scale database of 3D CAD models to estimate object poses and geometries and replace parts of the 3D geometry. Chen et al. [56
] further rely on contextual relationships learned from the 3D database to constrain the reconstruction. More recently, Huang et al. [57
] propose holistic scene grammars to infer scene layouts from single images based on analysis by synthesis. The nondifferentiable optimisation space is traversed using Markov Chain Monte Carlo (MCMC). In contrast, Grabner et al. [58
] propose a discriminative approach to predict an object model and its pose. Although interesting and related, the listed scene layout estimation frameworks are not designed for high efficiency and rely on large databases of object shapes rather than compact, optimisable shape representations.