Absolute Camera Pose Regression Using an RGB-D Dual-Stream Network and Handcrafted Base Poses

Absolute pose regression (APR) for camera localization is a single-shot approach that encodes the information of a 3D scene in an end-to-end neural network. The camera pose output by an APR method can be viewed as a linear combination of base poses. In previous APR methods, the base poses are learned from training data; however, the training data can limit performance, because the learned base poses may not generalize to cover the entire scene. To solve this issue, we use handcrafted base poses instead of learning-based base poses, which prevents overfitting to the camera poses of the training data. Moreover, we use a dual-stream network architecture that processes color and depth images separately to achieve more accurate localization. On the 7 Scenes dataset, the proposed method is among the best in median rotation error and outperforms previous APR methods in median translation error. On the more difficult Oxford RobotCar dataset, the proposed method achieves notable improvements in median translation and rotation errors compared to the state-of-the-art APR methods.


Introduction
Single-shot camera localization (also known as model-based localization, image-based localization, visual localization, or camera relocalization) estimates the camera pose from a single image related to a previously visited scene. It has two phases. In phase 1, a model (e.g., a 3D point cloud or a neural network) is built from a set of images. In phase 2, the camera pose relative to the model is estimated from a single image. Single-shot camera localization can be divided into six categories: the Structure-Based Method [1], Structure-Based Using Image Retrieval [2], Pose Interpolation [3,4], Relative Pose Estimation [5,6], Scene Point Regression [7][8][9], and Absolute Pose Regression [10][11][12][13][14][15].
Assume that in phase 1 the scene model has been built. In phase 2, the Structure-Based Method [1] first extracts local features from the query image, then performs 2D-3D feature matching between the query image and the model, and finally, after obtaining the 2D-3D correspondences, estimates the camera pose with the PnP algorithm. The Structure-Based Using Image Retrieval method [2] is similar; the difference is that it uses image retrieval to reduce the search region for 2D-3D feature matching, which enables fast estimation in large environments. The Pose Interpolation method [3,4] finds the top k images most similar to the query image in the reference database and interpolates the camera poses of these retrieved images to obtain the pose estimate of the query image. After image retrieval, the Relative Pose Estimation method [5,6] employs relative pose regression (RPR) to achieve more accurate localization: the query image and its nearest-neighbor reference image are input into the RPR network simultaneously. The task of this type of RPR network is to learn both the relative pose between the query image and each retrieved image and scene-agnostic global features of a single image. The Scene Point Regression method [7][8][9] is similar to the Structure-Based Method, except that it no longer relies on feature extraction for 2D-3D feature matching; instead, it uses machine learning or deep learning to regress the correspondences between 2D image points and 3D model points. The Absolute Pose Regression method [10][11][12][13][14][15] uses an end-to-end deep-learning model to regress the camera pose of the query image.
Among these six categories, the Absolute Pose Regression (APR) method uses an end-to-end deep-learning model to regress the camera pose of the query image quickly and needs less storage for the neural network model. However, the localization accuracy of APR methods is not as good as that of the other categories. Moreover, the camera pose output by an APR method can be viewed as a linear combination of base poses [16]. Previous APR methods' base poses are learned from training data. APR is a data-driven camera localization approach, in which the distribution of the ground-truth camera trajectories used for training constrains the base poses learned from that data. Three orthonormal vectors suffice for their linear combinations to cover the entire 3D space; since the number of base poses is usually far greater than three, learning-based base poses are redundant. However, because the training trajectory does not necessarily include diverse directions, the redundant base poses may not cover the space outside the training trajectory. In other words, an APR model that adopts learning-based base poses may not generalize. To reduce the impact of this issue, we turn to handcrafted base poses instead of learning-based base poses.
In this paper, we propose an absolute pose regression (APR) method for camera localization using an RGB-D dual-stream network and handcrafted base poses. This method fuses color and depth information for more accurate localization. We use a dual-stream network architecture to process color images and depth images separately and combine it with handcrafted base poses to reduce the network's dependence on the movement trajectories in the training data.
We conducted detailed ablation studies, which demonstrate the efficacy of the proposed method. Furthermore, comprehensive evaluation results demonstrate that on the 7 Scenes dataset, the proposed method is among the best in median rotation error and outperforms previous APR methods in median translation error. On the more difficult Oxford RobotCar dataset, the proposed method achieves notable improvements in median translation and rotation errors compared to the state-of-the-art APR methods.
To summarize, this work has three contributions:

1. We use handcrafted base poses instead of learning-based base poses (e.g., [12][13][14][15]) to estimate the camera poses, which prevents overfitting to the camera trajectories of the training data.

2. We use a dual-stream network to fuse color and depth information and obtain more accurate localization results.

3. On the 7 Scenes dataset, the proposed method is among the best in median rotation error and outperforms previous APR methods in median translation error. On the more difficult Oxford RobotCar dataset, the proposed method achieves notable improvements in median translation and rotation errors compared to the state-of-the-art APR methods.
The remainder of this paper is organized as follows. In Section 2 we review related works, and in Section 3 we describe the proposed APR method. Section 4 presents the experimental results to attest to the effectiveness of the proposed method, and Section 5 concludes this study.

Related Work
In this section, we review related work on learning-based single-shot camera localization approaches.

Structure-Based Method
For the Structure-Based Method [1], the first step is to extract local features from the query image. The second step is 2D-3D feature matching between the query image and the model. After obtaining the 2D-3D correspondences, the third step is to estimate the camera pose using the PnP algorithm.

Structure-Based Using Image Retrieval
Structure-based localization using a 3D map can lead to higher accuracy. However, as the scale of the scene increases, the number of key points increases, leading to increased memory requirements and longer run times for feature matching. This also produces more wrong 2D-3D matches, leading to large localization errors. Therefore, HF-Net [2] adopts coarse-to-fine localization to handle large-scale environments. First, image retrieval is leveraged to find the nearest neighbors of the query image, reducing the search space for subsequent matching. Then, a structure-based pipeline is applied to retrieved images to estimate an accurate and robust camera pose.

Pose Interpolation
Camera localization based on image retrieval [4] involves finding the top k similar images of the query image in the reference database and interpolating the camera poses of these retrieved images to obtain the pose estimate of the query image.

Relative Pose Regression
For more accurate localization, relative pose regression (RPR) [5,6] can be used after image retrieval. The query image and its nearest-neighbor reference image are input into the RPR network simultaneously. The task of this type of RPR network is to learn both the relative pose between the query image and each retrieved image and scene-agnostic global features of a single image. Thus, these methods can be trained on a general scene or on multiple scenes.

Scene Point Regression
Scene coordinate regression no longer relies on feature extraction but implements the first stage through machine learning or deep learning. SCRF [7] employs random forests to regress the global 3D points corresponding to the 2D points in the image. To train in an end-to-end manner, differentiable RANSAC (DSAC) [8] uses one network for scene coordinate regression and another for scoring hypotheses. DSAC++ [9] removes DSAC's requirement for a 3D model during training and uses soft inlier counting instead of a scoring network to select the best camera pose hypothesis.

Absolute Pose Regression
PoseNet [10] is the first method to directly regress 6-DoF camera poses from a single RGB image in an end-to-end manner. Based on PoseNet's network architecture, Bayesian PoseNet [11] adds dropout layers after the FC (fully connected) layers of the position and rotation regressors. To avoid manually tuning the weighting between the absolute position loss and the absolute rotation loss, PoseNet-17 [12] proposes an absolute pose loss with learnable weights. Li's method [13] is the first work to regress camera poses with a dual-stream network, taking color and depth images as input. They propose an encoding method, minimized normal + depth (MND), to provide more information when expanding raw depth images into three-channel images. MapNet [14] achieves better performance by using the logarithm of a unit quaternion instead of the quaternion as the rotation formalism. MapNet+ and MapNet+PGO were proposed alongside MapNet. Starting from a CNN trained by MapNet, MapNet+ further updates the model parameters in a self-supervised manner: when learning from unlabeled data, sensor measurements such as GPS and IMU readings, or the relative camera pose computed by a visual odometry algorithm, serve as pseudo ground truth for the relative pose. During the inference stage, MapNet+PGO feeds the camera poses of consecutive frames predicted by MapNet+ into a pose graph optimization (PGO) algorithm to obtain a smooth predicted pose trajectory. In terms of the loss function, MapNet minimizes the absolute pose loss of a single image and the relative pose loss between image pairs, so the model implicitly learns visual odometry information. Tian's method [15] also expects the model to understand the motion between consecutive frames. In addition to the typical absolute pose loss, a pixel-level photometric loss and an image-level structural similarity (SSIM) loss are introduced as geometric reconstruction constraints. Therefore, the depth image and the predicted relative pose are leveraged to obtain the reconstructed image by 3D warping during training.
According to [16], the camera pose output by an APR method can be viewed as a linear combination of base poses. Previous APR methods' base poses are learned from training data. However, the training data can limit the performance of these methods, because the learned base poses may not generalize to cover the entire scene. In this paper, we use handcrafted base poses instead of learning-based base poses, which prevents overfitting to the camera poses of the training data.

Notation
A camera pose is denoted by p_t = (c_t, r_t), where c_t ∈ R^3 is the camera position and r_t ∈ R^{d_r} is the camera rotation. The symbol d_r denotes the dimension of the rotation parametrization. We use p̂_t = (ĉ_t, r̂_t) to denote a camera pose predicted by absolute camera pose regression.

Theoretical Model
As shown in Figure 1, we divide the network architecture of APR methods into three components based on [16]. The first part extracts features from the input image, usually with a classic CNN architecture, e.g., GoogLeNet, VGG, or ResNet. The weights of this part are initialized from a model pre-trained on a large-scale image recognition dataset (such as ImageNet), so that the network can learn more general and rich feature representations. The second part aims to reduce over-fitting by mapping the learned features into an embedding space. The third part linearly transforms the features or embeddings into the space of the 6-DoF camera pose, implemented by FC (fully connected) layers. Specifically, two FC layers are used as regressors: one for camera position and the other for camera rotation. The output of the position regressor represents the 3D coordinates of the query image in the scene; thus, the output dimension of its FC layer is 3. The output dimension d_r of the rotation regressor, however, depends on the rotation parametrization used. Most APR methods use a 4-d unit quaternion or the 3-d logarithm of a unit quaternion to represent rotation.
Let (x_1, x_2, ..., x_n) and (w_1, w_2, ..., w_n) be the projection matrices of the translation part and the orientation part, respectively, where x_j ∈ R^3 and w_j ∈ R^{d_r} denote the j-th columns. Their bias terms are represented by x_0 and w_0. Features extracted from image I_t are denoted as α_t = (α_t1, α_t2, ..., α_tm)^T ∈ R^m, and the separate embeddings of the feature α_t are denoted by β_t and γ_t. The camera pose predicted by APR can then be expressed as

  ĉ_t = x_0 + Σ_{j=1}^{n} β_tj x_j,   r̂_t = w_0 + Σ_{j=1}^{n} γ_tj w_j.   (1)
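The linear-combination view of Equation (1) can be sketched numerically. The dimensions below (m-dimensional embeddings, d_r = 3) follow the text; the random base poses and embeddings are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 2048   # embedding dimension (illustrative; matches the 2048-d feature in the text)
d_r = 3    # rotation parametrization: logarithm of a unit quaternion

# Columns of X and W are the base translations x_j and base orientations w_j.
X = rng.normal(size=(3, m))
W = rng.normal(size=(d_r, m))
x0, w0 = np.zeros(3), np.zeros(d_r)   # bias terms

# Separate embeddings of the image feature alpha_t (random for illustration).
beta_t = rng.normal(size=m)    # coefficients of the base translations
gamma_t = rng.normal(size=m)   # coefficients of the base orientations

# Equation (1): the predicted pose is a linear combination of the base poses.
c_hat = x0 + X @ beta_t
r_hat = w0 + W @ gamma_t
print(c_hat.shape, r_hat.shape)  # (3,) (3,)
```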

Rotation Formalisms
APR approaches generally utilize the unit quaternion or the rotation vector (the so-called axis-angle) to represent rotation. The unit quaternion has the advantage of being interpolable, so it is often the preferred representation of rotation in computer vision. An arbitrary 4-d regressor output can be normalized and then used to represent a legitimate 3-DoF rotation, but this normalization step has been shown to degrade performance when used during training [10]. Therefore, APR methods that use the quaternion as the rotation formalism generally ignore the sphere constraint of the unit quaternion during training. Unlike the over-parameterized unit quaternion, the rotation vector requires only a 3-d vector to describe rotation; its direction and magnitude represent the axis and angle of rotation, respectively. The conversion (logarithm) of a unit quaternion q with rotation axis n and rotation angle θ, i.e., q = (cos(θ/2), n sin(θ/2)), is

  log q = (θ/2) n.

As can be observed from this formula, the representation is similar to the rotation vector; the only difference is that the length of the logarithm of a unit quaternion equals half of the rotation angle. The proposed method also uses the logarithm of a unit quaternion to represent the camera rotation.
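The quaternion logarithm and its inverse can be sketched as below; the helper names `quat_log` and `quat_exp` are our own, and the example checks the half-angle property stated above for a 90-degree rotation about the z-axis.

```python
import numpy as np

def quat_log(q):
    """Logarithm of a unit quaternion q = (w, x, y, z): returns (theta/2) * n."""
    w, v = q[0], np.asarray(q[1:])
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return np.zeros(3)          # identity rotation
    half_theta = np.arctan2(norm_v, w)
    return half_theta * v / norm_v

def quat_exp(u):
    """Inverse mapping: 3-d logarithm vector back to a unit quaternion."""
    half_theta = np.linalg.norm(u)
    if half_theta < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    n = u / half_theta
    return np.concatenate([[np.cos(half_theta)], np.sin(half_theta) * n])

# 90-degree rotation about the z-axis.
theta = np.pi / 2
q = np.array([np.cos(theta / 2), 0.0, 0.0, np.sin(theta / 2)])
u = quat_log(q)
# The length of the logarithm equals half the rotation angle ...
assert np.isclose(np.linalg.norm(u), theta / 2)
# ... and the exponential map recovers the original quaternion.
assert np.allclose(quat_exp(u), q)
```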

Base Poses
According to [16], we can regard ĉ_t (or r̂_t) as a linear combination of x_j (or w_j) according to Equation (1). For a trained network, the parameters x (the 'base translations') and w (the 'base orientations') of the last two FC layers are fixed and can be regarded as a set of base poses B = {(x_j, w_j)}. The value of an embedding (or feature), which is related to the appearance of the input image, can be viewed as the coefficients of the base poses, measuring their importance. Therefore, we consider that APR methods mainly learn the distribution of base poses and the correlation between image appearance and base poses, so that the matrix product of the embedding (or feature) and the base poses can estimate the camera pose from a single image.
APR is a data-driven camera localization approach, in which the distribution of the ground-truth camera trajectories used for training constrains the base poses learned from that data. Three orthonormal vectors suffice for their linear combinations to cover the entire 3D space. As the input dimension of the last FC layer, the number of base poses is often much larger than three, so learning-based base poses are redundant. However, because the training trajectory does not necessarily include diverse directions, the redundant base poses may not cover the space outside the training trajectory. In other words, an APR model that adopts learning-based base poses may not generalize. To mitigate this limited generalization under limited training data, we turn to handcrafted base poses instead of using FC layers to learn base poses.

Network Architecture
The proposed dual-stream network with handcrafted base poses is designed for RGB-D single-shot camera localization. The overall network architecture is shown in Figure 2. The color stream (top) and the depth stream (bottom) have the same architecture but different weights. In each stream, we use a partially modified ResNet-18 as the backbone: we discard its original FC and softmax layers designed for classification and instead use an FC layer with an output dimension of 2048. The global image features output by the modified ResNet-18 are subsequently mapped into the embedding space by two FC layers. One embedding is viewed as the scale factors of the handcrafted base translations and the other as the scale factors of the handcrafted base orientations. A color image and its corresponding depth image are fed to the two streams, and each stream outputs two embeddings. We fuse color and depth information by adopting the average of the embeddings as the coefficient vectors of the handcrafted base poses. Finally, the predicted camera pose is obtained by matrix multiplication of the handcrafted base poses and the averaged coefficients.
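The wiring described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: a tiny CNN stands in for the modified ResNet-18 to keep the sketch self-contained, the leaky-ReLU slope is an assumption, and the fixed base poses use the ± unit axes (the 'Min' set discussed later).

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for the modified ResNet-18: any feature extractor whose final
    FC layer outputs a 2048-d global feature fits this slot (assumption)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(16, 2048)

    def forward(self, x):
        return self.fc(self.conv(x))

class SingleStream(nn.Module):
    """Backbone -> two FC embeddings: coefficients for the handcrafted base
    translations and base orientations (leaky ReLU keeps them nonlinear)."""
    def __init__(self, n_base=6):
        super().__init__()
        self.backbone = TinyBackbone()
        self.act = nn.LeakyReLU(0.1)
        self.embed_t = nn.Linear(2048, n_base)
        self.embed_r = nn.Linear(2048, n_base)

    def forward(self, x):
        f = self.backbone(x)
        return self.act(self.embed_t(f)), self.act(self.embed_r(f))

class DualStreamAPR(nn.Module):
    """Color and depth streams with separate weights; their embeddings are
    averaged and multiplied by fixed handcrafted base poses."""
    def __init__(self):
        super().__init__()
        self.color = SingleStream()
        self.depth = SingleStream()
        axes = torch.cat([torch.eye(3), -torch.eye(3)])  # (6, 3), not trained
        self.register_buffer("base_t", axes)  # handcrafted base translations
        self.register_buffer("base_r", axes)  # handcrafted base orientations

    def forward(self, rgb, depth):
        ct, cr = self.color(rgb)
        dt, dr = self.depth(depth)
        bt, br = (ct + dt) / 2, (cr + dr) / 2   # fuse by averaging embeddings
        return bt @ self.base_t, br @ self.base_r  # position, log-quaternion

model = DualStreamAPR()
rgb = torch.randn(2, 3, 256, 341)
dep = torch.randn(2, 3, 256, 341)   # depth stacked to three channels
c_hat, r_hat = model(rgb, dep)
print(c_hat.shape, r_hat.shape)
```

Registering the base poses as buffers rather than parameters keeps them fixed during training, which is the point of the handcrafted design.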

Loss Function
As shown in Figure 3, the network model is expected to encode geometric information about the scene through the absolute pose loss of a single image and the relative pose loss between the temporally adjacent images.
Relative pose loss helps a network learn features suitable for localization. Transfer learning is employed by using a model pre-trained for image classification as the feature extractor. Image classification identifies the category to which the subject of an image belongs; thus, images that are similar in appearance also have similar features. When training to regress camera poses, the same subjects often appear in many input images, and the network must learn the transformation of poses from slight differences between similar appearances. We use the linear combination of base poses to predict the relative pose:

  ĉ_t1 − ĉ_t2 = Σ_{j=1}^{n} (β_t1,j − β_t2,j) x_j,   r̂_t1 − r̂_t2 = Σ_{j=1}^{n} (γ_t1,j − γ_t2,j) w_j.   (4)

Note from Equation (4) that if the values of the base poses remain unchanged during training, then the network focuses on learning the difference between the feature embeddings of image pairs by minimizing the relative pose loss, which also takes advantage of the handcrafted base poses.
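The identity behind Equation (4) is simple linearity and can be checked numerically (random base poses and embeddings, zero bias for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

B = rng.normal(size=(3, 6))   # base translations as columns (illustrative)
e1 = rng.normal(size=6)       # embedding of frame t1
e2 = rng.normal(size=6)       # embedding of frame t2 (temporally adjacent)

# Absolute predictions (zero bias for simplicity) ...
c1, c2 = B @ e1, B @ e2
# ... and the relative prediction from the embedding difference alone.
rel = B @ (e1 - e2)

# Because the prediction is linear in the embedding, the two agree exactly,
# so the relative pose loss acts purely on the embedding difference.
assert np.allclose(c1 - c2, rel)
```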


Training Mechanism
We adopt a two-stage training mechanism for the proposed RGB-D dual-stream network. In the first stage, the color stream and depth stream are trained as independent networks, so the operation that averages their feature embeddings (shown in Figure 2) is omitted; each stream's feature embedding is multiplied directly by the handcrafted base poses. As a result, two predictions are obtained, one related only to color information and the other only to depth information. The goal of the first stage is to let the two streams converge separately in preparation for the next stage. The second stage is end-to-end training of the overall RGB-D dual-stream network; by fine-tuning the weights, the dual-stream network better integrates color and depth features to achieve more accurate camera localization.

Implementation Details
Before feeding RGB-D images to the proposed dual-stream network, we preprocess the images. Compared with color image preprocessing, depth images require two more operations: depth completion and stacking (see Section 3.6.1 for details). Both color and full depth images are resized to 341 × 256 pixels and then normalized by subtracting the mean image and dividing by the standard deviation. We use Adam optimization with a learning rate of 1 × 10^-4, a weight decay of 5 × 10^-4, and the default values for the other parameters. With a mini-batch size of 10, we train the two separate streams for 300 epochs in the first stage and the overall dual-stream network for 100 epochs in the second stage. The hyperparameters β and ∆t are set to 3 and 10 for all scenes, and γ is set to 10 for indoor and 1 for outdoor datasets. The algorithm was implemented in PyTorch, and all experiments were performed on an NVIDIA GTX 1080Ti graphics card.
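The resize, normalization, and optimizer settings above can be sketched as follows. The scalar mean/std here are a simplification of the per-pixel mean image described in the text, and `preprocess` is our own helper name.

```python
import torch
import torch.nn.functional as F

def preprocess(img, mean, std):
    """Resize a CxHxW image to 341 x 256 (W x H) and normalize.
    Scalar statistics stand in for the mean image (assumption)."""
    img = F.interpolate(img.unsqueeze(0), size=(256, 341),
                        mode="bilinear", align_corners=False).squeeze(0)
    return (img - mean) / std

x = torch.rand(3, 480, 640)
y = preprocess(x, mean=0.45, std=0.25)
print(tuple(y.shape))  # (3, 256, 341)

# Optimizer settings from the text; default Adam betas/eps are retained.
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.Adam(params, lr=1e-4, weight_decay=5e-4)
```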

Depth Completion
As demonstrated in Figure 4a, invalid depth values sometimes occur when using RGB-D sensors such as Kinect to capture depth images. To maintain the performance of the depth stream, we fill in these depth values with a modified colorization method [17], which was used in the NYU Depth Dataset V2 [18] for the same purpose. This depth completion optimization assumes that adjacent pixels with similar intensity have similar depth. Missing values are imputed by minimizing the difference between the depth value at pixel r and the weighted sum of the depth values at its neighbors N(r):

  min_F Σ_r ( F(r) − Σ_{s∈N(r)} w_rs F(s) )²,

where D and F denote the raw and the full depth image, respectively, and w_rs is a similarity measure between the intensities at pixels r and s:

  w_rs = exp( −(G(r) − G(s))² / (2σ_r²) ).

After depth completion, the single-channel depth images are expanded into three-channel images via stacking to obtain the final input of the feature extractor.
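The objective above is normally solved as a sparse linear system; the sketch below instead uses a simple fixed-point iteration of its optimality condition on 4-neighborhoods, which is an assumed simplification of the colorization method [17], not its actual solver.

```python
import numpy as np

def complete_depth(D, G, valid, sigma=0.1, iters=100):
    """Fill missing depth values: each missing pixel is repeatedly replaced
    by the intensity-weighted average of its 4-neighbours, where the weights
    w_rs = exp(-(G(r)-G(s))^2 / (2*sigma^2)) follow the formula in the text."""
    F = np.where(valid, D, D[valid].mean())   # initialize missing pixels
    H, W = D.shape
    for _ in range(iters):
        Fn = F.copy()
        for r in zip(*np.where(~valid)):
            i, j = r
            num = den = 0.0
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                s = (i + di, j + dj)
                if 0 <= s[0] < H and 0 <= s[1] < W:
                    w = np.exp(-(G[r] - G[s]) ** 2 / (2 * sigma ** 2))
                    num += w * F[s]
                    den += w
            Fn[r] = num / den
        F = Fn
    return F

# Tiny example: one missing pixel surrounded by similar-intensity neighbours.
G = np.full((3, 3), 0.5)                 # intensity image
D = np.full((3, 3), 2.0); D[1, 1] = 0.0  # depth with one invalid pixel
valid = D > 0
full = complete_depth(D, G, valid, iters=20)
assert np.isclose(full[1, 1], 2.0)       # filled with its neighbours' depth
```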

Selection of Handcrafted Base Poses
Based on the discussion of base poses in Section 3.2.3, we select six orthonormal vectors to represent base poses. To preserve non-linearity, we do not adopt the minimum of three base poses. If only three orthonormal vectors in the positive directions were used as base poses, then to guarantee that the computed pose estimate covers the entire 3D space, the value range of the feature embedding (the coefficients of the base poses) would have to be the entire set of real numbers. This requirement can only be met by mapping the features to a linear embedding without an activation function, and such a linear transformation could be folded into the matrix multiplication with the handcrafted base poses, so the network would effectively learn the distribution of base poses again. We therefore use six orthonormal vectors (three in the positive and three in the negative directions) as handcrafted base poses, and we adopt the leaky rectified linear unit (leaky ReLU) as the activation function of the nonlinear embeddings to ease network convergence.
We also designed two sampling methods for redundant handcrafted base poses. Figure 5b shows uniform sampling along the three coordinate axes x, y, and z, which is in line with the human understanding of translation. Figure 5c shows uniform sampling along the surface and the radius of a sphere, which is closer to the definition of the logarithm of a unit quaternion. The coordinates of each point in Figure 5 are viewed as the three values of a base translation or a base orientation. The direction and magnitude of a base translation respectively represent its moving direction and distance. In the same way, the direction and magnitude of a base orientation represent its rotation axis and half of its rotation angle. We compare the different base poses in detail in the next section.
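The three candidate sets can be sketched as small generators. The `Min` set follows the text exactly; the two redundant sets are assumptions about the sampling in Figure 5b,c (in particular, `sphere_sampled` uses random directions rather than a strictly uniform surface grid), and all function names are our own.

```python
import numpy as np

def min_base_poses():
    """'Min' set: six orthonormal vectors, +/- the three coordinate axes."""
    return np.concatenate([np.eye(3), -np.eye(3)])

def axis_sampled(n_per_axis=5, radius=1.0):
    """Redundant set: uniform sampling along the +/- x, y, z axes (Figure 5b)."""
    pos = np.linspace(radius / n_per_axis, radius, n_per_axis)
    steps = np.concatenate([-pos[::-1], pos])     # 2*n_per_axis nonzero steps
    return np.concatenate([np.outer(steps, e) for e in np.eye(3)])

def sphere_sampled(n_dirs=32, n_radii=4, radius=1.0, seed=0):
    """Redundant set: directions on the sphere scaled by several radii (an
    assumed stand-in for the surface/radius sampling of Figure 5c)."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = np.linspace(radius / n_radii, radius, n_radii)
    return np.concatenate([r * dirs for r in radii])

# Each row is one base translation (or base orientation, where direction is
# the rotation axis and magnitude is half the rotation angle).
print(min_base_poses().shape)   # (6, 3)
print(axis_sampled().shape)     # (30, 3)
print(sphere_sampled().shape)   # (128, 3)
```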

Datasets
To compare the proposed method with previous APR methods, we evaluated it on the 7 Scenes [7] indoor dataset, a well-known benchmark for single-shot camera localization. The 7 Scenes dataset contains RGB-D camera frames of seven small-scale indoor scenes captured by a Kinect sensor. For each scene, several sequences are recorded and split into training and test sets. The ground-truth camera poses and 3D models are obtained using an implementation of KinectFusion [19,20]. Note that the 3D model is not used in our proposed method. Details about 7 Scenes are provided in Table 1. The 12 Scenes dataset [21] is a slightly larger RGB-D indoor dataset, similar to 7 Scenes. Several sequences of twelve rooms were captured by a Structure Core sensor coupled with an iPad camera. The researchers used the VoxelHashing framework [22] combined with global bundle adjustment to obtain ground-truth camera poses. For each scene, we used the first sequence as the training set and all other sequences as the test set. Since previous APR methods were not evaluated on this dataset, we employed the 12 Scenes dataset only in the ablation studies. Details about 12 Scenes are provided in Table 2.
Oxford RobotCar [23] is a large-scale outdoor dataset for long-term autonomous driving. The researchers repeatedly traversed the same route of approximately 10 km over a year. Various sensors were mounted on the vehicle: a trinocular stereo camera, three monocular cameras, two 2D LIDARs, a 3D LIDAR, and an inertial and GPS navigation system. The resulting dataset captures a wide range of variations in appearance and structure, such as illuminance changes, seasonal changes, dynamic objects, and roadwork. The proposed method was evaluated on the Loop route, with a total length of 1120 m, and the fused GPS+Inertial data were used as ground-truth camera poses.

Comparison with the State-of-the-Art Methods

Table 3 shows the quantitative comparison of median localization error on 7 Scenes with the state-of-the-art APR methods, including PoseNet-17 [12], Li's method [13], MapNet [14], and Tian's method [15]. The proposed method is very close to Tian's method in terms of the median translation error and outperforms all methods in terms of the median rotation error.

For the experiment on the Oxford RobotCar dataset, only the color stream with handcrafted base poses was used to evaluate the performance of the proposed method, as the depth information captured by a 2D LIDAR scanner is sparse. We used the 2014-05-14-13-59-05 sequence for training and the 2014-05-14-13-53-47 sequence for testing. To account for illuminance variation between the training and test sequences, we augmented the data by randomly changing the brightness, saturation, contrast, and hue of the images during training. The result of PoseNet-15 in Table 4 was reproduced by Tian [15] using the same network and settings as in [10]. Compared to Tian's method [15], the proposed method reduces the median translation error by about 36% and the median rotation error by about 55%.

Ablation Studies
We conducted ablation experiments to explore the effects of depth completion, handcrafted base poses, and the dual-stream network. The network architectures evaluated in the ablation studies can be divided into three types: single-stream with learning-based base poses (Figure 6a), single-stream with handcrafted base poses (Figure 6b), and dual-stream with handcrafted base poses (Figure 2). Single-stream uses only color or depth images as input, whereas dual-stream uses RGB-D images. Note that unless otherwise specified, the depth stream below takes the full depth images obtained by the depth completion algorithm as input. Table 5 shows the difference in median localization error between a single stream taking raw depth images as input and a single stream taking full depth images obtained by the depth completion algorithm as input.
The results demonstrate that the depth completion operation improves performance using learning-based or handcrafted base poses. When depth completion and handcrafted base poses are used simultaneously, the median localization error of the depth stream is significantly reduced.

Comparison between Base Poses
To select the set of handcrafted base poses that best helps the network learn features suitable for localization, we trained the color stream and the depth stream with three different sets of handcrafted base poses. Single streams with learning-based base poses were also trained to evaluate the effect of handcrafted base poses. As shown in Table 6, Min outperforms the other two sets of redundant base poses. If base poses are redundant, one camera pose can be represented by various linear combinations of base poses. As Min is the most compact form of handcrafted base poses, the correlation between its base poses is relatively weak, so it may be easier for the network to learn the relationship between the appearance of images and these base poses. We also observe from Table 6 that, regardless of which set of base poses is adopted, the depth stream outperforms the color stream in the Heads scene. As reported in Table 1, the Heads scene has the smallest average valid depth among the frames. Depth images captured at close range are more likely to provide sufficient depth variation and structural information, which is exactly what the depth stream needs. Empirical experience suggests that if the scene is mostly flat and contains no complex 3D objects, the depth values captured by the depth sensor will be inaccurate, which may be why the results of the depth stream are inferior to those of the color stream in scenes such as Chess, Pumpkin, and Stairs.
To confirm this, we evaluated the performance of the color stream and depth stream on the 12 Scenes dataset, whose average valid depth is close to that of the Heads scene. Table 7 shows that the median rotation error of the depth stream is close to that of the color stream, and the median translation error of the depth stream is smaller than that of the color stream. This shows that the average valid depth value and the presence of complex 3D objects in the scene limit the performance of the depth stream.

Comparison between Network Architectures
"Dual stream" in Tables 8 and 9 denotes the proposed method using two-stage training. On 7 Scenes, the performance of the dual-stream network is slightly better than using only the color stream or the depth stream. The results on 12 Scenes show that the dual-stream network is inferior to the single depth stream in median translation error but is the best in median rotation error.

Conclusions & Future Work
We propose a novel APR method using an RGB-D dual-stream network and handcrafted base poses. On the 7 Scenes dataset, the proposed method is among the best in median rotation error and outperforms previous APR methods in median translation error. On the more difficult Oxford RobotCar dataset, the proposed method achieves notable improvements in median translation and rotation errors compared to the state-of-the-art APR methods. The main contribution of this work is the use of handcrafted base poses instead of learning-based base poses (e.g., [10][11][12][13][14][15]) for estimating camera poses, which prevents overfitting to the camera poses of the training data.
To further improve the proposed method, a depth completion network (such as PENet [24], NLSPN [25], or FusionNet [26]) could be added to the depth stream to enable end-to-end training. In addition, we could use a weighted sum of embeddings instead of the average as the base pose coefficients to better integrate the feature embeddings of the color stream and the depth stream; the embedding weights could be obtained by learning the confidence of each stream.
If 3D point clouds are available, it may be better to use these as input instead of depth images and adopt a network designed for unordered point sets (such as PointNet [27] and PointNet++ [28]) to extract the features. There are three advantages to using 3D point clouds. First, compared to depth images, point clouds provide richer structural information. Second, since sparse point cloud data can be directly fed into the network, there is no need for depth completion, which can improve localization efficiency. Third, 3D LiDAR is usually adopted for distance measurement in outdoor scenes, so using point clouds as input can remove the dual-stream network's limitation to indoor scenes.