
A Recurrent Deep Network for Estimating the Pose of Real Indoor Images from Synthetic Image Sequences

Department of Infrastructure Engineering, The University of Melbourne, Parkville, Victoria 3010, Australia
Department of Manufacturing, Materials and Mechatronics, RMIT University, Carlton, Victoria 3053, Australia
Institute for Sustainable Industries and Livable Cities, Victoria University, Werribee, Victoria 3030, Australia
Author to whom correspondence should be addressed.
Current address: Room B310, Building 175, The University of Melbourne, Parkville, Victoria 3010, Australia.
Sensors 2020, 20(19), 5492;
Received: 25 August 2020 / Revised: 15 September 2020 / Accepted: 22 September 2020 / Published: 25 September 2020
(This article belongs to the Collection Positioning and Navigation (Closed))
Recently, deep convolutional neural networks (CNNs) have become popular for indoor visual localisation, where a network learns to regress the camera pose directly from images. However, these approaches require a 3D image-based reconstruction of the indoor space beforehand to determine the camera poses, which is challenging for large indoor spaces. Synthetic images derived from 3D indoor models have been used to eliminate the requirement of 3D reconstruction. A limitation of this approach is the low accuracy that results from estimating the pose of each image frame independently. In this article, a visual localisation approach is proposed that exploits the spatio-temporal information in synthetic image sequences to improve localisation accuracy. A deep Bayesian recurrent CNN is fine-tuned using synthetic image sequences obtained from a building information model (BIM) to regress the pose of real image sequences. The experimental results indicate that the proposed approach estimates a smoother trajectory with a smaller inter-frame error than existing methods. The achievable accuracy of the proposed approach is 1.6 m, an improvement of approximately thirty per cent over existing approaches. A Keras implementation can be found in our GitHub repository.
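The inter-frame error mentioned in the abstract can be quantified in several ways; as an illustration only (the exact metric is defined in the article itself), the following numpy sketch compares the per-frame displacements of an estimated trajectory against the ground truth, so that a smooth, consistent estimate scores lower than a jittery one. The function name, trajectory shapes and noise model here are hypothetical.

```python
import numpy as np

def inter_frame_error(est, gt):
    """Mean difference between the consecutive-frame displacements of an
    estimated trajectory and those of the ground truth (both N x 3 positions,
    in metres). A smooth, consistent estimate yields a small value."""
    d_est = np.diff(est, axis=0)  # per-frame displacement vectors
    d_gt = np.diff(gt, axis=0)
    return float(np.mean(np.linalg.norm(d_est - d_gt, axis=1)))

# Ground truth: a straight 30 m corridor walk sampled at 0.05 m steps,
# camera held at 1.6 m height.
gt = np.column_stack([np.arange(0, 30, 0.05),
                      np.zeros(600),
                      np.full(600, 1.6)])
rng = np.random.default_rng(0)
noisy = gt + rng.normal(scale=0.2, size=gt.shape)  # jittery estimate
smooth = gt + 0.01                                  # constant offset only
print(inter_frame_error(noisy, gt) > inter_frame_error(smooth, gt))  # True
```

Note that a constant positional offset leaves the inter-frame error near zero: the metric rewards trajectory smoothness rather than absolute accuracy, which is why both are reported separately.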
Keywords: indoor localisation; camera pose regression; 3D building models; long short-term memory

Figure 1

  • Externally hosted supplementary file 1
    Doi: 10.26188/5dd8b8085b191
    Description: This data-set is supplementary material related to the generation of synthetic images of a corridor at the University of Melbourne, Australia, from a building information model (BIM). The data-set was generated to test the ability of deep learning algorithms to learn the task of indoor localisation from synthetic images when tested on real images.

    The following naming convention is used for the data-sets; the numbers in brackets give the number of images in each data-set.

    REAL DATA
    Real ----------------> real images (949 images)
    Gradmag-Real --------> gradient magnitude of the real images (949 images)

    SYNTHETIC DATA
    Syn-Car -------------> cartoonish images (2500 images)
    Syn-pho-real --------> synthetic photo-realistic images (2500 images)
    Syn-pho-real-tex ----> synthetic photo-realistic textured images (2500 images)
    Syn-Edge ------------> edge render images (2500 images)
    Gradmag-Syn-Car -----> gradient magnitude of the cartoonish images (2500 images)

    Each folder contains the images and their respective ground-truth poses in the format [ImageName X Y Z w p q r].

    To generate the synthetic data-set, we define a trajectory in the 3D indoor model; the points on the trajectory serve as the ground-truth poses of the synthetic images. The height of the trajectory was kept in the range of 1.5–1.8 m above the floor, the usual height at which a camera is held in the hand. Artificial point light sources were placed to illuminate the corridor (except for the edge render images). The length of the trajectory was approximately 30 m. A virtual camera was moved along the trajectory to render four different sets of synthetic images in Blender*. The intrinsic parameters of the virtual camera were kept identical to those of the real camera (VGA resolution, focal length of 3.5 mm, no distortion modelled).
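The ground-truth format described above ([ImageName X Y Z w p q r], i.e. a position followed by an orientation quaternion) is straightforward to parse; a minimal Python sketch, with a hypothetical file name and function name, might look like:

```python
import numpy as np

def load_poses(path):
    """Parse a ground-truth file whose lines read 'ImageName X Y Z w p q r'
    into a dict mapping image name -> (position, quaternion)."""
    poses = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 8:
                continue  # skip blank or malformed lines
            name = parts[0]
            values = [float(v) for v in parts[1:]]
            # First three values: position; last four: quaternion (w p q r).
            poses[name] = (np.array(values[:3]), np.array(values[3:]))
    return poses

# Hypothetical usage:
# poses = load_poses("Syn-pho-real/groundtruth.txt")
```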
    Images were rendered along the trajectory at 0.05 m intervals and with ±10° tilt. The main difference between the cartoonish images (Syn-Car) and the photo-realistic images (Syn-pho-real) is the rendering model: photo-realistic rendering is a physics-based model that traces the paths of light rays through the scene, similar to the real world, whereas cartoonish rendering traces the paths of light rays only roughly. The photo-realistic textured images (Syn-pho-real-tex) were rendered after adding repeating synthetic textures, such as brick, carpet and wooden ceiling, to the 3D indoor model. The realism of photo-realistic rendering comes at the cost of rendering time; however, the rendering times for the photo-realistic data-sets were considerably reduced with the help of a GPU. Note that the naming convention used for the data-sets (e.g. cartoonish) follows Blender terminology.

    An additional data-set (Gradmag-Syn-Car) was derived from the cartoonish images by taking the edge gradient magnitude of each image and suppressing weak edges below a threshold. The edge rendered images (Syn-Edge) were generated by rendering only the edges of the 3D indoor model, without taking the lighting conditions into account. This data-set is similar to the Gradmag-Syn-Car data-set but does not contain the effects of scene illumination, such as reflections and shadows.

    *Blender is an open-source 3D computer graphics software that finds applications in video games, animated films, simulation and visual art. For more information please visit:

    Please cite the following papers if you use the data-set:
    1) Acharya, D., Khoshelham, K. and Winter, S., 2019. BIM-PoseNet: Indoor camera localisation using a 3D indoor model and deep learning from synthetic images. ISPRS Journal of Photogrammetry and Remote Sensing, 150: 245-258.
    2) Acharya, D., Singha Roy, S., Khoshelham, K. and Winter, S., 2019. Modelling uncertainty of single image indoor localisation using a 3D model and deep learning. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, IV-2/W5, pages 247-254.
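As a rough illustration of how the Gradmag data-sets could be produced (the supplementary text does not specify the exact gradient operator or threshold, so the choices below are assumptions), the following numpy sketch computes an edge gradient magnitude and suppresses weak edges below a threshold; OpenCV's Sobel filter would be a common alternative to `np.gradient`.

```python
import numpy as np

def gradmag_suppressed(image, threshold=0.1):
    """Edge gradient magnitude of a greyscale image (float array in [0, 1]),
    with weak edges below `threshold` suppressed to zero."""
    gy, gx = np.gradient(image.astype(float))  # per-axis central differences
    mag = np.hypot(gx, gy)                     # gradient magnitude
    mag[mag < threshold] = 0.0                 # suppress weak edges
    return mag

# A vertical step edge produces a response only along the boundary:
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = gradmag_suppressed(img, threshold=0.1)
```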
MDPI and ACS Style

Acharya, D.; Singha Roy, S.; Khoshelham, K.; Winter, S. A Recurrent Deep Network for Estimating the Pose of Real Indoor Images from Synthetic Image Sequences. Sensors 2020, 20, 5492.

AMA Style

Acharya D, Singha Roy S, Khoshelham K, Winter S. A Recurrent Deep Network for Estimating the Pose of Real Indoor Images from Synthetic Image Sequences. Sensors. 2020; 20(19):5492.

Chicago/Turabian Style

Acharya, Debaditya, Sesa Singha Roy, Kourosh Khoshelham, and Stephan Winter. 2020. "A Recurrent Deep Network for Estimating the Pose of Real Indoor Images from Synthetic Image Sequences" Sensors 20, no. 19: 5492.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.
