Variational Bayesian Approach to Condition-Invariant Feature Extraction for Visual Place Recognition

Abstract: As mobile robots perform long-term operations in large-scale environments, coping with perceptual changes has become an important issue. This paper introduces a stochastic variational inference and learning architecture that can extract condition-invariant features for visual place recognition in a changing environment. Under the assumption that the latent representation of a variational autoencoder can be divided into condition-invariant and condition-sensitive features, a new variational autoencoder structure is proposed and a variational lower bound is derived to train the model. After training, condition-invariant features are extracted from test images to compute a similarity matrix, so that places can be recognized even under severe environmental changes. Experiments were conducted to verify the proposed method, and the results showed that our assumption was reasonable and effective for recognizing places in changing environments.


Introduction
Autonomous robots operating over long periods, such as days, weeks, or months, face a variety of environmental changes. Recognizing places with visual sensors despite these changes is called long-term visual place recognition. It is an essential component for achieving long-term simultaneous localization and mapping (SLAM) and autonomous navigation [1]. One of the major problems in long-term visual place recognition is the appearance change caused by factors such as time of day or weather conditions [2].
To solve the appearance change problem in visual place recognition, global descriptors that describe the whole image have been widely used [3,4]. Compared to local features such as SIFT [5] and SURF [6], global descriptors are not only robust to illumination changes but also require less computation, since they do not need a keypoint detection phase [1]. Classic hand-crafted global descriptors such as HOG [7] or gist [8,9] showed higher place recognition performance than existing local descriptors in a changing environment [3,4]. However, hand-crafted descriptors are inherently limited in generalization performance, since features are extracted according to predefined parameters.
Recently, features from deep learning structures have been shown to generalize better than existing hand-crafted methods. In particular, the deep convolutional neural network (CNN) has shown excellent performance in image recognition and classification [10]. A variety of CNN-based structures have been widely used in visual place recognition [11][12][13][14]. A sequence of CNN image features was used to find the same places across different seasons in [15]. Sünderhauf et al. evaluated CNN features from each layer of a pretrained AlexNet [10] for visual place recognition in a changing environment. Another deep learning structure, the autoencoder, has also been used for visual place recognition because the output of each layer can be used as an image descriptor. Oh and Lee [16] used a deep convolutional autoencoder (CAE) for feature extraction, and Park [17] proposed an illumination-compensated CAE for robust place recognition.
In this paper, we propose a novel feature extraction method based on variational autoencoders (VAEs) [18]. The VAE is a popular model for unsupervised representation learning and has shown outstanding performance in feature learning [19,20]. It builds on the standard autoencoder structure and can approximate Bayesian inference for latent variable models. To obtain robust performance in a changing environment, we assume that the image $x$ is generated from the latent variable $z$, and that this latent representation is divided into the condition-invariant feature $z_p$ and the condition-sensitive feature $z_c$. Comparing only the condition-invariant features then improves place recognition across different conditions. The proposed procedure is shown in Figure 1. Our paper is organized as follows. Section 2 explains the preliminaries of VAEs. The proposed structures for feature extraction using context information are explained in Section 3. Robot localization using the extracted condition-invariant feature is discussed in Section 4. Section 5 validates the proposed method on publicly available datasets and compares it with other algorithms. Finally, Section 6 concludes the paper.

Preliminaries
Let us consider a dataset $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ consisting of $N$ images. The generative model assumes that the observed images are generated by some stochastic process involving an unobserved random variable $z$. Specifically, the latent representation $z^{(i)}$ is generated from a prior distribution $p(z)$, and the image $x^{(i)}$ is generated from a conditional distribution $p_\theta(x|z)$, where $\theta$ is the generative model parameter.
To efficiently approximate posterior inference of the latent variable $z$ given an observed value $x$, a recognition model $q_\varphi(z|x)$ is introduced, where $\varphi$ is the recognition model parameter. This model approximates the intractable true posterior $p_\theta(z|x)$ and is also referred to as a probabilistic encoder. Instead of encoding an input image $x$ as a single vector, the encoder produces a probability distribution of the compressed feature $z$ over the latent space. Similarly, $p_\theta(x|z)$ is a probabilistic decoder, since given a latent feature $z$ it produces a probability distribution over the possible corresponding values of $x$.
The VAE is a structure that implements the encoder $q_\varphi(z|x)$ and the decoder $p_\theta(x|z)$ as neural networks, as shown in Figure 2. The parameters $\varphi$ and $\theta$ then become the weights of the networks. The objective is to find the $\varphi$ and $\theta$ that maximize the variational lower bound $\mathcal{L}(\theta, \varphi; x)$ on the marginal likelihood [18]:
$$\mathcal{L}(\theta, \varphi; x) = -D_{KL}\left(q_\varphi(z|x) \,\|\, p_\theta(z)\right) + \mathbb{E}_{q_\varphi(z|x)}\left[\log p_\theta(x|z)\right]$$
where $D_{KL}(\cdot)$ stands for the Kullback-Leibler divergence, which measures the difference between two probability distributions. The objective function consists of a regularization term and a reconstruction likelihood. The prior distribution $p_\theta(z)$ is usually set to a Gaussian distribution so that the reparameterization trick can be used to train the network [18]. After training, the VAE can compress an input image into the low-dimensional latent vector $z$. Since the encoded vector $z$ contains the information of the whole input image, it can be used as a global descriptor for comparing similarities between images [19].
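The two ingredients named above, the Kullback-Leibler regularizer and the reparameterization trick, can be illustrated with a minimal sketch. It assumes a diagonal Gaussian encoder and a standard normal prior, which is the usual choice in [18] but is not spelled out in this paper. The function names and the toy values of `mu` and `log_var` are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z is a
    deterministic function of (mu, log_var) plus external noise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def lower_bound(log_likelihood, mu, log_var):
    """Single-sample estimate of the bound: reconstruction term minus KL."""
    return log_likelihood - kl_to_standard_normal(mu, log_var)


# Toy encoder outputs for a 2-dimensional latent space (assumed values).
rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -0.5]
z = reparameterize(mu, log_var, rng)
print(len(z), kl_to_standard_normal(mu, log_var))
```

Parameterizing the encoder by the log-variance rather than the variance is a common design choice, since it keeps the variance positive without constraining the network output.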

Proposed VAE Using Context Information
Although the compressed vector $z$ can serve as a useful global descriptor, it is insufficient to cope with environmental changes. To match images of the same place taken under different conditions, external factors such as weather or seasonal changes should be removed from the vector $z$. If the vector $z$ is divided into the condition-invariant feature $z_p$ and the condition-sensitive feature $z_c$, we can reliably distinguish places even in changing environments using only the condition-invariant feature $z_p$.
To achieve this goal, we assume that observed images are affected by both structural information $p$, such as unique landmarks, and context information $c$ arising from environmental changes such as lighting or weather. Since structural information is robust to environmental changes and context information is sensitive to them, the two kinds of information are captured by the condition-invariant feature $z_p$ and the condition-sensitive feature $z_c$, respectively. To divide the latent feature $z$ into $z_p$ and $z_c$, we propose a structure that generates the context vector $c$ from $z_c$ and the image from both $z_p$ and $z_c$. The generative model therefore changes from $p_\theta(x|z)$ to $p_{\theta,\phi}(x, c|z_p, z_c)$ and is factorized as follows:
$$p_{\theta,\phi}(x, c \mid z_p, z_c) = p_\theta(x \mid z_p, z_c)\, p_\phi(c \mid z_c)$$
where $\theta$ and $\phi$ are the parameters of the generative model that generate $x$ and $c$, respectively. The comparison between the existing and proposed generative models is shown in Figure 3. The variational lower bound on the marginal likelihood is modified accordingly from $\mathcal{L}(\theta, \varphi; x)$ to $\mathcal{L}(\theta, \varphi, \phi; x, c)$:
$$\mathcal{L}(\theta, \varphi, \phi; x, c) = -D_{KL}\left(q_\varphi(z_p, z_c|x) \,\|\, p_\theta(z_p, z_c)\right) + \mathbb{E}_{q_\varphi(z_p, z_c|x)}\left[\log p_\theta(x|z_p, z_c) + \log p_\phi(c|z_c)\right]$$
The proposed structure for learning these distributions, named C-VAE, is shown in Figure 4. The encoding part serves as the inference model $q_\varphi(z_p, z_c|x)$, and the decoding part comprises the generative models $p_\theta(x|z_p, z_c)$ and $p_\phi(c|z_c)$. Compared with the existing VAE, this structure has the following characteristics. The reconstruction of the input image $x$ is the same as in the existing structure. The difference is that $z_c$, a subset of $z$, is used not only to reconstruct $x$ but also to generate the context vector $c$. During learning, information that is sensitive to environmental influences is concentrated in $z_c$, and condition-invariant information is compressed into $z_p$. Therefore, $z_p$ can be used as an image feature that is robust to environmental changes.
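As a rough illustration of how the C-VAE objective combines its terms, the following sketch evaluates a single-sample lower bound from already-computed network outputs. Several points are assumptions rather than details from the paper: the image likelihood is taken as a unit-variance Gaussian (up to an additive constant), the context likelihood as a softmax over a one-hot context vector, the prior as a standard normal, and $z_p$ is placed before $z_c$ in the latent vector. All function names are hypothetical.

```python
import math


def split_latent(z, dim_p):
    """Split z into (z_p, z_c); assumes z_p occupies the first dim_p entries."""
    return z[:dim_p], z[dim_p:]


def gaussian_kl(mu, log_var):
    """Closed-form KL between N(mu, diag(exp(log_var))) and N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def image_log_likelihood(x, x_hat):
    """log p(x | z_p, z_c) for a unit-variance Gaussian decoder,
    dropping the additive constant (an assumption, not from the paper)."""
    return -0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))


def context_log_likelihood(logits, c_onehot):
    """log p(c | z_c) for a softmax head over a one-hot context vector."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return sum(c * (v - log_norm) for c, v in zip(c_onehot, logits))


def c_vae_lower_bound(x, x_hat, context_logits, c_onehot, mu, log_var):
    """Reconstruction term + context term - KL regularizer."""
    return (image_log_likelihood(x, x_hat)
            + context_log_likelihood(context_logits, c_onehot)
            - gaussian_kl(mu, log_var))


# Toy values: a 2-pixel "image", 4 context classes, 4 latent dimensions.
z_p, z_c = split_latent([0.1, 0.2, 0.3, 0.4], dim_p=3)
bound = c_vae_lower_bound(
    x=[0.5, 0.5], x_hat=[0.5, 0.5],
    context_logits=[0.0, 0.0, 0.0, 0.0], c_onehot=[1, 0, 0, 0],
    mu=[0.0] * 4, log_var=[0.0] * 4)
print(z_p, z_c, bound)
```

In a real model, `x_hat` would come from a decoder fed with both $z_p$ and $z_c$, while `context_logits` would come from a head fed with $z_c$ only. That asymmetry is what pushes condition-related information into $z_c$.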
If structural information $p$ is available in addition to context information $c$, we propose a model named CP-VAE, shown in Figure 5, which improves the independence between $z_p$ and $z_c$ compared with the previous model. The generative model is modified to $p_{\theta,\phi,\psi}(x, p, c|z_p, z_c)$ and factorized as follows:
$$p_{\theta,\phi,\psi}(x, p, c \mid z_p, z_c) = p_\theta(x \mid z_p, z_c)\, p_\psi(p \mid z_p)\, p_\phi(c \mid z_c)$$
where $\theta$, $\psi$, and $\phi$ are the parameters of the generative model that generate $x$, $p$, and $c$, respectively. The variational lower bound is also modified as follows:
$$\mathcal{L}(\theta, \varphi, \phi, \psi; x, p, c) = -D_{KL}\left(q_\varphi(z_p, z_c|x) \,\|\, p_\theta(z_p, z_c)\right) + \mathbb{E}_{q_\varphi(z_p, z_c|x)}\left[\log p_\theta(x|z_p, z_c) + \log p_\psi(p|z_p) + \log p_\phi(c|z_c)\right]$$
The difference from the previous model is that $z_p$ generates not only the image $x$ but also the position information vector $p$. Since $z_p$ generates the structural information vector $p$, the independence between $z_p$ and $z_c$ is enhanced, and a more robust condition-invariant feature $z_p$ can be extracted to recognize places under substantial environmental changes. However, this model is limited by a fairly strong assumption: the training data must be aligned to the same places in order to obtain the position vector $p$.

Robot Localization Using Condition-Invariant Features
After training the model, the encoding part of the proposed structure can be used to extract the condition-invariant feature $z_p$ from an image. Given two image sequences ${}^{u}X = \{{}^{u}x^{(1)}, {}^{u}x^{(2)}, \ldots, {}^{u}x^{(M)}\}$ and ${}^{v}X = \{{}^{v}x^{(1)}, {}^{v}x^{(2)}, \ldots, {}^{v}x^{(N)}\}$ from different environments $u$ and $v$, we can extract the feature sequences ${}^{u}Z = \{{}^{u}z_p^{(1)}, {}^{u}z_p^{(2)}, \ldots, {}^{u}z_p^{(M)}\}$ and ${}^{v}Z = \{{}^{v}z_p^{(1)}, {}^{v}z_p^{(2)}, \ldots, {}^{v}z_p^{(N)}\}$, respectively. The similarity matrix $S \in \mathbb{R}^{M \times N}$ can then be constructed from the affinity scores between features. Each element of $S$ is the affinity score $s_{ij}$ between ${}^{u}z_p^{(i)}$ and ${}^{v}z_p^{(j)}$, where $1 \le i \le M$ and $1 \le j \le N$. It is calculated using the cosine similarity as follows:
$$s_{ij} = \frac{{}^{u}z_p^{(i)} \cdot {}^{v}z_p^{(j)}}{\left\| {}^{u}z_p^{(i)} \right\| \left\| {}^{v}z_p^{(j)} \right\|}$$
The affinity score $s_{ij}$ takes a value in $[0, 1]$, and the closer it is to 1, the higher the probability that the two images show the same place. From the similarity matrix $S$, we can find the correspondence between the query sequence ${}^{v}X$ and the database sequence ${}^{u}X$, and thereby recognize the location of the mobile robot.
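The construction of the similarity matrix can be sketched as follows, assuming the condition-invariant features have already been extracted as plain lists of floats. The helper names and toy feature vectors are hypothetical. Note that plain cosine similarity lies in $[-1, 1]$ for arbitrary vectors; the text states a range of $[0, 1]$, and whether the features are constrained or the score is rescaled to achieve this is not specified here, so the sketch applies no rescaling.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def similarity_matrix(db_feats, query_feats):
    """S[i][j] = affinity between database feature i and query feature j."""
    return [[cosine_similarity(u, v) for v in query_feats] for u in db_feats]


def best_matches(S):
    """For each query column j, return the database row with the highest score."""
    return [max(range(len(S)), key=lambda i: S[i][j])
            for j in range(len(S[0]))]


# Toy 2-dimensional features: M = 3 database images, N = 2 query images.
db = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [[0.9, 0.1], [0.1, 0.8]]
S = similarity_matrix(db, query)
print(best_matches(S))
```

Taking the row-wise maximum per query is only the simplest way to read correspondences off $S$; sequence-based matching over several consecutive frames is a common alternative.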

Experimental Results
In this section, we describe the experiments performed to verify the performance of the proposed algorithm. We used the Nordland dataset [21,22], which comprises images of all seasons from four journeys on a 728 km train route across Norway, and the KAIST dataset [23], which includes six sequences in various illumination conditions: day, night, sunset, and sunrise. Both are challenging datasets widely used for long-term place recognition because images across sequences show drastic appearance changes. In each sequence, 1600 images were used for training and 6400 images were used for testing. All images were resized to 224 × 224 pixels.
The output shape of the encoding part of our model is shown in Table 1. To compress the data effectively, several convolutional and fully connected layers were used. The output of the sampling layer is the latent feature $z$, which is divided into $z_p$ with 96 nodes and $z_c$ with 32 nodes. The decoding part includes a part that reconstructs the input image $x$ from $z_p$ and $z_c$, as in a typical VAE, and a part that generates the context vector $c$ from $z_c$. Since the dataset has four seasons, the context vector $c$ is defined as a four-dimensional one-hot vector.
The first experiment is a visualization test to confirm whether the model has been trained to make $z_p$ and $z_c$ independent, as intended. Let ${}^{u}x$ and ${}^{v}x$ be images obtained from different environments $u$ and $v$, respectively. Using the encoder of the trained model, we can extract the latent features ${}^{u}z = \{{}^{u}z_p, {}^{u}z_c\}$ and ${}^{v}z = \{{}^{v}z_p, {}^{v}z_c\}$ from the two images. Since the image reconstructed by the decoder is mainly affected by the condition-sensitive feature $z_c$ rather than the condition-invariant feature $z_p$, we can expect the image reconstructed from the combined feature $\{{}^{u}z_p, {}^{v}z_c\}$ to resemble ${}^{v}x$. The results of combining $z_p$ and $z_c$ extracted from images of each sequence are shown in Figures 6 and 7. As the reconstructed images in Figure 6 show, changing $z_p$ causes no significant change, whereas changing $z_c$ produces images of different seasons. Similarly, in Figure 7, changing $z_p$ makes no significant difference, whereas changing $z_c$ produces images at different times of day. We can conclude that the environmental information is compressed in $z_c$, since the reconstructed image changes under the influence of $z_c$ rather than $z_p$.
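The feature-swapping step of the visualization test can be sketched as below. The sizes of 96 and 32 nodes and the four-dimensional one-hot context vector come from the text. The ordering of $z_p$ before $z_c$ inside $z$, the helper names, and the placeholder latent values are assumptions, and the trained decoder that would turn the combined feature into an image is not shown.

```python
DIM_P, DIM_C = 96, 32  # sizes of z_p and z_c as stated in the text


def split_latent(z):
    """Split a 128-dim latent vector into (z_p, z_c).
    Assumption: z_p occupies the first 96 entries."""
    assert len(z) == DIM_P + DIM_C
    return z[:DIM_P], z[DIM_P:]


def combine(z_p, z_c):
    """Concatenate a place feature and a condition feature into one latent vector."""
    return list(z_p) + list(z_c)


def one_hot(index, size=4):
    """Context vector c, e.g. one entry per season."""
    return [1 if k == index else 0 for k in range(size)]


# Placeholder latent vectors for images from environments u and v.
z_u = [0.1] * (DIM_P + DIM_C)
z_v = [-0.2] * (DIM_P + DIM_C)

u_p, u_c = split_latent(z_u)
v_p, v_c = split_latent(z_v)

# Combined feature {u z_p, v z_c}: place content of u, condition of v.
swapped = combine(u_p, v_c)
print(len(swapped), one_hot(2))
```

Passing `swapped` through the trained decoder is what produces the mixed reconstructions discussed above.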
Because $z_c$ plays a significant role in reconstructing the image, similar images should be generated whenever the same $z_c$ is used. In other words, if we define ${}^{o}z_c$ as a constant vector, $\{{}^{u}z_p, {}^{o}z_c\}$ and $\{{}^{v}z_p, {}^{o}z_c\}$ will both reconstruct a condition-invariant image ${}^{o}x$, since the image is mainly affected by the condition-sensitive feature ${}^{o}z_c$. The resulting condition-invariant images are shown in Figures 8 and 9. As expected, similar images are generated regardless of changes in time or season when the same ${}^{o}z_c$ is used. The visualization results show that the independence assumption between $z_p$ and $z_c$ is reasonable, because the reconstructed images are mainly influenced by the condition-sensitive feature $z_c$. Therefore, we can conclude that our model can extract the condition-invariant feature $z_p$ and use it to perform robust place recognition in changing environments.
To compare place recognition performance, a precision-recall analysis was conducted by applying various thresholds to the values of the similarity matrix. We compared the proposed methods, C-VAE (VAE+C) and CP-VAE (VAE+C+P), with the sum of absolute differences (SAD) [24], FAB-MAP [25], AlexNet [10], and VGG19 [26]. The precision-recall results are shown in Figures 10 and 11. They show that the proposed CP-VAE outperformed the other methods in most cases. Existing hand-crafted approaches such as SAD and FAB-MAP proved unsuitable for place recognition in a changing environment. Pre-trained deep learning models such as AlexNet and VGG19 showed reasonable performance in various situations. However, their performance degraded when the environmental changes between images were substantial, such as between winter images with snow and images from other seasons without snow. This is a serious weakness of pre-trained models from the viewpoint of securing stability for long-term robot operation. Since the proposed method recognizes a place using condition-invariant features, it maintains high performance even in these cases. The precision-recall analysis thus verified the place recognition performance of the proposed method in a changing environment.
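A precision-recall point for a given threshold can be computed from a similarity matrix as in the following sketch. The matching rule is an assumption, since the text only says that thresholds were applied to the similarity values: each query takes its highest-scoring database image, the match is accepted if the score reaches the threshold, and the ground truth is that database frame $i$ and query frame $i$ show the same place (aligned sequences). The function name and the toy matrix are hypothetical.

```python
def precision_recall(S, threshold):
    """Precision and recall for one threshold on similarity matrix S.

    Assumptions (not from the paper): S[i][j] compares database frame i
    with query frame j, the true match of query j is database frame j,
    and only the best-scoring candidate per query is considered.
    """
    tp = fp = fn = 0
    for j in range(len(S[0])):
        i_best = max(range(len(S)), key=lambda i: S[i][j])
        if S[i_best][j] >= threshold:
            if i_best == j:
                tp += 1      # accepted and correct
            else:
                fp += 1      # accepted but wrong place
        else:
            fn += 1          # rejected although a true match exists
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy 3x3 similarity matrix; query 2 is best matched to the wrong frame.
S = [[0.9, 0.2, 0.1],
     [0.3, 0.6, 0.7],
     [0.1, 0.5, 0.4]]

for t in (0.5, 0.65, 0.8):
    print(t, precision_recall(S, t))
```

Sweeping the threshold from low to high traces out the curve: a low threshold accepts every best match (high recall, lower precision), while a high threshold keeps only the most confident matches.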

Conclusions
Variational Bayesian methods can perform efficient inference and learning on large datasets in the presence of continuous latent variables with intractable posterior distributions. We introduced a stochastic variational inference and learning architecture that can extract condition-invariant features. Under the assumption that the latent representation of a variational autoencoder can be divided into condition-invariant and condition-sensitive features, a new variational autoencoder structure was proposed and a variational lower bound was derived to train the model. After training, condition-invariant features are extracted from test images to calculate the similarity between them, so that places can be recognized even under severe environmental changes. Experimental results showed that our assumption was reasonable, and the validity of the proposed method was verified by the precision-recall analysis. In future work, it will be necessary to develop a method that can be applied even when several environmental factors are mixed. For example, a place recognition method that is robust to both seasonal and weather changes would allow the robot to operate in a wider variety of environmental conditions.

Conflicts of Interest:
The authors declare no conflicts of interest.
