Visual Odometry and Place Recognition Fusion for Vehicle Position Tracking in Urban Environments

In this paper, we address the problem of vehicle localization in urban environments. We rely on visual odometry, calculating the incremental motion, to track the position of the vehicle and on place recognition to correct the accumulated drift of visual odometry, whenever a location is recognized. The algorithm used as a place recognition module is SeqSLAM, addressing challenging environments and achieving quite remarkable results. Specifically, we perform the long-term navigation of a vehicle based on the fusion of visual odometry and SeqSLAM. The template library for this latter is created online using navigation information from the visual odometry module. That is, when a location is recognized, the corresponding information is used as an observation of the filter. The fusion is done using the EKF and the UKF, the well-known nonlinear state estimation methods, to assess the superior alternative. The algorithm is evaluated using the KITTI dataset and the results show the reduction of the navigation errors by loop-closure detection. The overall position error of visual odometery with SeqSLAM is 0.22% of the trajectory, which is much smaller than the navigation errors of visual odometery alone 0.45%. In addition, despite the superiority of the UKF in a variety of estimation problems, our results indicate that the UKF performs as efficiently as the EKF at the expense of an additional computational overhead. This leads to the conclusion that the EKF is a better choice for fusing visual odometry and SeqSlam in a long-term navigation context.


Introduction
Autonomous vehicles have recently received great attention in the robotics, intelligent transportation, and artificial intelligence communities. Accurate estimation of a vehicle's location is a key capability to realizing autonomous operation. Currently, the leading technology in this setting is GPS receivers to estimate their absolute, georeferenced pose. However, most commercial GPS systems suffer from limited precision and are sensitive to multipath effects (e.g., in the so-called "urban canyons" formed by tall buildings), which can introduce significant biases that are difficult to detect in addition to sometimes being unavailable (e.g., in tunnels). To provide alternatives to GPS localization, many sensory devices have been applied to carry out the localization task, including visual sensors that are often desirable due to their low-cost, wide availability and passive nature.
Vision-based localization techniques fall into several broad categories including real-time Structure from motion (SfM) or Visual Odometry (VO) and Place Recognition. While VO methods calculate the egomotion by incrementally estimating the rotation and translation undergone by the vehicle using only the input of a single or multiple cameras, Place Recognition methods are based on learning a database of images (the map) and the vehicle consecutively tries to find matchings between this database and the actual visual input (query image(s)). Within the place recognition paradigm, numerous research papers have addressed visual appearance-based place recognition by ground vehicles especially in varying environments [1][2][3][4][5]. A leading method is SeqSLAM aiming at matching image sequences under strong seasonal and illumination changes through the computation of image-by-image dissimilarity scores between all query and database images. Despite its performance under strong changes, SeqSLAM tends to fail in correctly matching images when dealing with changes in viewpoint that can be found between them [6].
Hence, our motivation is to create a vision-driven localization method in urban environments that uses cues from SeqSLAM and VO. Our method is able to localize vehicles in the global reference system if a georeferenced database of images has been learned beforehand. However, in this paper, we address the problem of long-term navigation where loop closures, referring to when the vehicle has returned to a past location after having discovered new terrain for a while, are used to reduce the drift caused by VO. Such detection makes it possible to increase the precision of the actual pose estimate. To this end, we present the integration of the robust sequence-based recognition capabilities of the SeqSLAM system with the accurate 3D metric properties of monocular VO in a probabilistic framework through the use of the Extended Kalman Filter (EKF), the efficient recursive estimator [7] and the unscented Kalman filter (UKF). Keeping power-, space-and weight-constrained applications in mind, we prefer to avoid additional sensors and to utilize only visual cues available in monocular sequences. In fact, despite years of research, monocular-based-localization systems are still an exciting open problem.
The paper proceeds as follows: Section 2 presents some related work. Section 3 summarises the SeqSLAM's main components and its improved version. Then, our approach details including the position tracking, the uncertainty estimation and the fusion through the filtering techniques are presented in Section 4. Finally, Section 5 presents the results, discusses the outcomes of this paper and presents suggestions for future work.

Related Work
Several approaches have been developed in the past few years, which frame self-localization as a retrieval task. Towards this goal, multiple representations of the world have been adopted, namely, visual Bag-of-Words (BoW), visual features and 3D point clouds. BoW representation has been first used by the popular algorithm FAB-MAP [8] that relies on extracting scale-invariant image keypoints and descriptors. The descriptor vectors are then quantized using a dictionary trained on prior data. Fab-Map achieves robust image recall performance for outdoor image sequences up to 1000 km [9]. Other approaches dealing with visual features either use a trained georeferenced map of simple visual 3D features [10], or visual features from 3D building geometry [11]. In the case of 3D point clouds, the localization is performed by retrieving point clouds which are similar to the current scene [12].
However, the success of visual features-based approaches is very dependent on the quality of the visual vocabulary, and in turn on the prior data and the reliability of extracting the same visual keypoints and descriptors in images with similar viewpoints. The latter is particularly problematic when there are large lighting variations and scene appearance changes due either to seasonal changes or to day and night cycles. Significant performance improvements over Fab-MAP under extreme lighting and atmospheric variations have been achieved within the SeqSLAM system, which relies on the use of a whole image regardless of its content by searching in a pre-learnt database of images for the most similar sequence to the query sequence [1,2]. SeqSLAM proved its success with very low resolution imagery [2,13], on large-scale environments of 3000 km over four seasons [6], with UV imagery [4] and with ConvNet features [5]. However, one of SeqSLAM's most significant drawbacks is its lack of viewpoint invariance inherited from the use of global image matching.
On the other hand, when the initial position is known, self-localization is achieved by visual odometry that yields relative motion estimates, integrated to obtain an estimate of the vehicle's current position. High accuracy has been achieved using stereo-based SfM systems [14]. Good performance of monocular systems has also been demonstrated on the KITTI visual odometry benchmark [15], for example in [16,17], but the incremental nature of these methods inevitably leads to drift. Simultaneous Localization and Mapping (SLAM) methods attempt to reduce this drift by using landmarks and jointly optimizing over all or a selection of poses and landmarks [18,19]. Drift can be further reduced by revisiting places several times and detecting loop closures in the traveled trajectory [20].
However, SLAM methods suffer from issues in terms of speed and map size which limit their application at large scales. Even though efficient optimization strategies using either incremental sparse matrix factorization [21] or relative representations [22] have been used, several recent works have instead relied on publicly available maps [23,24], road networks [25] or satellite images [26] to address the localization of ground vehicles.
Other approaches are based on the fusion of two or multiple algorithms. In fact, data from different algorithms might be combined to derive a more accurate estimate of the vehicle's pose, as their uncertainties might be complementary. Illustrated in [27] is the fusion of visual odometry, used to track the position of a camera-equipped Micro Aerial Vehicle (MAV) flying in urban streets with an air-ground image matching algorithm using a cadastral 3D city model by means of a Kalman filter. Additionally, our approach is based on the fusion of VO and the appearance-based place recognition algorithm SeqSLAM. We adapted SeqSLAM for use in the context of long-term navigation for loop closure detection in order to reduce the drift caused by VO. The database of images is, therefore, created online. However, in the case of learning a geo-referenced database beforehand, the method remains valid and more accurate estimates could be obtained in the global reference frame without assuming a loopy trajectory.

Appearance-Based Global Positioning System
In this section, we briefly present the main components of the state-of-the-art appearance-based global positioning algorithm SeqSLAM and then show some components of the Sequence Matching Across Route Traversals (SMART) system, built upon SeqSLAM to overcome some of its limitations. The SeqSLAM algorithm, using a simple whole-image comparison with Sum of Absolute Differences (SAD), demonstrated impressive place recognition performance across significant condition variance such as seasonal changes and day to night transitions where feature-based methods failed entirely and in the case of low quality imagery (low resolution, low depth, and image blur). In order to increase the discriminative nature of the observation and to avoid the problem of false-positives, location is represented in SeqSLAM as a sequence of images, rather than a single image from one pose. Images are, beforehand, resolution-reduced and patch-normalised to enhance contrast and are, therefore, converted into visual templates. Then, to compare each query image I q from the query sequence Q = (I 1 , ..., I Q ) where Q = |Q| to each database image I D from the database D = (I 1 , ..., I D ) where D = |D|, SeqSLAM calculates the difference score d where R x and R y are the horizontal and vertical image dimensions, respectively. The difference scores are assembled into the so-called difference matrix. A next key processing step is to normalize the image difference values (d) within their (spatially) local image neighborhoods. Subsequently, a search for diagonals of low difference values is performed over the defined sequence length (Q) as depicted in Figure 1.
However, one of SeqSLAM's most significant drawbacks is its lack of viewpoint invariance inherited from the use of global image matching. To compensate for a small viewpoint invariance, variable offset image matching and distance-based template learning have been used in the SMART system built upon SeqSLAM [28]. In fact, to improve performance on traverses with small shifts in lateral pose, a variable offset image matching is used where each query frame (template) is compared to each database frame (template) at a range of offsets with the sum of absolute differences (SAD) performed on the overlapping region and the minimum difference score is used: where u and v are the coordinates in the difference score, X A , Y A , X B and Y B are vectors representing the overlapping region of images a and B and x a and x b are the pixel coordinates in a and B respectively ( Figure 2). Finally, the sub-route with the smallest score and the second smallest score are used to compute the best database matching to the input query sequence. On the other hand, SMART learns (creates the database of templates) and queries templates at regular distance intervals (fixed distance f dist between templates) rather than fixed time or frame intervals as SeqSLAM does (fixed number of frames between database and query templates).
To function effectively, odometry information (sourced either from wheel encoders or vision) is needed [28].

Proposed Approach
We propose a kind of hybrid approach that combines local metrical localization (e.g., visual odometry) with a topological one (e.g., SeqSLAM). The vehicle tracking is performed by VO leading to drift due to the run-time error accumulation. Loop closure recognition via SeqSLAM allows incremental pose drift [29] to be overcome and the state and position of the vehicle to be recovered in cases where the tracking is lost. The use of a two algorithms fusion-based approach is mainly motivated by the fact that VO is prone to drift and place recognition approaches generally suffer from false positive (FP) detections. Although false positives have an impact on the short-term deviation of the system inside the filtering framework, this latter robustly recovers after some time. Combined methods give, therefore, an increased robustness to the system. The template library can be loaded beforehand (when a learning has already been done) or created in real-time (without prior learning). In such a case, the template database is populated with the acquired frames from the monocular stream converted into visual templates along with their poses acquired from the VO module. The query sequence is a FIFO buffer holding the n last templates. In fact, when a new frame is captured, features are detected and matched with the last acquired frame and the relative motion is estimated using the VO module, which will subsequently allow the pose of the vehicle to be predicted. The frame is also converted into a visual template and added to the query sequence with the aim of matching with a database template. The SeqSLAM module performs the matching that will be tested to determine whether it is a true positive based on both a matching score provided by SeqSLAM and the prediction made by the VO module. If a place is recognized, a correction of the pose is made. Otherwise, the template is assumed to belong to a new place and is, therefore, added to the database as depicted in Figure 3.

Position Tracking
The goal of this section is to track the state of the vehicle over several images. The vehicle state in time k is composed by the position in the 2D plane and the orientation with respect to the first reference frame. Thus, we consider the reduced state vector x k ∈ R 3 where p k ∈ R 2 denotes the position and θ k ∈ R denotes the orientation (heading angle).
We adopt a Bayesian approach [30] to track and update the position of the vehicle. We compute the posterior probability density function (pdf) of the state in two steps. To compute the prediction update of the Bayesian filter, we use VO. To compute the measurement update, we integrate the topological localization update, whenever it is supplied by SeqSLAM, described in the previous section.
The system model f describes the evolution of the state over time. The measurement model h relates the current measurement z k ∈ R 3 to the state. Both are expressed in a probabilistic form: where u k−1 ∈ R 3 denotes the output of the VO algorithm at time k − 1, x k|k−1 denotes the prediction of estimate x at time k and x k−1|k−1 denotes the updated estimate of x at time k − 1. The functions f and h are in general non-linear functions.

Visual Odometry System
Visual Odometry (VO) is usually referred to as the problem of incrementally estimating the egomotion of a vehicle using a single or multiple cameras [31]. In order to deal with all central camera models including perspective, dioptric, omnidirectional and catadioptric imaging devices, image measurements are represented as 3D bearing vectors: a unit vector originating at the camera center and pointing toward the landmark. Each bearing vector has only two degrees of freedom, which are the azimuth and elevation inside the camera reference frame as formulated in the OpenGV library [32]. Because a bearing vector has only two degrees of freedom, we refer to it as 2D information and it is normally expressed in a camera reference frame. The fundamental matrix solver within the OpenGV library computes the relative pose of a viewpoint with respect to another viewpoint given a number of eight correspondences between bearing vectors expressed in the respective camera frames within a Random Sample Consensus (RANSAC) framework to deal with false matchings [33].

State Prediction and Uncertainty Estimation
At time k, two consecutive images I k and I k−1 are given as input to the VO algorithm. This latter returns an incremental motion estimate with respect to the local camera reference frame. We define this estimate as δ * k,k−1 ∈ R 3 δ * k,k−1 = (∆s * k , ∆θ k ), where ∆s * k ∈ R 2 denotes the translational component of the motion and ∆θ k the heading angle increment. As we are using monocular visual odometry using only the input of a single camera, we deal with a scale ambiguity where the norm of the translational component cannot be recovered and we only obtain the direction of the translation. ∆s * k is valid up to a scale factor and, thus, the metric translation of the vehicle in the ground-plane ∆s k ∈ R 2 at time k with respect to the local camera frame is equal to where λ ∈ R is the scale factor. Several approaches have shown their efficacy in solving the scale factor ambiguity within the monocular scheme including the use of prior knowledge of the camera height relative to ground-plane [34]. Given that we predicted the state of the vehicle x k using x k−1 and the incremental motion estimate δ k,k−1 ∈ R 3 , the uncertainty of the pose has to be estimated as well, represented by a 3 × 3 covariance matrix.
In order to estimate the covariance matrix Σ δ k,k−1 ∈ R 3×3 , we use the Monte Carlo technique [35]. In general, when dealing with a nonlinear function f (.) that relates a random variable Y to an N-dimensional random variable X such as Y = f (X), where X has a meanX and a covariance Σ X , the transformed mean and covariance of Y can be obtained via a Monte Carlo simulation that relies on a repeated random sampling. In fact, a large number of samples {X 1 , X 2 ..., X n } are randomly drawn from X and the function f (.) is evaluated for each sample. For the transferred samples from function f (.), {Y 1 , Y 2 ..., Y n }, the meanȲ and covariance Σ Y are estimated according to: In the case of VO, the algorithm uses at every step a set of 2D corresponding bearing vectors between images k and k − 1 and provides an incremental estimate δ k,k−1 . In fact, features which are detected and matched between every consecutive pair of frames are, subsequently, converted into bearing vectors. We randomly sample eight correspondences from corresponding bearing vectors and feed them to the RANSAC procedure to compute an estimate {δ i }. All the estimates for which the number of inliers exceeds a prefixed threshold t are saved in a set S = {δ i }. Estimating the covariance matrix using the Monte Carlo technique requires a large number of samples to be randomly drawn from the initial N-dimensional random variable x. We usually obtain a high number of valid Monte Carlo estimates (e.g., more than 500 out of the 1000 iterations result in a valid estimation) and the covariance Σ δ k,k−1 is, finally, estimated using (9) and (10).
The error of the VO is propagated throughout consecutive camera positions as follows. At time k, the state x k|k−1 depends on x k−1|k−1 and δ k,k−1 We compute the associated covariance Σ x k|k−1 ∈ R 3×3 by the error propagation law: assuming that x k−1|k−1 and δ k,k−1 are uncorrelated. We compute the Jacobian matrices numerically, the rows of the jacobian matrices ∇( i f x k−1|k−1 ), ∇( i f δ k,k−1 ) ∈ R 1×3 (i = 1, 2, 3) are computed as: where i x k−1|k−1 and i δ k,k−1 denote the i-th component of x k−1|k−1 and δ k,k−1 respectively.

Loop Closure-Based Measurement Update
In the SeqSLAM algorithm, the input is the query sequence that refers to the last acquired images converted to visual templates. The algorithm selects the path in the difference matrix that connects the query templates to the database ones. The observations are the sequence of n query images I q , and the state space S is the set of database images I d . For each observation there is a state corresponding to one of the database images in the state space as depicted in Figure 4 where primitive high-resolution images are used instead of low-resolution templates for the purpose of clarity. Thus, the observables here are the query and database templates, which are continuously calculated from the monocular sequence and the tracked state x k = (x, y, θ) is hidden. However, to each database template is connected a pose where this image has been taken and to the query template to be matched is connected the predicted state. Hence, the unobservable state variables are directly obtained from the observable templates and the transformation from the state space to the observation space is a simple linear model.
where H = I 3 . In fact, when a place is recognized, the corresponding information of the node (matched in the database) can be used as an observation of the filter.

False Positives Filtering
The Gaussian assumption underlying the Kalman Filter and its variants implies that when an observation that significantly differs from the state estimate, as determined from the state covariances, is incorporated into the state estimate, the resulting Gaussian deviates significantly from its prior shape. This means that the current estimate of the true state is no longer useful which has a negative impact on the system robustness. Although the filter recovers after some time, several applications depend on the short-term result in which false positives (FP) have a disastrous impact. In fact, even with the use of a sequence matching instead of a single image by the SeqSLAM algorithm, this latter still suffers from false-positives. Thus, to filter them out, we use the state and covariance estimates and the Gaussian basis of filtering techniques to estimate the likelihood of a given observation. The conditional pdf P(z k |x k , Σ x k ) is evaluated for the given observation and state estimates to reject observations with too low likelihood as false-positives. The conditional (pdf) is simply a multi-dimensional Gaussian with the mean the estimated state and its covariance. Both the mean and the covariance have to be transformed into the observation space. Hence, the likelihood has the following form where C k = HΣ x k H T + R is the covariance matrix in the observation space, n the state dimension and R is the measurement covariance matrix. If the evaluated likelihood is above a preset, empirically determined threshold, the measurement update step is performed. Otherwise, only the prediction is done and the covariance of the state estimate will grow until a successful observation is reported. As well as the case of rejected observations, the dynamics update is still performed in the case of false negatives and they, therefore, do not affect the system performance. SeqSLAM uses a threshold on similarity between sequences to determine whether a match refers to a true positive determined by precision-recall tests. The lower the threshold is, the more similar are the matched images. However, a too low threshold leads to false negatives (e.g., non-detections). The better compromise has been achieved when using both the similarity score and the likelihood to assess true positives without too many non detections as presented in Algorithm 1.

Algorithm 1: False positives filtering.
Similarity Score← S score ; Similarity threshold low← T low ; Similarity threshold high← T high ; Likelihood threshold← T likelihood ; if ((S score < T low ) or ((S score < T high ) and (likelihood< T likelihood ))) then Do correction; end

Filtering-Based Fusion of Visual Odometry and SeqSLAM
Our aim is to reduce the uncertainty associated to the state estimate by fusing the prediction estimate with the measurement, whenever a valid measurement issued from a recognized loop closure by SeqSLAM is available. The outputs of this fusion step are the updated estimate x k|k and its covariance Σ x k|k ∈ R 3 . We compute them according to the Kalman filter (KF) update mode as we are dealing with a linear measurement model. The update step is first performed by predicting the measurement and its uncertainty given by Then, the measurement update adjusts the prediction results according to with K k a gain matrix that is optimal in the minimum variance sense and is given by where M k is the cross-covariance between the state and output predictions.

Unscented Kalman Filter
The Unscented Kalman Filter (UKF) is an alternative to the Extended Kalman Filter (EKF) demonstrating a superior performance and ease of implementation for nonlinear estimation. The UKF uses an alternative form for dealing with nonlinearity based on the sigma points. The time update includes the weights and sigma points calculations, whereas the measurement update uses the sigma points to generate the covariance matrices and the Kalman gain respectively. The augmented state matrix Y is constructed from the prior belief of the state (x k−1|k−1 , Σ k−1|k−1 ): where Q k is the process noise covariance matrix estimated via a Monte Carlo simulation as explained in the previous section. The prior belief (x k−1|k−1 , Σ k−1|k−1 ) is converted into a sigma point representation via: where κ is a scaling parameter that affects fourth and higher order moments of the pdf . The associated weights β i are the following: The subsequent step consists in passing the sigma point through the state evolution model The mean and covariance estimates for γ are calculated as The estimate of the posterior (x k|k , Σ k|k ) is performed through passing each sigma point according to the observation model The predicted measurement and innovation covariance are then computed where R k is the measurement noise covariance matrix. The state measurement covariance and Kalman gain are, next, built according to: Finally, the posterior belief, x k|k , Σ k|k is computed according to:

The Experimental Dataset
The Kitti odometry dataset [15], recorded from cars driven in urban and rural areas and on highways, consists of 22 sequences where the first 11 are provided with ground truth.
In order to evaluate our method, we conducted experiments on three data sequences from the odometry category with a loopy trajectory. Together, these sequences, presented in Figure 5, include 1249 images of loop closures. The total driving distance for loop closures in these sequences is 1169.6 m and are highlighted in blue in Figure 5. The locations of loop closures for the Sequence 00 in terms of the amount of travelled meters from the first reference pose are presented in Table 1.  We recall that the Kitti dataset provides challenging benchmarks to the computer vision community that are available online www.cvlibs.net/datasets/kitti.

Parameters
The parameters used for SeqSLAM are those reported in the literature and presented in Table 2. In order to calibrate SeqSLAM and evaluate the uncertainty of the position determination, we used almost 50% of the 1249 images providing two traversals of the same route. The first traversal was used as a database sequence and the second as a query sequence. We assimilated the measurement noise ν k = N (0, R). The matching results were compared to the ground truth. In the case of true positives, the deviations have been chosen to be respectively σ x = 1, σ y = 1 and σ θ = 0.1 so that the returned result is always within 3 standard deviations from the ground truth.

Accuracy Evaluation
We display the ground truth trajectory measured by a precise GPS in green in Figures 6-8 for Sequences 00, 05 and 06 respectively from the Kitti odometry benchmark. We show the path estimated by VO in red and the corrected one (VO+SeqSLAM) in blue. Even though the position is generally accurately detected, the non-detections (FN) are sometimes visible (the correction is not made in the beginning of a loop-closure). In fact, when an update from SeqSLAM is integrated into the Bayesian tracking to correct the trajectory after an important drift, the correction is suddenly made, resulting in a non-smooth trajectory (marked by the blue stars). The drift, though, is greatly reduced for a precise localization afterwards.
We show the mean error per travelled meter of position and heading in Figure 9 for the Sequence 00.     We show the comparative results between the corrected and non-corrected journeys in Table 3 in terms of average error in meters and the percentage of the error with respect to the whole trajectory. The performance of SeqSLAM-based loop closure detection and correction is almost two times better than only VO. We also present in Table 4 the error of the ending pose for all the test sequences.

Fusion Method Evaluation
The UKF is a superior alternative to the EKF for a variety of estimation and control problems. However, its effectiveness depends on the nonlinearity of the problem. In our case, we are dealing with a nonlinear prediction model and a linear measurement model. As listed in Table 5, our experimental results and analysis indicate that UKF performs as efficiently as the EKF. However, the additional computational overhead of the UKF and the linear nature of the update step with SeqSLAM's observations lead to the conclusion that the EKF is a better choice for fusing VO and SeqsLAM.

Timing and Storage
In this section, we briefly describe the storage and computational requirements of the system. In fact, all the templates are down-sampled from 1241 × 376 to 64 × 16 (0.22%) pixels. Consequently, the storage and computational requirements are greatly reduced. To deal with the computational complexity of building a difference matrix between the query and all the database images, a CUDA-based solution was designed in [36], allowing less than 30 ms to be achieved for the query and database sequences used in our experiments. This timing is based on the use of a mid-range GPU, the CUDA NVIDIA GeForce GTX 850M running at 876 MHz with 4096 MB of GDDR device memory. The acceleration is mainly based on the allocation of the three major steps of SeqSLAM, namely: the difference matrix computation, the difference matrix contrast-enhancement and the route searching to three GPU kernels that exploit the parallel CUDA threads and thread-blocks and the fast shared memory. The database, in this design, is stored in the GPU's global memory. Regarding the timing using a commercially available laptop with an 8 core-2.40 GHz clock, it is around 100 ms.

Discussion
The template library is created online using navigation information from the VO. That is, when there is not a match in the database, a node is added to this latter with the template and the corresponding information of pose obtained from the VO module. The sequence searching strategy used in this work is based on a regular time interval but could be optimized for a regular distance interval between templates to make the localized sequence searching algorithm more efficient as reported in [28]. In fact, we have not relied on the regular distance strategy as we have used the Kitti dataset for evaluation. a regular distance consists in taking templates at a regular distance of 1 m for example. Moreover, this work is a proof of concept that an enhancement could be obtained by only detecting loop closures without jointly optimizing over landmarks and poses as a SLAM system does. However, the greatest interest would be generated when a learning of a geotagged database is already performed which would allow an absolute localization in the global reference frame. The SeqSLAM algorithm could also be adapted to work with different camera systems to make it possible to share the database of templates between different users. Furthermore, even though the drift has been greatly reduced, other methods could be fused with SeqSLAM and VO to make the output suitable for autonomous cars.

Conclusions
This paper presented a solution to localize a vehicle in urban environments by means of the integration of VO and SeqSLAM. SeqSLAM is a well-known successful approach for place recognition in varying conditions such as seasonal changes and day and night cycles when not dealing with a viewpoint change. The localization was performed using a single on-board camera and the template library was created online using navigation information from the VO. The integration was performed with a Bayesian filtering through the use of an EKF. That is, when a place is recognized, the corresponding information is used as an observation of the filter, otherwise the prediction is performed by the VO module. The performance of the system can be increased by using a pre-learnt geo-tagged database of images to develop a reliable alternative to satellite-based global positioning. In such a case, the importance of SeqSLAM is even more important due to its ability to deal with severe lighting differences between the learnt database and the actual visual input. The results showed the superiority of the visual odometry integrated with SeqSLAM algorithm in real-time localization as the navigation errors are greatly reduced by loop-closure detection.
Author Contributions: S.O. conceived, designed and developed the proposed method and its experimental validation; All other authors contributed their expertise to validate the proposal, evaluate and question the experimental results. All authors contributed in framing the writing of the paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: