In this section, we elaborate on the experiments conducted to improve the performance of Visual Localizer. First, a robust ConvNet-based image descriptor is determined by evaluating the individual layers of five ConvNets. Subsequently, the parameters of the network flow model built on the data association graph are tuned for optimal image matching performance.
4.2. Evaluation Criteria
Herein, we define true positives, false positives and false negatives according to the consistency between the ground truth and the localization predictions.
True positive (TP). The localization system matches the query image with a database image, and the matching result is consistent with the ground truth.
False positive (FP). The localization system matches the query image with a wrong database image, i.e., one different from the ground truth.
False negative (FN). The localization system gives no response for a query image, although there are database images associated with it.
The performance of Visual Localizer is evaluated and analyzed in terms of precision and recall. As defined in Equation (1), precision is the proportion of true positives out of all predicted positives. Moreover, recall is the proportion of true positives out of all actual positives, as defined in Equation (2).
Considering both precision and recall, the F1 score is the harmonic mean of the two; it reaches its best value at 1 and its worst at 0.
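For reference, precision, recall and the F1 score, consistent with the textual definitions above (and with Equations (1) and (2)), can be written as:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall}    = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```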
We inspect each individual layer of the different ConvNets with regard to viewpoint, illumination and cross-season performance. Each individual layer extracted from a network is taken as a holistic descriptor of the input image, which is resized to 224 × 224 in advance. Originally in floating-point format, the descriptors extracted from the ConvNets are cast into a normalized 8-bit integer format.
The length of the vectorized descriptor is calculated as in Equation (5), $L = \sum_{i=1}^{n} h_i \times w_i \times d_i$, where $h_i$, $w_i$ and $d_i$ are the height, width and number of dimensions of the $i$-th layer, respectively, and $n$ is the number of layers concatenated together.
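As an illustration, a minimal NumPy sketch of the vectorization and 8-bit casting described above (assuming the layer activations have already been extracted as float arrays; the function names are ours, not part of the original implementation):

```python
import numpy as np

def layer_to_descriptor(feature_map: np.ndarray) -> np.ndarray:
    """Turn one ConvNet layer (h x w x d float tensor) into a normalized
    8-bit holistic descriptor; its length is h * w * d."""
    vec = feature_map.reshape(-1).astype(np.float32)
    vmin, vmax = vec.min(), vec.max()
    vec = (vec - vmin) / (vmax - vmin + 1e-12)   # normalize to [0, 1]
    return (vec * 255).astype(np.uint8)          # cast to 8-bit integers

def concatenate_layers(feature_maps) -> np.ndarray:
    """Concatenate n layers into one descriptor whose length is the sum of
    h_i * w_i * d_i over the layers, matching Equation (5)."""
    return np.concatenate([layer_to_descriptor(fm) for fm in feature_maps])
```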
During the evaluation of the ConvNet layers, the simplest single-image nearest-neighbor matching based on Hamming distance [47] is adopted as the image matching strategy, so as to avoid the influence of the matching algorithm on the performance evaluation of the ConvNets. All of the layers of the different ConvNets are checked, and their performance is presented as precision–recall curves. The variable that generates the precision–recall curves is the threshold of the ratio test [32], i.e., the ratio between the distance of the best match and that of the second-best match found in the nearest-neighbor search. The localization result of a query image is regarded as a true positive only if the image matching result passes the ratio test.
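A minimal sketch of this single-image nearest-neighbor matching with the ratio test follows (the threshold value below is illustrative, not the tuned one):

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Bitwise Hamming distance between two uint8 descriptors."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def match_query(query: np.ndarray, database: list, ratio_threshold: float = 0.8):
    """Return the index of the nearest database descriptor, or None when the
    best/second-best distance ratio fails the ratio test."""
    dists = np.array([hamming_distance(query, d) for d in database])
    order = np.argsort(dists)
    best, second = dists[order[0]], dists[order[1]]
    if second == 0 or best / second > ratio_threshold:
        return None   # ambiguous or degenerate match is rejected
    return int(order[0])
```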
4.3. Performance Analysis and Comparison between Different ConvNet Layers
In this section, a detailed layer-by-layer analysis of visual localization performance is presented. To evaluate the performance of the prevailing ConvNets, we used representative datasets featuring three aspects: viewpoint changes, illumination changes and cross-season changes. The precision–recall curves of the different ConvNet layers on the different datasets are shown in Figure 13.
AlexNet. As presented in Figure 13, the features extracted from AlexNet behave similarly to the observations of [32]. The mid-level features derived from conv3 are the most robust against appearance changes. Furthermore, conv3 achieves a precision of around 50% at a 100% recall rate on the viewpoint change dataset, which is inferior only to the performance of the high-level fc6 layer of AlexNet.
VGG16. Features extracted from the layers ranging from conv4 to fc6 of VGG16 show similar robustness against the several types of changes, as illustrated by the Top 6 results in Figure 13. However, these features have sub-optimal viewpoint invariance and illumination invariance, given that the precision at a 100% recall rate is less than 40%. The poor cross-season performance illustrates that features from a single layer are not able to overcome all kinds of appearance variations. For example, the features extracted from conv4_1 display opposite behavior with respect to illumination invariance and cross-season invariance.
GoogLeNet. The Inception module was proposed to eliminate the influence of different filter sizes on the recognition task [27]. The extraordinary performance of GoogLeNet illustrates that using filters of different sizes simultaneously allows the ConvNet to choose the most appropriate features for visual localization, as shown in Figure 13. On the viewpoint change dataset, the precision of Inception5b/1 × 1 at a 100% recall rate is about 60%. The features from Inception3a/3 × 3 are the most robust against both illumination and cross-season changes, with precisions of more than 70% and 80% on the illumination and cross-season datasets, respectively. The second-best features for illumination invariance and cross-season invariance are Inception3a/3 × 3_reduce and Inception3b/3 × 3_reduce, respectively. We find that the filter size of most top-ranked layers is 1 × 1. In other words, the capability of feature extraction is not proportional to the filter size. Instead, the feature maps of convolutional layers with small filter sizes (especially 1 × 1) perform better in terms of appearance invariance and viewpoint invariance.
SqueezeNet. The high-level features ranging from Fire6 to Fire9 are robust against appearance changes, while the mid-level features ranging from Fire2 to Fire5 are robust against viewpoint changes. In particular, Fire9/squeeze1 × 1, consisting of 13 × 13 × 64 feature maps, performs well in terms of both illumination and viewpoint invariance. However, it displays limited performance on cross-season invariance. The precision at a 100% recall rate of all the layers (except Fire9/squeeze1 × 1) is less than 40% on the three datasets because of the drastic compression performed by the squeeze layer, which might discard key attributes for visual localization.
MobileNet. The lightweight depthwise convolution of MobileNet not only requires fewer computational resources than standard convolution, but also retains high accuracy on image recognition tasks compared with AlexNet and GoogLeNet [30]. Nevertheless, on the visual localization task, each layer of MobileNet performs worse than the layers of AlexNet and GoogLeNet. The depthwise separable convolutional block impedes information flow between different channels, which might degrade individual convolution filters and weaken the representation of the corresponding feature maps. This is the most critical reason why the feature extraction ability of the depthwise separable convolutional block is far worse than that of a standard convolutional layer.
Furthermore, the performance comparison between the best results selected from each ConvNet is shown in Figure 14. In Table 1, we summarize the performance of each ConvNet trained on the ImageNet dataset together with its assessment on the three aspects. From the experimental results, we draw four conclusions:
In AlexNet, VGG16 and GoogLeNet, the features extracted from the mid-level layers are more robust against appearance changes, which is consistent with the conclusion made by Sünderhauf et al. [32]. If a feature is illumination-invariant, it also exhibits season-invariant robustness, such as conv3 of AlexNet, conv4 and conv5 of VGG16 and the Inception3a module of GoogLeNet. However, lightweight ConvNets, such as SqueezeNet, seem to contradict the conclusions mentioned above.
As shown in Table 1, the object recognition accuracies of VGG16 and MobileNet on the ImageNet dataset are 71.5% and 70.6%, respectively, which are better than those of the other ConvNets. However, the features from VGG16 and MobileNet have inferior performance in terms of appearance invariance. This illustrates that performance on object recognition is not completely transferable to the task of visual place localization.
As presented in Figure 14, most layers of each ConvNet exhibit satisfactory precision on the viewpoint change dataset, which illustrates that convolutional layers inherently feature translation invariance. Given this insight, appearance changes are given more attention in our selection of robust convolutional layers.
GoogLeNet has overwhelming advantages over the other ConvNets because of its best performance on both appearance invariance and viewpoint invariance, as well as its modest computational complexity. Based on this observation, we choose GoogLeNet as the optimal ConvNet, from which we select robust layers to describe images.
4.5. Concatenation and Compression
Because features from a single layer are not able to overcome all kinds of changes, utilizing concatenated layers rather than a single layer as the holistic image descriptor yields more robust performance. However, it is necessary to select and concatenate only a handful of layers that are robust against changes, as the computational cost of image matching is proportional to the dimension of the holistic image descriptor. As shown in Figure 16, we compare the performance of different layers of GoogLeNet and different combinations of layers on the Freiburg dataset. The combination of Inception3a/3 × 3 and Inception3a/3 × 3_reduce achieves robust and satisfying performance while maintaining lower computational complexity than the other combinations. Hence, we concatenate Inception3a/3 × 3 and Inception3a/3 × 3_reduce together as the holistic descriptor.
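Assuming the standard GoogLeNet feature-map shapes for these two layers (28 × 28 × 128 for Inception3a/3 × 3 and 28 × 28 × 96 for Inception3a/3 × 3_reduce), the length of the concatenated descriptor follows from Equation (5):

```latex
L = 28 \times 28 \times 128 + 28 \times 28 \times 96 = 100352 + 75264 = 175616
```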
Computing a substantial number of distances between 175,616-dimensional descriptors is an expensive operation and is a bottleneck of ConvNet-based place recognition. To reduce the size of the ConvNet descriptors without losing much accuracy, the redundant and irrelevant features should be omitted to compress the descriptors. We perform a random selection [47] of features (i.e., randomly choosing a specific set of elements of the descriptors), which sacrifices only a little precision while eliminating most of the descriptor size. Our compression proposal is supported by the satisfactory results displayed in Table 2. Finally, we choose the 8192-dimensional descriptor as the optimal descriptor.
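A minimal sketch of this random selection (the seed and function names are illustrative; the essential point is that the same index set is applied to every query and database descriptor):

```python
import numpy as np

def make_selection_indices(full_dim: int = 175_616,
                           target_dim: int = 8_192,
                           seed: int = 0) -> np.ndarray:
    """Draw a fixed random subset of element indices once."""
    rng = np.random.default_rng(seed)
    return rng.choice(full_dim, size=target_dim, replace=False)

def compress_descriptor(descriptor: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Keep only the randomly selected elements of the full descriptor."""
    return descriptor[indices]
```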
4.6. Parameter Tuning of Global Optimization
To achieve the best global optimization performance, the parameters of the network flow model need to be tuned to adequate values. The parameters to be tuned are summarized in Table 3. The quantity of flow (f) denotes the number of matching images retrieved from the database for a query image. The number of children nodes in the graph denotes the number of possible database images for the next query image. The cost of edges pointing to hidden nodes (c) denotes the threshold of descriptor distance used to reject mismatched images.
Out of the three parameters, we first set f to 1, which means that no more than one image is retrieved from the database for each query image. The other two parameters (the number of children nodes and c) are determined by a grid search [48] on the Bonn dataset. Different values of the two parameters constitute a grid, in which the combination achieving the highest precision (defined in Equation (1)) is taken as optimal. According to the tests on the Bonn dataset, the number of children nodes does not make a noticeable difference over a large range of values, hence it is set to 4. The precisions and recalls obtained with different values of the parameter c are shown in Figure 17, from which the optimal value of c is determined. If the parameter c is too small, the flow tends to pass through the hidden nodes, so that the recall of image matching becomes too low. Conversely, if c is too large, the hidden nodes become invalid. It is obvious that localization by the network flow model is superior to single-image matching.
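A minimal sketch of the grid search over the remaining two parameters (evaluate is a hypothetical callback that runs the network flow matching on the Bonn dataset for one parameter combination and returns the precision of Equation (1); the ranges in the comment are illustrative):

```python
import itertools
from typing import Callable, Iterable, Tuple

def grid_search(children_values: Iterable[int],
                cost_values: Iterable[float],
                evaluate: Callable[[int, float], float]) -> Tuple[Tuple[int, float], float]:
    """Evaluate every (number of children nodes, hidden-edge cost c)
    combination and return the one with the highest precision."""
    best_params, best_precision = None, -1.0
    for k, c in itertools.product(children_values, cost_values):
        precision = evaluate(k, c)
        if precision > best_precision:
            best_params, best_precision = (k, c), precision
    return best_params, best_precision

# Example call (illustrative ranges):
# grid_search([2, 4, 6, 8], [0.5, 0.6, 0.7, 0.8, 0.9], evaluate=run_bonn_matching)
```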
To validate the superiority of the network flow model in coping with complicated scenarios, we test its performance on a modified dataset, in which one-third of the images at the start of the database are removed. This simulates a sporadically occurring situation in which a query image does not correspond to any database image. As presented in Figure 18, single-image matching does not reject bad matching results, whereas the network flow model achieves higher localization precision.
4.7. Real-World Experiments
To achieve navigational assistance for people with visual impairments, we developed a wearable assistance system, Intoer [49], which comprises the multi-modal camera RealSense [50], a customized portable processor with a GNSS module, and a pair of bone-conduction earphones [51], as shown in Figure 19a. Based on this system, we have previously implemented various assistive utilities, including traversable area and hazard awareness [52,53,54], crosswalk and traffic light detection [55,56], etc. In this paper, we use Intoer to capture color images, and the image matching is processed off-line.
Utilizing the off-the-shelf system, we perform visual localization experiments in real-world environments. Five visually impaired volunteers (shown in Figure 19b) are invited separately to wear Intoer, which is set to autonomously capture color images as the user traverses a route. The routes lie in the Yuquan Campus of Zhejiang University (shown in Figure 20a) and the landscape area of the West Lake (shown in Figure 20b), Hangzhou, China. On these practical routes, the volunteers walk on the sidewalk and encounter pedestrians or vehicles traveling in the same or opposite direction. Moreover, illumination changes, season changes and viewpoint changes exist in the real-world images, which makes it possible to validate the proposed Visual Localizer in practical environments.
The robustness of Visual Localizer against viewpoint changes is validated first. To this end, both the query images and the database images are captured on the same route by a visually impaired volunteer on a summer afternoon. The localization results are shown in Figure 21. Because the volunteer with visual impairments has no concept of viewpoint, the query and database images exhibit viewpoint variations, which are also caused by body movement during the volunteer's walking. The visual localization results show that Visual Localizer is robust against viewpoint changes. This further confirms the assumption that the concatenation of Inception3a/3 × 3 and Inception3a/3 × 3_reduce is also viewpoint invariant. Despite the partial occlusions caused by dynamic objects (e.g., vehicles, pedestrians, etc.), Visual Localizer successfully matches query images with the correct database images, as shown in Figure 21.
To validate the robustness against illumination changes and season changes, we invited three volunteers with visual impairments to travel three different routes in the landscape area of the West Lake. The database images were captured on a sunny summer day, while the query images were captured on a rainy winter day. The corresponding visual localization results of the three routes are presented in Figure 22, which contains notable examples of image matching under illumination changes. This demonstrates that Visual Localizer is sufficiently robust under contrasting sunlight intensities. The foliage color changes and vegetation coverage changes appearing in Figure 22a demonstrate that our approach also handles visual localization under cross-season conditions. Furthermore, the appearance changes of the outdoor cafe (Figure 22b) also illustrate the appearance robustness of Visual Localizer. In all of these situations, the visual localization delivers accurate results, even when there are partial occlusions caused by dynamic objects in the images.
We also performed an experiment to validate the robustness against route changes. The experimental results are presented in Supplementary Material Video S1, in which the user first travels along a recorded route and then turns into a new one. When the visually impaired navigator travels along a new route that is not recorded in the database, the query images are no longer matched with any database image. Only when the user returns to the recorded route are the query images matched with the corresponding database images again. The localization results under the condition of route changes illustrate that our approach behaves well when the user enters a new place.
In general, our approach is robust against viewpoint changes, illumination changes (dim light vs. bright daylight), cross-season changes (winter vs. summer) and route changes. Therefore, Visual Localizer performs well under the visual changes of outdoor assisted navigation.
We performed localization experiments to compare the positioning accuracy of our approach with that of GNSS-based localization. The route of the comparison experiment is shown in Figure 23. The query images were captured on a summer afternoon, and the database images were captured on an early winter evening. The query sequence is composed of 255 images, and the localization results are shown in Figure 24 and Table 4. Mean error refers to the mean index error between the localization results and the ground truths. Precision is defined as the percentage of matching pairs with an index error of less than 5. Visual Localizer and GNSS-based localization retrieve 58 and 255 matched positions, respectively. Although Visual Localizer has a lower recall of image matching, its positioning error and precision are superior to those of GNSS-based localization.
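As a minimal sketch, these two metrics can be computed from matched index pairs as follows (the function name is ours):

```python
import numpy as np

def index_error_metrics(predicted_idx: np.ndarray, ground_truth_idx: np.ndarray,
                        threshold: int = 5):
    """Mean index error and precision, i.e., the percentage of matching
    pairs whose index error is less than the threshold."""
    errors = np.abs(predicted_idx - ground_truth_idx)
    return float(errors.mean()), 100.0 * float((errors < threshold).mean())
```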
According to Figure 24, Visual Localizer achieves better localization precision than the GNSS-based approach. As shown in the fourth and fifth rows of Figure 24, two query images from different locations are matched with the same wrong database image by GNSS-based localization, a mistake that is avoided by Visual Localizer.