In this section, a comparative evaluation is carried out to test the performance of the two clustering methods and the five description approaches in the hierarchical mapping problem. Firstly, the sets of images used to develop the experiments are described. Secondly, the preliminary experiments that permit selecting the appropriate clustering method for each task are outlined. After that, the two main experiments are addressed: the creation of the high-level and the intermediate-level maps.

#### 4.3. Experiment 1: Creating Groups of Images to Obtain a High-Level Map

To analyze exhaustively the hierarchical clustering method in the creation of a high-level map, the influence of the following parameters is assessed.

Image description method. The performance of the five methods presented in Section 2 and the impact of their parameters are assessed: ${k}_{1}$ (number of columns retained) in the Fourier signature; ${k}_{3}$ (number of PCA components) and ${N}_{R}$ (number of rotations of each panoramic image) in the case of rotational PCA; ${k}_{4}$ (number of horizontal cells) in the HOG descriptor; ${k}_{6}$ (number of horizontal blocks) and $m$ (number of Gabor masks) in gist; and finally, the descriptors obtained from the layers fc7 and fc8 in CNN.

Method to calculate the distance $dist({C}_{q},{C}_{s})$. All the traditional methods in hierarchical clustering (Table 1) have been tested:

- Single. Method of the shortest distance.
- Complete. Method of the longest distance.
- Average. Method of the average unweighted distance.
- Weighted. Method of the average weighted distance.
- Centroid. Method of the distance between unweighted centroids.
- Median. Method of the distance between weighted centroids.
- Ward. Method of the minimum intracluster variance.
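As an illustration only, the seven linkage criteria listed above can be exercised with SciPy, whose `linkage` function uses the same method names. This is a minimal sketch on synthetic two-dimensional data (three well-separated groups standing in for rooms); the data, the group count and the variable names are assumptions made for the example, not the descriptors or the code used in the experiments.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for image descriptors: 3 "rooms", 20 descriptors each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

methods = ["single", "complete", "average", "weighted",
           "centroid", "median", "ward"]

labels_per_method = {}
for method in methods:
    # In SciPy, centroid, median and ward are only well defined
    # for the Euclidean metric, so it is used for all methods here.
    Z = linkage(descriptors, method=method, metric="euclidean")
    # Cut the dendrogram so that at most 3 flat clusters remain.
    labels_per_method[method] = fcluster(Z, t=3, criterion="maxclust")
```

On data this clean every criterion recovers the three groups; the differences reported in this section only appear with real descriptors, where visual aliasing blurs the boundaries between rooms.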

Distance measurement between descriptors. All the distances presented in Section 2 are considered in the experiments. The notation used is:

- ${d}_{1}$. Cityblock distance.
- ${d}_{2}$. Euclidean distance.
- ${d}_{3}$. Correlation distance.
- ${d}_{4}$. Cosine distance.
- ${d}_{5}$. Weighted distance.
- ${d}_{6}$. Square-root distance.
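For reference, the first four distances have direct counterparts in SciPy's `pdist`. The sketch below computes them on random vectors that stand in for descriptors. The weighted (${d}_{5}$) and square-root (${d}_{6}$) distances are defined in Section 2 and are not reproduced here, so they are deliberately left out rather than guessed; the dictionary keys are only labels chosen for this example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Random vectors used as placeholder descriptors (5 images, 16 components).
rng = np.random.default_rng(1)
descriptors = rng.random((5, 16))

# Mapping from the notation of the text to SciPy metric names.
# d5 and d6 are omitted: their definitions are given in Section 2.
metric_names = {
    "d1": "cityblock",
    "d2": "euclidean",
    "d3": "correlation",
    "d4": "cosine",
}

# One symmetric pairwise distance matrix per metric.
distance_matrices = {
    name: squareform(pdist(descriptors, metric=metric))
    for name, metric in metric_names.items()
}
```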

To analyze the correctness of the results provided by each experiment, the following data are shown in the figures: (a) the accuracy of the classification; (b) how well the dendrogram represents the original entities; and (c) the consistency of the final clusters. Firstly, to calculate the accuracy of the classification, the NMI (normalized mutual information) measure has been used. It obtains the confusion matrix of the resulting groups and, from the information of the principal diagonal (correctly classified entities) and the rest of the information, it provides an index $c$ that takes values in the range $[0,1]$ and indicates the accuracy of the resulting clusters. In this application, the value 1 indicates that the resulting clusters perfectly reflect the rooms that compose the environment. The lower this value is, the more mixed the information is in the resulting clusters (that is, images captured from several rooms are classified into the same cluster). Secondly, the correlation ${\gamma}_{coph}$ between the cophenetic distances and the distances among entities is obtained to estimate how naturally the dendrogram represents the visual data. High values indicate that the dendrogram built during the clustering process faithfully reflects the descriptors of the original scenes. Thirdly, the consistency of the final clusters is evaluated through the inconsistency coefficient ${\delta}_{inconsist}$, as presented in Section 3.2.1.
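The three indicators can be sketched as follows. This is an illustrative example on synthetic data, not the evaluation code of the experiments. The function `normalized_mutual_information` is a hypothetical helper that normalizes the mutual information by the arithmetic mean of the two entropies, which is one common convention and an assumption here. The cophenetic correlation and the per-link inconsistency coefficient come from SciPy's `cophenet` and `inconsistent`; how the inconsistency values are aggregated and normalized into ${\delta}_{inconsist}$ is described in Section 3.2.1 and is not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cophenet, inconsistent
from scipy.spatial.distance import pdist

def normalized_mutual_information(true_labels, pred_labels):
    """NMI computed from the contingency (confusion) matrix.
    Normalization by the mean of both entropies is an assumption."""
    _, t = np.unique(true_labels, return_inverse=True)
    _, p = np.unique(pred_labels, return_inverse=True)
    cont = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(cont, (t, p), 1)          # count co-occurrences
    pij = cont / cont.sum()             # joint distribution
    pi, pj = pij.sum(axis=1), pij.sum(axis=0)
    nz = pij > 0
    mi = np.sum(pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz]))
    h_t = -np.sum(pi[pi > 0] * np.log(pi[pi > 0]))
    h_p = -np.sum(pj[pj > 0] * np.log(pj[pj > 0]))
    if h_t == 0 and h_p == 0:
        return 1.0                      # both partitions are a single cluster
    return float(mi / (0.5 * (h_t + h_p)))

# Synthetic descriptors: 3 "rooms" with 20 entities each.
rng = np.random.default_rng(0)
rooms = np.repeat([0, 1, 2], 20)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
descriptors = centers[rooms] + 0.3 * rng.standard_normal((60, 2))

pairwise = pdist(descriptors, metric="euclidean")
Z = linkage(pairwise, method="average")
clusters = fcluster(Z, t=3, criterion="maxclust")

accuracy_c = normalized_mutual_information(rooms, clusters)  # index c
gamma_coph, _ = cophenet(Z, pairwise)                        # cophenetic correlation
# Columns: mean height, std of heights, link count, inconsistency coefficient.
incons = inconsistent(Z, d=2)
```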

Figure 6 shows the results obtained with the Fourier signature, using (a) ${k}_{1}=8$, (b) ${k}_{1}=32$ and (c) ${k}_{1}=128$ components per row in the matrix of magnitudes. The blue bars show the accuracy of the classification $c$, the red trend line is the correlation coefficient ${\gamma}_{coph}$ and, finally, the green trend line reflects the inconsistency coefficient ${\delta}_{inconsist}$. The value of the inconsistency coefficient has been normalized to the interval $[0,1]$. The higher this coefficient is, the more natural the final division into clusters is. These figures show that the best results are those provided by the shortest distance method (single). Nevertheless, the maximum accuracy is under $90\%$ in all the cases. Analyzing this method in depth, increasing ${k}_{1}$ (size of the descriptor) leads to more consistent trees, as reflected by a general improvement in the correlation coefficient. However, the accuracy of the classification does not change as the size of the descriptor increases. This is a general effect that can be observed in Figure 6; increasing ${k}_{1}$ (size of the Fourier signature descriptor) does not lead to a clear improvement of the classification results. The centroid and median methods tend to present especially poor classification results. In conclusion, the Fourier signature has not been able to create clusters that separate completely and accurately the rooms of the Bielefeld dataset while creating the high-level map.
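To make the role of ${k}_{1}$ concrete, the following sketch assumes the usual formulation of the Fourier signature: the discrete Fourier transform of each row of the panoramic image is computed and the magnitudes of the first ${k}_{1}$ columns are retained. The exact definition used in this work is given in Section 2, so the function `fourier_signature` and the synthetic image are assumptions made for illustration.

```python
import numpy as np

def fourier_signature(panorama, k1):
    """Row-wise DFT of a panoramic image; keeps the magnitudes
    of the first k1 columns (hypothetical helper)."""
    spectrum = np.fft.fft(panorama, axis=1)
    return np.abs(spectrum[:, :k1])

rng = np.random.default_rng(0)
panorama = rng.random((64, 256))         # synthetic grayscale panorama
# A rotation of the camera about the vertical axis is modeled
# as a circular shift of the panorama columns.
rotated = np.roll(panorama, 37, axis=1)

sig = fourier_signature(panorama, k1=32)
sig_rot = fourier_signature(rotated, k1=32)
```

By the shift theorem of the DFT, a circular column shift only changes the phases, so `sig` and `sig_rot` coincide. The descriptor size grows linearly with ${k}_{1}$ (here, 64 rows by 32 columns).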

Next, Figure 7 shows the results obtained with rotational principal component analysis, considering that the number of eigenvectors is ${k}_{3}=50$ and the number of rotations is (a) ${N}_{R}=4$, (b) ${N}_{R}=16$ and (c) ${N}_{R}=64$. In this experiment, the size of the descriptor ${k}_{3}$ is kept constant because preliminary experiments showed that this parameter had little influence on the results. The results obtained with rotational PCA are somewhat similar to those obtained with the Fourier signature. Figure 7 shows that the correlation distance (${d}_{3}$) along with the single method provides an accuracy near 1. However, in this case, the correlation coefficient has a significantly low value. This indicates that the dendrogram does not reflect the original visual data well. Therefore, the clustering process has not been carried out in a natural fashion.

The next experiment has been carried out with the HOG descriptor. Figure 8 shows the results obtained when considering: (a) ${k}_{4}=1$; (b) ${k}_{4}=4$; and (c) ${k}_{4}=16$ horizontal cells. Further experiments considering a higher number of cells have shown no improvement. These figures show that the results obtained with ${k}_{4}=1$ cell are predominantly unsuccessful. The results of the single method are especially poor. Nevertheless, the cosine distance (${d}_{4}$), along with the weighted, centroid and Ward methods, provides remarkably good results, although they do not reach a $100\%$ success rate, and the correlation takes comparatively low values in the three cases. Therefore, considering only ${k}_{4}=1$ cell leads to a descriptor that does not contain sufficiently distinctive information to carry out the global mapping process successfully. Notwithstanding that, a clear improvement is shown when the number of cells increases, and very accurate results are obtained when considering ${k}_{4}=16$ cells. With this configuration, several experiments have provided an accuracy equal to $100\%$, with relatively good correlation coefficients (around $0.8$) and inconsistency coefficients near 1. This indicates that this cluster division is not only perfect but also consistent. The images captured in each room have been assigned to separate clusters, and this division reflects the visual input data in a natural way. It is worth highlighting the performance of the single and weighted methods (except when using the correlation distance ${d}_{3}$) and the Ward method.

After that, the results obtained with the gist descriptor are shown. In this case, the influence of the two main parameters of the descriptor is assessed: the number of horizontal blocks ${k}_{6}$ and the number of Gabor masks $m$. On the one hand, Figure 9 shows the results obtained when considering the gist descriptor, $m=4$ masks and (a) ${k}_{6}=4$; (b) ${k}_{6}=8$; and (c) ${k}_{6}=16$ blocks. On the other hand, Figure 10 shows the results obtained with $m=16$ masks and (a) ${k}_{6}=8$; (b) ${k}_{6}=16$; and (c) ${k}_{6}=32$ blocks. From Figure 9 and Figure 10, several conclusions can be reached. First, the number of masks $m$ is of utmost importance to obtain accurate groups. In general, the results obtained with $m=16$ are better than those obtained with $m=4$. Second, the influence of the number of horizontal blocks ${k}_{6}$ is not remarkable. Some methods, along with several specific distances, show a higher accuracy when the number of blocks increases, but this is not usually accompanied by an improvement of the correlation and inconsistency coefficients. Third, the method of the longest distance (complete) does not perform well, independently of the distance measure used. Finally, when gist is used, the best absolute results are obtained with $m=16$ masks and (a) single method and cityblock distance; (b) average method along with correlation or cosine distances; and (c) weighted method with cosine distance. All these combinations provide an accuracy of the classification equal to $100\%$ and comparatively good values of the correlation and inconsistency coefficients. Also, it is not necessary to build the descriptor with a high number of horizontal blocks. This results in a low-dimensional descriptor and a reasonably low computational cost.

Finally, the results obtained with the CNN descriptor are shown in Figure 11. Some configurations of this descriptor also offer an accuracy equal to $100\%$. It is worth highlighting the behavior of the descriptor obtained from the layer fc8 along with the Ward method, because not only does it provide an accuracy equal to $100\%$, but the clustering also presents a high consistency. The distance ${d}_{6}$ provides relatively poor results with the CNN descriptor in all cases.

To conclude the high-level mapping experiments, Table 3 summarizes the best results obtained with each descriptor (optimal configurations that appear in Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11). This table includes the configuration of the descriptor's parameters, the clustering method and the distance measure that have provided the best results. Also, the value of the accuracy, the inconsistency coefficient ${\delta}_{inconsist}$ and the correlation coefficient ${\gamma}_{coph}$ are shown. In light of these results, the performance of the descriptors HOG, gist and CNN can be highlighted in the task of global clustering to separate images according to the room in which they were captured, because all of them provided an accuracy equal to 1, and also relatively high consistency and correlation coefficients. Despite the visual aliasing phenomenon that is present in most indoor environments, when any of these descriptors is used, a high number of configurations of the algorithms provide successful results.

#### 4.4. Experiment 2: Creating Groups of Images to Obtain an Intermediate-Level Map

In the previous subsection, a complete evaluation has been carried out to study the performance of the descriptors and the clustering methods in the high-level mapping task. This evaluation has identified the optimal configurations to separate the images completely into clusters, so that each cluster represents one room. Once this task has been solved, the next step consists of dividing the images that belong to each room into smaller groups, with the purpose of obtaining the intermediate-level map. With this aim, the only source of information is the global visual appearance of the panoramic scenes, as in the previous experiment. This second-level clustering is carried out by means of spectral clustering, because the preliminary experiments showed its viability (Section 4.2).

As a result, the intermediate-level clustering process is expected to create groups that contain images that have been captured from geometrically near points. Since the criterion to cluster the images is the similitude of their global-appearance descriptors, the result may not be the expected one, owing to the visual aliasing phenomenon. Therefore, the main purpose of the experiment laid out in this subsection is to know if any description method is able to cope with this effect and create geometrically compact groups from pure visual information. This is a challenging problem, and finding a successful solution to it would be crucial to enable a robust and efficient localization subsequently.
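The second-level clustering step can be sketched as below. The spectral clustering variant actually used is described in Section 3 and is not reproduced in this section, so the example assumes a normalized formulation in the style of Ng, Jordan and Weiss: a Gaussian affinity matrix, the leading eigenvectors of the normalized affinity, and k-means on the row-normalized embedding. The function name, the parameter `sigma`, the deterministic farthest-point initialization and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def spectral_clustering(X, n_clusters, sigma=1.0, n_iter=50):
    """Normalized spectral clustering with a small deterministic k-means."""
    sq = cdist(X, X, metric="sqeuclidean")
    W = np.exp(-sq / (2.0 * sigma ** 2))          # Gaussian affinity
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    A_norm = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(A_norm)              # ascending eigenvalues
    U = vecs[:, -n_clusters:]                     # leading eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)

    # Farthest-point initialization, then Lloyd iterations.
    centers = [U[0]]
    for _ in range(1, n_clusters):
        dist = cdist(U, np.array(centers)).min(axis=1)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = cdist(U, centers).argmin(axis=1)
        new_centers = np.array([
            U[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Synthetic descriptors of one "room": 3 compact groups of 15 entities.
rng = np.random.default_rng(0)
blobs = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([b + 0.3 * rng.standard_normal((15, 2)) for b in blobs])
labels = spectral_clustering(X, n_clusters=3)
```

Note that the clustering only sees the descriptors; whether the resulting groups are also geometrically compact has to be checked separately against the capture points.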

First of all, this section describes the kind of experiments that are carried out and the results to be obtained. They differ from those of the previous section because the objective is also different. To measure the correctness of the resulting clusters, two parameters are used: the silhouette of the descriptors (entities) after having classified them into clusters, and the silhouette of the coordinates of the points from which the images of each cluster were captured. The silhouette is a classical method to interpret and validate clusters [44]. It provides a succinct graphical representation of the degree of similarity between each entity and the other entities of the same cluster, comparing it with the similarity with the entities belonging to the other clusters. The silhouette value ${s}_{i}$ for a specific entity ${\overrightarrow{g}}_{i}^{Pos}$ after the clustering process can be calculated with Equation (8),

$${s}_{i}=\frac{{b}_{i}-{a}_{i}}{\max ({a}_{i},{b}_{i})}\qquad (8)$$

where ${a}_{i}$ is the average distance between the entity ${\overrightarrow{g}}_{i}^{Pos}$ and the other entities contained in the same cluster, and ${b}_{i}$ is the minimum average distance between ${\overrightarrow{g}}_{i}^{Pos}$ and the entities contained in the rest of the clusters. The Euclidean distance is used in this section to make this calculation. The silhouette takes values in the range ${s}_{i}\in [-1,1]$. High values of ${s}_{i}$ indicate that ${\overrightarrow{g}}_{i}^{Pos}$ fits well within the cluster it has been assigned to and is relatively different from the entities in the other clusters. In contrast, low values denote that ${\overrightarrow{g}}_{i}^{Pos}$ is quite similar to the entities in the other clusters and does not belong consistently to the cluster it has been assigned to. After the clustering process, if the majority of entities have a high silhouette, the result of the clustering process can be considered successful. However, if many entities present a low (or even negative) silhouette, the result can be considered a failure. This may be caused by an incorrect choice of the parameters, the number of clusters or the clustering method. The complexity of the input data could also make them prone to being incorrectly clustered.
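The silhouette described above, with ${a}_{i}$ and ${b}_{i}$ computed from Euclidean distances, can be sketched in a few lines. The function `silhouette_values` is a hypothetical helper written for this example; it assumes at least two clusters and assigns a silhouette of 0 to entities that are alone in their cluster, which is a common convention rather than something stated in the text.

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_values(entities, labels):
    """Silhouette s_i = (b_i - a_i) / max(a_i, b_i) for every entity."""
    entities = np.asarray(entities, dtype=float)
    labels = np.asarray(labels)
    D = cdist(entities, entities, metric="euclidean")
    clusters = np.unique(labels)
    s = np.zeros(len(entities))
    for i in range(len(entities)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue                        # singleton cluster: s_i = 0
        # a_i: mean distance to the other entities of the same cluster
        a_i = D[i, own].sum() / (n_own - 1)
        # b_i: smallest mean distance to the entities of another cluster
        b_i = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b_i - a_i) / max(a_i, b_i)
    return s

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
s_good = silhouette_values(points, [0, 0, 1, 1])   # compact clusters
s_bad = silhouette_values(points, [0, 1, 0, 1])    # mixed clusters
```

The same function can be applied to the descriptors or to the coordinates of the capture points; only the entities passed as input change.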

To illustrate the kind of results obtained, Figure 12 shows the groups created after clustering the kitchen of the Bielefeld dataset. On the left, the capture points are shown as colored squares. The colors indicate the cluster they have been assigned to. Considering this, the result could be reported as unsuccessful, because several clusters are not geometrically compact; the capture points of the images within them are quite dispersed. The center of the figure also shows the silhouette calculated by considering the global-appearance descriptors as entities (which is the only information used during the clustering process). This silhouette diagram shows the data of each cluster with different colors. The vertical axis contains the number of each cluster. Within each cluster, the entities appear ordered from the highest to the lowest silhouette, which is the value that appears on the horizontal axis. As an example, cluster 5 has a negative average silhouette, which indicates that this cluster is extremely unnatural and inconsistent.

However, it is worth highlighting that the objective of this step consists of grouping together images that have been captured from neighboring or close points. The silhouette calculated from the descriptors does not provide this kind of information; images that were captured far away from each other may be assigned to the same cluster, owing to the visual aliasing. For this reason, an additional silhouette value has been calculated, but considering the coordinates of the capture points ${\overrightarrow{p}}_{i}={[{x}_{i},{y}_{i}]}^{T}$ as entities. This value contains information on how geometrically compact the clusters are, considering the capture points of the images. This information is used only for validation purposes, and it is not considered during the clustering process. The right side of Figure 12 shows this silhouette graphically. Those clusters that are geometrically sparser present a lower average silhouette (as clusters 3, 4 and 5 do), and compact clusters tend to exhibit a higher average silhouette (such as cluster 1).

Next, Figure 13 shows the result of an intermediate-level clustering process that has offered more successful results than the previous case. The figure shows that the resulting clusters are geometrically more compact, and the silhouettes computed from the coordinates reflect that evidence, in general terms (they are substantially higher than in the previous case).

All this information considered, a complete set of experiments has been conducted. In each experiment, the images belonging to each room have been considered as the input data (since this is the result of the high-level mapping). The output variables that permit evaluating the correctness of the process are the average silhouette calculated (a) considering the descriptors as entities, and (b) considering the coordinates of the capture points as entities. The experiment has been repeated with all the rooms, the five description methods and different configurations for the parameters that define the size of the descriptors. The results obtained with all the description methods are shown in Table 4. All these results are discussed hereafter.

Table 4 shows that the best absolute results are obtained with the gist descriptor when considering $m=16$ masks and ${k}_{6}=4$ horizontal blocks. These results are shown in bold font. In this case, the silhouette calculated from the coordinates of the capture points is ${s}_{coor}=0.4277$, which is the maximum value obtained. This would indicate that the resulting clusters are geometrically compact. A bird's eye view of the resulting clusters is shown hereafter to confirm this point. Additionally, some other conclusions can be extracted from the table by analyzing the silhouette calculated from the coordinates of the capture points. First, ${k}_{1}$ has little influence on the performance of the Fourier signature. Relatively good results are obtained independently of the size of the descriptor. Meanwhile, the behavior of rotational PCA changes substantially with the number of rotations ${N}_{R}$. As this number increases, the results tend to get worse. However, if this number is too low, the final model will not include enough information about the rotation of the robot. If ${N}_{R}=4$, the model includes information on robot rotations of $0,90,180$ and 270 degrees around each capture point, but no further information about intermediate angles. Considering this, the model might not be useful to estimate the position of the robot if its orientation differs substantially from these four.
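The relation between ${N}_{R}$ and the orientations covered by the model can be illustrated with a short sketch. It assumes that the rotated versions of a panoramic image are generated by circular shifts of its columns, with the ${N}_{R}$ orientations evenly spaced over 360 degrees; the function `rotated_versions` and the toy image are hypothetical and only serve to show the angular sampling.

```python
import numpy as np

def rotated_versions(panorama, n_rotations):
    """Angles (degrees) and column-shifted copies of a panorama,
    evenly spaced over a full turn (illustrative helper)."""
    width = panorama.shape[1]
    angles = 360.0 * np.arange(n_rotations) / n_rotations
    shifts = (np.arange(n_rotations) * width) // n_rotations
    images = [np.roll(panorama, int(s), axis=1) for s in shifts]
    return angles, images

pan = np.arange(32, dtype=float).reshape(4, 8)   # toy 4 x 8 "panorama"
angles, images = rotated_versions(pan, 4)
```

With ${N}_{R}=4$ the angular step is 90 degrees; with ${N}_{R}=64$ it drops to 5.625 degrees, at the cost of 16 times more rotated images per capture point.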

When HOG is used, the clustering algorithm tends to present low silhouette values, unless the number of horizontal cells is relatively high. Notwithstanding that, HOG exhibits a less favorable performance than the Fourier signature under all circumstances. Regarding gist, the number of masks $m$ is of paramount importance to achieve successful results. If this number is low ($m=4$), the results are particularly poor, especially when the number of horizontal blocks ${k}_{6}$ is also low. Finally, in the case of the CNN descriptor, the best results are obtained with the layer fc7, which provides a silhouette calculated from the coordinates of the capture points equal to $0.3577$.

To complete the experimental section, some additional figures are included to illustrate how compact the clusters created in the intermediate-level map are. Four of the configurations that have provided successful results in Table 4 have been selected and studied in depth: (a) gist with $m=16$, ${k}_{6}=4$ (this is the optimal result as far as ${s}_{coor}$ is concerned); (b) Fourier signature with ${k}_{1}=32$; (c) HOG with ${k}_{4}=32$; and (d) CNN descriptor obtained from the layer fc7.

Firstly, Figure 14 shows the shape of the clusters of the intermediate-level map, considering the four rooms of the Bielefeld dataset and gist with $m=16$, ${k}_{6}=4$. The capture points of the images are shown as small squares, whose colors indicate the cluster they belong to. Each room was previously separated through a high-level mapping process and then the images of every room underwent an intermediate-level clustering process. The results of this process are shown with an independent color code per room. All the clusters are noticeably compact and the number of images per cluster is balanced, despite the different grid sizes per room (Table 2). This result could be reported as successful, and it confirms the capability of the gist descriptor to address mapping tasks. Continuing with this descriptor and configuration, Figure 15 shows graphically the silhouettes of the descriptors (top row, one graphical representation per room) and the silhouettes of the coordinates of the capture points (bottom row). In general terms, they take relatively high values, and this confirms the validity of this configuration.

Secondly, Figure 16 shows graphically the clusters created in the intermediate-level layer when using the HOG descriptor with ${k}_{4}=32$ cells. Thirdly, Figure 17 presents the results of the clustering process with the Fourier signature and ${k}_{1}=32$ components per row in the matrix of magnitudes. Finally, Figure 18 shows the intermediate-level clusters obtained with the CNN descriptor associated with the layer fc7.

These figures reveal that gist, HOG and the Fourier signature can all be configured to arrive at successful results, since no substantial visual difference can be appreciated between the distribution of the clusters in these three cases. The clusters tend to be geometrically compact and the number of entities is balanced among clusters, despite the different grid sizes and the visual aliasing phenomenon. Notwithstanding that, gist with an intermediate number of masks has proved to be the description method that leads to the mathematically most accurate results when creating the clusters of this level. By considering these results together with the conclusions of previous works [19], gist can be reported as a robust global-appearance option to build models of the environment. It presents a good relationship between compactness and computational cost [19], and it presents a remarkable ability to represent, in an aggregated form, the visual contents of the scenes captured from near positions. As far as the CNN descriptor is concerned, it has presented a result that is slightly less competitive, because some clusters in the laboratory and living room are not completely compact.

#### 4.5. Final Tests

To conclude the experimental section, two additional experiments are carried out with the objective of (a) testing the performance of the mapping approach in additional environments, and (b) showing how the hierarchical maps could be used to solve the localization problem and comparing it with a global localization approach (with no hierarchical map available).

To test the feasibility of the approach in additional environments, some supplementary sets of images are considered, apart from those shown in Table 2, and the algorithms are run using all the sets of images. More precisely, four additional rooms are considered, whose main features are shown in Table 5. In this table, the first two sets belong to the database captured by Möller et al. [43] at Bielefeld University. The other two sets were captured by the authors in two spaces of Miguel Hernandez University using a catadioptric vision system composed of an Imaging Source DFK 21BF04 camera pointing towards a hyperbolic mirror (Eizoh Wide 70). In all cases, the capture points of the images form a regular grid whose size is different for each room. To carry out the following experiments, a whole dataset composed of the images of Table 2 and Table 5 is considered (that is, 8 rooms and 1660 images are used in the following tests).

Firstly, an experiment is carried out to create groups of images in order to obtain the high-level map, using the approach presented in Section 4.3. In this experiment, the gist descriptor with $m=16$ masks and the CNN descriptor are used (since both descriptors have presented a relatively good performance in Section 4.3). The results obtained with gist are shown in Figure 19, and those obtained with CNN are shown in Figure 20. Figure 19 shows that the performance of some gist configurations tends to get worse when a larger database is considered. However, some configurations still provide an accuracy equal to $100\%$ (that is, the algorithm is able to separate the images correctly into eight clusters, each corresponding to one room of the complete dataset). Among them, it is worth highlighting the single method along with the cityblock distance, because it also provides comparatively good values of the correlation and inconsistency coefficients. Figure 20, in turn, depicts the performance of CNN when the descriptor is obtained from layer (a) fc7 and (b) fc8. In both cases, the Ward method tends to present relatively good results when the algorithm is extended to additional sets of images, compared to the results initially obtained in Figure 11. The layer fc7, along with the Ward method and either ${d}_{1}$ or ${d}_{2}$, presents an accuracy equal to $100\%$ with the complete dataset. However, the correlation coefficients in these cases are slightly lower than those obtained with the optimal gist configurations.

Secondly, another experiment is performed to create groups of images in order to obtain the intermediate-level map, using the approach presented in Section 4.4. In this experiment, we make use of the gist descriptor with $m=16$ masks, since it has presented the best results in the previous test (Figure 19). Figure 21 shows the results obtained with the four additional rooms considered (the results obtained with the other rooms are the same as those shown in Figure 14). As in the experiment of Section 4.4, the results are successful when all the rooms are considered. Despite the different grid sizes and the visual aliasing phenomenon, the clusters tend to be geometrically compact.

Figure 22 shows some sample images extracted from the datasets considered in this experiment. Firstly, Figure 22a,b were captured from two distant positions of the hall (Table 2). Secondly, Figure 22c,d belong to hall 2 (Table 5), and they were captured from two distant positions. Finally, Figure 22e,f were captured from two different poses of the events room (Table 5). The two halls are visually quite similar. Also, the images captured within each room have many visual similarities. Despite that, the proposed algorithm is able to separate the images into rooms, and to create clusters within each room including images captured from geometrically near positions.

Finally, we test the validity of the hierarchical map shown in Figure 14 and Figure 21. With this aim, the hierarchical localization process is solved and the results are compared with a global localization process. In both cases, five images are randomly selected from each room (40 images in total), these images are removed from the map and the localization process is solved as an image-retrieval problem (that is, the most similar image from the map is retrieved). No prior information about the capture points of these test images is used; only visual information is considered.

On the one hand, to address the localization hierarchically, a representative descriptor is calculated for each cluster in the high-level and intermediate-level maps, by calculating the average entity of each cluster. After that, the following steps are carried out for each test image: (1) The distance between the descriptor of the test image and the representatives of the high-level map is calculated. The most similar one is retained. This identifies the room from which the test image was captured, and only the intermediate-level map contained in this room is considered in the next step. (2) The distance between the test image descriptor and each representative of the intermediate-level clusters is calculated, and the most similar one is retained. Only the descriptors contained in this cluster are considered in the next step. (3) The distance between the test image descriptor and the descriptors contained in the intermediate-level cluster is calculated, and the most similar one is retained. This localization process is considered successful if the capture point of this image is one of the four nearest neighbors to the capture point of the test image.

On the other hand, to address the localization globally, the distance between the test image descriptor and the descriptors of all the other images is calculated, and the most similar one is retained. The localization process is considered successful again if the capture point of this image is one of the four nearest neighbors to the capture point of the test image.
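The two localization strategies described above can be sketched as follows. This is an illustrative example on a toy map, not the implementation used in the experiments: the function names, the Euclidean distance, the two-dimensional descriptors and the layout of the map (2 rooms, 2 intermediate-level clusters per room, 5 images per cluster) are assumptions. The success criterion based on the four nearest capture points is not included.

```python
import numpy as np

def nearest(query, candidates):
    """Index of the candidate descriptor closest to the query."""
    return int(np.argmin(np.linalg.norm(candidates - query, axis=1)))

def hierarchical_localize(query, descriptors, room_labels, cluster_labels):
    # Step 1: compare with the representative (mean descriptor) of each room.
    rooms = np.unique(room_labels)
    room_reps = np.array([descriptors[room_labels == r].mean(axis=0) for r in rooms])
    room = rooms[nearest(query, room_reps)]
    in_room = room_labels == room
    # Step 2: compare with the representatives of the clusters of that room.
    clusters = np.unique(cluster_labels[in_room])
    cluster_reps = np.array([
        descriptors[in_room & (cluster_labels == c)].mean(axis=0) for c in clusters
    ])
    cluster = clusters[nearest(query, cluster_reps)]
    # Step 3: compare only with the descriptors inside the selected cluster.
    idx = np.flatnonzero(in_room & (cluster_labels == cluster))
    return int(idx[nearest(query, descriptors[idx])])

def global_localize(query, descriptors):
    # Exhaustive comparison with every descriptor of the map.
    return nearest(query, descriptors)

# Toy map: cluster labels are numbered independently within each room.
descriptors, room_labels, cluster_labels = [], [], []
for r in range(2):
    for c in range(2):
        for k in range(5):
            descriptors.append([100.0 * r + k, 10.0 * c])
            room_labels.append(r)
            cluster_labels.append(c)
descriptors = np.array(descriptors)
room_labels = np.array(room_labels)
cluster_labels = np.array(cluster_labels)

# Test descriptor: a slightly perturbed copy of map entry 7.
query = descriptors[7] + np.array([0.1, 0.1])
hier_idx = hierarchical_localize(query, descriptors, room_labels, cluster_labels)
glob_idx = global_localize(query, descriptors)
```

In this toy map the hierarchical search evaluates 2 + 2 + 5 = 9 distances, whereas the global search evaluates 20. The gap widens as the map grows, which is consistent with the lower localization times reported for the hierarchical approach.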

Table 6 shows the results obtained with both localization methods, when considering the gist descriptor to create the representatives of the clusters. A variety of configurations of the descriptor are considered in the experiment. The table shows the percentage of correct localizations and the average localization time for both methods. According to these results, the hierarchical localization proves to be a competitive alternative to the global localization, except when a low number of masks and blocks is considered. When $m=[16,32]$, the hierarchical localization shows the same percentage of correct localizations as the global one, and the time needed to retrieve the most similar image is substantially lower in the case of the hierarchical localization.