Development of Height Indicators using Omnidirectional Images and Global Appearance Descriptors

Abstract: Nowadays, mobile robots have become a useful tool for a wide range of applications. Their importance lies in their ability to move autonomously through unknown environments and to adapt to changing conditions. To this end, the robot must be able to build a model of the environment and to estimate its position using the information captured by the sensors it is equipped with. Omnidirectional vision sensors have become a robust option thanks to the richness of the data they capture. These data must be analysed to extract relevant information that permits estimating the position of the robot, taking into account the number of degrees of freedom it has. In this work, several methods to estimate the relative height of a mobile robot are proposed and evaluated. The framework we present is based on the global appearance of the scenes, which has emerged as an efficient and robust alternative to methods based on local features. All the algorithms have been tested with sets of images captured under real working conditions in several indoor and outdoor spaces. The results show that global appearance descriptors provide a feasible alternative to estimate topologically the relative altitude of the robot.


Introduction
Over the last few years, mobile robotics has gained presence in many kinds of environments to solve different problems, in industry, educational centres and households. To increase their range of applications, mobile robots must be able to solve the task they have been designed for in a truly autonomous way. With this aim, two crucial abilities must be developed: the robot must be able to build a model of the environment where it has to move and to estimate its position and orientation within this model.
Mobile robots may be equipped with different kinds of sensors that provide them with information that permits solving the mapping and localization problems. These sensors can be categorised into proprioceptive and exteroceptive. On the one hand, proprioceptive sensors measure the state of the robot. Encoders installed in the wheels are an example. Through an odometry process they permit estimating the displacement of the robot, but the error in this estimation accumulates without bound. This is why encoders tend to be used in combination with other sources of information [1]. On the other hand, exteroceptive sensors measure some information from the environment where the robot is moving. Among them, some researchers have made use of GPS, SONAR and laser sensors in previous works. First, GPS (Global Positioning System) constitutes a robust choice outdoors, but the information it provides close to buildings and on narrow streets is less reliable [2]. Second, SONAR and laser sensors permit measuring the distance to the objects around the robot using different technologies. The use of laser sensors has extended substantially in mobile robotics [3] because they tend to offer better precision and angular resolution than SONAR. Nevertheless, they also present a higher cost, weight and power consumption.
More recently, vision sensors have gained popularity because they present several advantages. They capture a large amount of information from the environment, which can be used to carry out many high level tasks apart from mapping and localization, such as the detection and recognition of people and objects. They also present a relatively low cost and power consumption compared with laser rangefinders, and the information they provide is stable both outdoors and indoors, unlike GPS, whose signal tends to degrade indoors. Vision sensors can be used either as the only source of information from the environment or in combination with other kinds of sensors [4]. Initially, monocular configurations were used, but later works tried to expand the field of view through other configurations such as binocular [5] or omnidirectional [6][7][8]. In recent years, the use of omnidirectional vision sensors has expanded thanks to the large amount of information they capture (they are able to take images with a field of view of 360 deg around the robot) at a relatively low cost. Nevertheless, working with omnidirectional visual data is a complex task when the robot has to create models of large and complex environments while moving with more than 3 Degrees Of Freedom (DOF). In these cases, it is necessary to extract some information from the scenes to build a robust model that gathers relevant and invariant knowledge from the environment and that permits estimating the position of the robot with the required DOF. Extracting this useful information from the scenes is a key point, since images are very high dimensional data that change not only when the robot moves along any DOF but also under other circumstances such as variations in lighting conditions. Two main approaches have been used by researchers to compile such information.
On the one hand, some relevant landmarks or outstanding points or regions (either natural or artificial) can be extracted and described using any local descriptor that captures the appearance of the landmarks' neighbourhood trying to get invariance to position, scale and rotation [9][10][11][12]. On the other hand, each scene can be represented through a unique global appearance descriptor that contains information on the whole scene [13][14][15][16].
Many authors have addressed the mapping and localization problems using vision sensors. The different approaches proposed can be roughly classified into three types, depending on the contents and the internal structure of the models. First, a metric map can be built by defining the position of some outstanding points of the environment with respect to a reference system. These models permit estimating the position of the robot with geometrical accuracy up to a specific error [17][18][19]. Second, topological maps usually represent some characteristic places of the environment and the connectivity relationships between them. They tend to be simpler representations, but they usually contain enough detail for most applications [20]. Third, hybrid maps try to gather the advantages of the two previous approaches. They arrange the information into several layers with different levels of detail, containing topological models in the top layers that permit a rough localization and metric models in the bottom layers to refine this localization [21][22][23].
Traditionally, metric maps have been built using methods based on the extraction, description and tracking of local features along a set of scenes captured by the vision sensor mounted on the robot. This information is often combined with other sources of information, such as odometry or laser [24]. However, these models tend to be quite complex and not easily interpretable by a human operator, and the localization process is usually elaborate and computationally expensive. In contrast, topological maps offer more intuitive representations of the environment [25]. Visual global-appearance approaches are well suited to creating such models, since no metric information can be extracted from global descriptors. They lead to simpler models where the localization process is more straightforward, based mainly on the pairwise comparison between image descriptors [26].
The problem of robot localization using visual models of the environment has been thoroughly studied when the trajectory of the robot is contained in the ground plane. Researchers have made use of both local features [27] and global appearance [28] to build image descriptors that permit solving this problem. Beyond that, in some applications, it may be useful to estimate the altitude of the robot with respect to this plane, without changing the visual map or including any further visual information. The proposal of the present work fits within this area. The main objective consists in extracting some information from the images that permits estimating the relative altitude of the robot, using the previously built visual model of the environment.
Regarding the choice of the approach to describe the scenes, some researchers have addressed the altitude estimation problem using local descriptors [29][30][31][32]. However, the literature on altitude estimation using only visual information and global appearance descriptors is quite sparse, despite the advantages that global appearance descriptors can offer to the mapping and localization problems [33]. Taking this into account, this work focuses on studying the performance of global-appearance methods to solve this problem. The only source of information used is a catadioptric vision sensor mounted on the mobile platform.
The contribution of this paper is twofold. On the one hand, some global appearance approaches are defined to incorporate the altitude information in the descriptors, and some methods are proposed to estimate this relative altitude robustly without changing the visual model of the environment or adding any information to it. On the other hand, a comparative evaluation of these methods is carried out to analyse their behaviour both indoors and outdoors. The altitude estimators we propose go beyond the classical topological notion of connectivity and introduce the concepts of closeness and farness; that is, they estimate the relative height of the robot up to a scale factor. The results of this paper, jointly with the developments presented in [26,34], show the usefulness of the global appearance descriptors to estimate the position and orientation of the robot in the ground plane and its altitude with respect to this plane, in a straightforward way. This constitutes a step forward towards the definition of global appearance descriptors that permit building models of the environment and carrying out localization when the robot moves with 6 DOF.
The remainder of the paper is structured as follows. Section 2 reviews some techniques to describe globally the appearance of omnidirectional visual information. Then, in Section 3 the methods implemented to estimate topological height from visual information are detailed. After that, Section 4 describes the geometry of the vision system and the sets of images used to carry out the experiments, whose results are presented in Section 5. Finally, the conclusions and future works are outlined in Section 6.

Omnidirectional Imaging and Global Appearance Descriptors
Throughout the paper, the use of omnidirectional visual information along with global appearance descriptors is proposed to develop some height indicators. This section presents some fundamentals on the kind of sensors used to obtain the omnidirectional images (Section 2.1) and on the mathematical methods used to describe the global appearance of these images (Section 2.2). The information contained in these descriptors will be used in Section 3 to estimate topological height.

Catadioptric Vision Sensors
Catadioptric vision sensors consist of a conventional camera pointing towards a convex mirror with their axes aligned. In this work, a hyperbolic mirror is considered and the axes of the mirror and the camera are always parallel to the z-axis of the world reference system. The world information reflects onto the mirror and the camera captures this reflection, composing the omnidirectional image. Figure 1 shows the projection model of the catadioptric vision system, including the World Reference System (WRS), the Camera Reference System (CRS), centered on the focal point of the hyperbolic mirror $F$, and the Image Reference System (IRS). $F'$ is the focal point of the camera. The figure shows the projection of a world point $P$ onto the mirror, $Q$, and from the mirror to the image plane, $m$. The image projected onto the image plane (plane of projection) is the omnidirectional view. The calibration of the catadioptric system provides us with a function $f(a)$ that permits calculating, for each pixel in the image, the coordinates of the point of the mirror that has produced this pixel, with respect to the CRS [35]. Here $a$ is the distance between the pixel considered and the center of the image, expressed in pixels. From the omnidirectional image, other projected versions of the visual information can be obtained, such as the cylindrical projection (panoramic image), the orthographic projection (projection onto a plane) or the unit sphere projection [36]. Figure 2 shows (a) a sample omnidirectional image and three different projections obtained from it: (b) orthographic projection onto a plane parallel to the ground plane; (c) unit sphere projection and (d) cylindrical projection or panoramic view.

Global Appearance Descriptors
Descriptors based on the global appearance of images captured by a catadioptric vision system have shown good performance both in position and orientation estimation when the movement of the robot is restricted to the floor plane, as Gaspar et al. [36] and Payá et al. [26] show. These methods extract the most relevant information from each image and reduce the amount of memory necessary to store the visual information by working with the image as a whole, i.e., avoiding the extraction of landmarks or local features. In this work, three different techniques based on the Discrete Fourier Transform (DFT) are considered: the Fourier Signature (FS), the two-dimensional Discrete Fourier Transform (2D-DFT) and the Spherical Fourier Transform (SFT). The next subsections briefly outline these description methods and present some relevant mathematical properties. After that, in Section 3 some methods are proposed to develop height indicators using these description methods and their properties.

Fourier Signature
The Fourier Signature (FS) was first described by Menegatti et al. [37], who used it to carry out mapping and localization with a robot whose movement is restricted to the ground plane. It consists in representing a panoramic image through the one-dimensional DFT of each of its rows. When the DFT of every row of the image $im(x_i, y_i) \in \mathbb{R}^{M \times N}$ is calculated, a new matrix $IM(u, y_i) \in \mathbb{C}^{M \times N}$ is obtained, where $u$ is the frequency variable (cycles/pixel). The components of this matrix are complex numbers, so it can be decomposed into a magnitudes matrix $A(u, y_i) \in \mathbb{R}^{M \times N}$ and an arguments matrix $\Theta(u, y_i) \in \mathbb{R}^{M \times N}$. Taking the properties of the DFT into account [37], only a subset of $k$ columns needs to be retained to represent the image: $A(u, y_i) \in \mathbb{R}^{M \times k}$ and $\Theta(u, y_i) \in \mathbb{R}^{M \times k}$, $k \le N$. The FS presents another interesting property when used to describe a panoramic image. If the image comes from an omnidirectional vision sensor mounted vertically on the robot, then the magnitudes matrix $A(u, y_i)$ is invariant against rotations around the vertical axis. Consider two panoramic images captured from the same position on the ground plane but with the robot in different orientations around the vertical axis, with relative orientation $\phi$, as shown in Figure 3. If row $m$ of the first image is represented as the sequence $\{r_m\} = \{r_{m,n}\}$, $n = 0, \dots, N-1$, then the same row in the second (rotated) image is $\{r_{m,n-q}\}$, where $q$ is the shift between images, measured in pixels, which is proportional to the relative rotation between images: $q = N \cdot \phi / 360$, where $\phi$ is measured in deg.
Visually this shift appears as a circular shift of the columns of the image (Figure 3).
The rotational invariance can be expressed by the DFT shift theorem as:
$$\mathcal{F}[\{r_{m,n-q}\}] = R_{m,l} \, e^{-j \frac{2 \pi q l}{N}}, \quad l = 0, \dots, N-1 \qquad (1)$$
where $\mathcal{F}[\{r_{m,n-q}\}]$ is the one-dimensional DFT of the shifted sequence, and $R_{m,l}$ are the components of the one-dimensional DFT of the non-shifted sequence (row $\{r_m\}$). Taking this theorem into account, when the movement of the robot is contained in the ground plane, the magnitudes matrix can be used to estimate the position (since it is invariant to rotation), and the arguments matrix to estimate the relative orientation.
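As a minimal illustration of the two properties above, the following NumPy sketch builds the Fourier Signature of a random array that stands in for a panoramic image, applies a circular column shift and checks that the magnitudes are unchanged while the phase of the first harmonic reveals the shift. The function name `fourier_signature`, the image size and the values of `k` and `q` are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def fourier_signature(panorama, k):
    """Row-wise 1D DFT of a panoramic image, keeping the first k columns."""
    spectrum = np.fft.fft(panorama, axis=1)[:, :k]
    return np.abs(spectrum), np.angle(spectrum)

rng = np.random.default_rng(0)
M, N, k, q = 16, 64, 8, 5             # illustrative sizes; q is the column shift in pixels
im = rng.random((M, N))               # stand-in for a panoramic image
im_rot = np.roll(im, q, axis=1)       # row m becomes {r_{m,n-q}} (rotation about the vertical axis)

A, theta = fourier_signature(im, k)
A_rot, theta_rot = fourier_signature(im_rot, k)

# Magnitudes are invariant to the circular shift.
magnitudes_match = np.allclose(A, A_rot)

# The phase of harmonic l = 1 changes by -2*pi*q/N, so q can be read back from it.
dphi = np.angle(np.exp(1j * (theta[:, 1] - theta_rot[:, 1])))   # wrapped phase difference
q_est = int(round(float(np.median(dphi)) * N / (2 * np.pi))) % N
phi_est = q_est * 360.0 / N           # relative orientation in deg
```

With these values the recovered shift is 5 pixels, i.e., a relative orientation of 5 · 360/64 deg. Using only the first harmonic is a simplification of this sketch; it is unambiguous for any shift here, but a practical estimator would typically combine several harmonics to reduce the effect of noise.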

Two-dimensional Discrete Fourier Transform
The two-dimensional Discrete Fourier Transform (2D-DFT) of an image $im(x_i, y_i) \in \mathbb{R}^{M \times N}$ can be expressed as a new matrix $IM(u, v) \in \mathbb{C}^{M \times N}$ that can be split into two matrices, one containing the magnitudes $A(u, v) \in \mathbb{R}^{M \times N}$ (or power spectrum) and the other the arguments $\Theta(u, v) \in \mathbb{R}^{M \times N}$. Since the most relevant information in the Fourier domain is concentrated in the low frequency components, and the high frequency information is usually more affected by noise, retaining only a number of low frequency components may lead to better localization results at a lower computational cost. Taking this fact into account, the number of rows retained from the matrices $A$ and $\Theta$ will be $k_1 \le M$ and the number of columns $k_2 \le N$.
Another interesting property when working with panoramic images is the rotational invariance, which is reflected in the shift theorem:
$$\mathcal{F}[im(x_i - x_0, y_i - y_0)] = IM(u, v) \cdot e^{-2 \pi j \left( \frac{u x_0}{N} + \frac{v y_0}{M} \right)} \qquad (2)$$
where $IM(u, v)$ is the 2D-DFT of the original image $im(x_i, y_i)$ and $im(x_i - x_0, y_i - y_0)$ is a shifted version of this image. According to this theorem, the power spectrum of the shifted image is the same as that of the original image and only the argument of the components of the transformed image changes, by an amount that depends on the shift along the $x_i$-axis ($x_0$) and the $y_i$-axis ($y_0$). Thanks to this property, this transform has previously been used to estimate the position and orientation of a robot when it moves on the ground plane [26]. In this case, if the robot captures two panoramic scenes $im_1$ and $im_2$ from the same position on the ground plane but with different orientations, the magnitudes matrices $A_1$ and $A_2$ are the same and only a shift along the $x_i$-axis is produced, which can be calculated from the theorem and used to estimate the relative orientation.
Equation (2) also shows that the first row of the 2D-DFT, which corresponds with $v = 0$, is only affected by shifts along the $x_i$-axis of the image, whereas the first column of the transform, which corresponds with $u = 0$, is only affected by shifts along the $y_i$-axis.
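The properties stated for the 2D-DFT can be checked numerically. The sketch below is an illustration under assumptions: it uses a random array in place of a panoramic image, takes $x$ as the column index and $y$ as the row index, and uses NumPy's `fft2`, whose output is indexed as `[v, u]`. The sizes and shifts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 32, 64                                    # rows and columns of the stand-in image
im = rng.random((M, N))
x0, y0 = 7, 3                                    # circular shifts along x (columns) and y (rows)
im_shift = np.roll(im, (y0, x0), axis=(0, 1))    # im(x - x0, y - y0)

IM = np.fft.fft2(im)
IM_shift = np.fft.fft2(im_shift)

# Phase factor predicted by the shift theorem (v indexes rows, u indexes columns).
v = np.arange(M).reshape(-1, 1)
u = np.arange(N).reshape(1, -1)
predicted = IM * np.exp(-2j * np.pi * (u * x0 / N + v * y0 / M))

theorem_holds = np.allclose(IM_shift, predicted)
same_power = np.allclose(np.abs(IM), np.abs(IM_shift))

# The first row (v = 0) ignores a vertical shift; the first column (u = 0) ignores a horizontal one.
only_y = np.roll(im, y0, axis=0)
only_x = np.roll(im, x0, axis=1)
row0_unaffected = np.allclose(np.fft.fft2(only_y)[0, :], IM[0, :])
col0_unaffected = np.allclose(np.fft.fft2(only_x)[:, 0], IM[:, 0])
```

The equalities hold exactly only for circular shifts, which is the case for rotations about the vertical axis in a panoramic image.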

Spherical Fourier Transform
Omnidirectional images can be projected onto the unit sphere when the intrinsic parameters of the catadioptric vision system are known. With $\theta \in [0, \pi]$ the colatitude angle and $\phi \in [0, 2\pi)$ the azimuth angle, the projection of the omnidirectional image $im(x_i, y_i) \in \mathbb{R}^{M \times N}$ onto the 2D sphere can be expressed as $f(\theta, \phi)$. As shown in [38], the spherical harmonic functions $Y_{lm}$ form a complete orthonormal basis over the unit sphere. Any square integrable function defined on the sphere, $f \in L^2(S^2)$, can be represented by its spherical harmonic expansion as:
$$f(\theta, \phi) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} f_{lm} \, Y_{lm}(\theta, \phi) \qquad (3)$$
with $l \in \mathbb{N}$ and $m \in \mathbb{Z}$, $|m| \le l$. $f_{lm} \in \mathbb{C}$ denotes the spherical harmonic coefficients, and $Y_{lm}$ the spherical harmonic function of degree $l$ and order $m$, defined by:
$$Y_{lm}(\theta, \phi) = \sqrt{\frac{2l+1}{4\pi} \frac{(l-m)!}{(l+m)!}} \; P_l^m(\cos\theta) \, e^{j m \phi} \qquad (4)$$
where $P_l^m(x)$ are the associated Legendre functions. It is also possible to build a rotationally invariant representation of omnidirectional images using the Spherical Fourier Transform (SFT). With $B$ the band limit of $f$, the coefficients of $e = (e_1, \dots, e_B)$ are not affected by 3D rotations of the signal, where:
$$e_l = \sqrt{\sum_{|m| \le l} |f_{lm}|^2} \qquad (5)$$
More information and examples of applications of the SFT in navigation tasks can be found in [39][40][41][42]. Makadia et al. [39] introduce the estimation of 3D rotations by extending the shift theorem to the SFT. Schairer et al. [40] present a rotation estimation algorithm based on the SOFT (SO(3) Fourier Transform), where SO(3) is the 3D rotation group. In turn, Huhle et al. [41] and Schairer et al. [42] show a localization method using the SFT applied to omnidirectional images and a predictive model based on Gaussian process regression.
In this work, we take advantage of the rotational invariance properties of the SFT to describe the scenes, using $e_l$ with this aim.
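The invariance of this descriptor can be checked with a short derivation. The sketch below assumes that $e_l$ is the usual per-degree norm of the spherical harmonic coefficients and uses the Wigner D-matrices $D^l(R)$, which are not introduced in the text; index conventions for the rotated coefficients vary between references, but the unitarity argument does not depend on them.

```latex
% Rotated signal: f'(\omega) = f(R^{-1}\omega), with R \in SO(3).
% Coefficients of a given degree l only mix among themselves:
f'_{lm} = \sum_{m'=-l}^{l} D^{l}_{mm'}(R)\, f_{lm'},
\qquad D^{l}(R)^{H} D^{l}(R) = I_{2l+1}

% Unitarity preserves the Euclidean norm of each degree-l block of coefficients:
e'_l = \sqrt{\sum_{|m| \le l} |f'_{lm}|^2}
     = \sqrt{\sum_{|m| \le l} |f_{lm}|^2}
     = e_l
```

Each $e_l$ therefore discards the orientation of the sensor in 3D while keeping how the energy of the scene is distributed across degrees.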

Development of Height Indicators Using Global Appearance Descriptors
Our previous works [26,34] have focused on building a visual model of the environment and estimating the position of the mobile platform when its movement is contained in the ground plane, using the global appearance of the scenes with this aim. However, as stated previously, it is also interesting to study the possibility of estimating the altitude of the vehicle with respect to the plane where it moved when the visual model was created. With this goal, several methods are proposed in this section and analysed in the subsequent sections in order to know the accuracy and advantages of each one.
In all cases, only visual information is used to estimate the relative altitude. The two images to compare are named reference and test image ($im_R$ and $im_T$ respectively) and the algorithms estimate the height of $im_T$ with respect to $im_R$. Since the objective of this work consists in studying the performance of some methods in altitude estimation, we consider that the images to compare have been captured along a line parallel to the z-axis of the WRS, with the axis of the catadioptric system in vertical position (Figure 1). This way, we isolate the effect of height changes in the images.
To compare two scenes using their global appearance, a distance measurement must be defined. In this work, the image distance is defined as the Euclidean distance between descriptors. With $\vec{d}_T \in \mathbb{R}^{n \times 1}$ the descriptor of the test image and $\vec{d}_R \in \mathbb{R}^{n \times 1}$ the descriptor of the reference image, the image distance can be obtained as:
$$dist(\vec{d}_T, \vec{d}_R) = \sqrt{\sum_{i=1}^{n} \left( d_{T,i} - d_{R,i} \right)^2} \qquad (6)$$
The best match among a set of different comparisons is found by choosing the one with minimum distance.
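The comparison step is shared by all the methods that follow, so a minimal sketch may help. The function names `image_distance` and `best_match` and the toy descriptors are illustrative assumptions.

```python
import numpy as np

def image_distance(d_t, d_r):
    """Euclidean distance between two descriptors of equal length."""
    d_t = np.asarray(d_t, dtype=float).ravel()
    d_r = np.asarray(d_r, dtype=float).ravel()
    return float(np.sqrt(np.sum((d_t - d_r) ** 2)))

def best_match(d_t, candidates):
    """Index of the candidate descriptor closest to d_t, plus all the distances."""
    distances = [image_distance(d_t, d_r) for d_r in candidates]
    return int(np.argmin(distances)), distances

# Toy descriptors: the second candidate is the closest one to the test descriptor.
d_test = [1.0, 2.0, 2.0]
d_refs = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [4.0, 6.0, 2.0]]
index, distances = best_match(d_test, d_refs)
```

Matrix descriptors, such as a magnitudes matrix, are flattened before the comparison, which is equivalent to using the Frobenius norm of their difference.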
The next subsections present the relative height estimation methods in detail. Four methods based on global appearance have been implemented and tested. They are based on the descriptors presented in Section 2.2. Also, for comparative purposes, an additional method that uses local features is proposed in Section 3.5.

Method 1: Central Cell Correlation of Panoramic Images
In a panoramic image, the most distinctive information is usually in the central rows of the scene. In outdoor environments, the bottom rows normally correspond to the terrain and the upper rows to the sky, whereas in indoor environments they correspond to the floor and the ceiling respectively. If the altitude of the catadioptric system changes, either upwards or downwards, the area constituted by the central rows of the panoramic image is less likely to go out of the camera field of view. Taking this fact into account, in this method, the global appearance of the central rows of the reference and test images is compared to estimate the relative height between their capture points.
First, the algorithm computes a global appearance descriptor of the central cell of $im_R$ (the portion composed of the central rows). To obtain this descriptor, either the FS (Section 2.2.1) or the 2D-DFT (Section 2.2.2) can be used. This process is repeated for different cells situated above and below the central cell. Figure 4 shows a sample image and some cells extracted from it. The central cell is emphasized with a wider line, and some additional cells have been defined both above and below it. $d$ is the vertical distance (measured in pixels) from each additional cell to the central one.
Considering now a test image $im_T$, the algorithm computes the descriptor of its central cell, compares it with all the descriptors of the cells extracted from $im_R$ and retains the best match. The position ($d$) of the cell in $im_R$ that best matches the central cell of $im_T$ is a measurement of the relative altitude. Therefore, the displacement is measured in pixels and it can be considered as a topological distance.
To illustrate this method, two 128 × 512 panoramic images captured from different heights are considered (Figure 5). Figure 5a is the reference image ($im_R$) and Figure 5b is the test image ($im_T$), which was captured 60 cm higher than $im_R$. In this example, the size of the cells is 64 × 512 pixels and the FS is used to describe them. First, $im_R$ is considered: its central cell and the additional cells above and below it are extracted, the FS of each cell is calculated and its magnitudes matrix is retained as the descriptor of that cell. Second, $im_T$ is considered and its central cell is extracted. After that, the FS of this cell is calculated and its magnitudes matrix is obtained. The result is the descriptor $A_T$. When all this information is available, the algorithm calculates the Euclidean distance (Equation (6)) between the descriptor $A_T$ and the descriptor of each cell of $im_R$. The value of $d$ that minimises this distance can be considered as a height indicator. Figure 6 shows this distance for each value of $d$. In this case, the minimum is produced at $d = 3$. Since $d > 0$, $im_T$ was captured from a higher position than $im_R$.
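The search described in this subsection can be sketched in a few lines. The code below is an illustration under several assumptions: the two "panoramic images" are windows cut from a random strip, so the change between them is a pure row shift (real height changes only approximate this); positive `d` is taken to mean a cell above the central one; and the function names and sizes are hypothetical.

```python
import numpy as np

def fs_magnitudes(cell, k):
    """Magnitudes of the Fourier Signature of a cell (first k columns)."""
    return np.abs(np.fft.fft(cell, axis=1)[:, :k])

def central_cell_offset(im_ref, im_test, cell_rows, offsets, k=16):
    """Offset d of the reference cell that best matches the central test cell."""
    top = (im_ref.shape[0] - cell_rows) // 2          # first row of the central cell
    a_test = fs_magnitudes(im_test[top:top + cell_rows], k)
    dists = []
    for d in offsets:                                  # offsets must keep the cell inside the image
        cell = im_ref[top - d:top - d + cell_rows]     # d > 0: cell above the central one
        dists.append(float(np.linalg.norm(fs_magnitudes(cell, k) - a_test)))
    return offsets[int(np.argmin(dists))], dists

# Synthetic scene: when the camera rises, the content moves down in the image.
rng = np.random.default_rng(2)
world = rng.random((200, 256))
im_ref = world[40:168]                 # 128 x 256 reference image
im_test = world[24:152]                # same scene displaced 16 rows downwards in the image

offsets = list(range(-32, 33, 8))
d_best, dists = central_cell_offset(im_ref, im_test, cell_rows=64, offsets=offsets)
```

In this synthetic case the minimum is found at `d_best = 16` with zero distance, because the matching cell is an exact copy. With real images the minimum is not zero and the curve of distances against `d` plays the role of Figure 6.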

Method 2: 2D-DFT Vertical Phase
This method is based on the use of the 2D-DFT and the shift theorem presented in Section 2.2.2. Traditionally, this method has been used to estimate the relative orientation of the robot with respect to the vertical axis when its movement is contained in the ground plane [26]. In this case, a change in the orientation of the robot produces a circular shift of the columns of the panoramic image which can be estimated through the shift theorem (Equation (2)).
Besides, this descriptor can also be used to estimate relative height since a vertical displacement of the robot will produce a shift of the rows of the panoramic image that can also be estimated through the shift theorem. However, the use of the theorem in this case is not direct because, unlike a rotation around the vertical axis, a vertical movement does not produce a circular shift of rows and the information in the scene is thus modified; after the vertical displacement some rows of the original image will go out and new rows will appear. Taking this fact into account, the magnitudes matrix of the transformed image will experience some changes hence the shift theorem is not exactly met.
Despite the issues described above, preliminary experiments showed that the great majority of the visual information remains after a vertical displacement; therefore, a circular shift of the image rows is assumed when using this method. For this reason, in order to estimate topologically the vertical displacement between the reference and the test images, we use the arguments matrices of their 2D-DFT, $\Theta_R$ and $\Theta_T$, where only the first $k_1 \times k_2$ components have been retained. We consider $k_1 = k_2 = N_F$. As stated before, a vertical movement in the spatial domain produces a change in the phase of the coefficients in the frequency domain. Our approach simulates different shifts on the matrix $\Theta_R$ (using Equation (2)), compares each shifted matrix with $\Theta_T$ and retains the shift that produces the best match. A circular shift of $S$ deg of the rows of the reference image produces a change in its arguments matrix $\Theta_R \in \mathbb{R}^{N_F \times N_F}$ that can be simulated through the next expression:
$$\Theta_R^{rotated} = \Theta_R + VRM \cdot S \qquad (7)$$
where VRM is the Vertical Rotation Matrix, defined as:
$$VRM = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ N_F - 1 & N_F - 1 & \cdots & N_F - 1 \end{bmatrix} \in \mathbb{R}^{N_F \times N_F} \qquad (8)$$
To illustrate this property we consider a sample panoramic image $im_R \in \mathbb{R}^{M \times N}$, where $M = 500$ and $N = 2000$. From it, a new image $im_R^{rotated}$ with the same size is generated, considering a circular shift of $N_R = 25$ rows (the displacement is towards the top of the image). This is equivalent to generating a shift $S = N_R \cdot 360 / M = 18$ deg. The original and the shifted images are shown in Figure 7. Using these two images, the next sequence of operations is carried out. First, the 2D-DFT of both images is calculated, resulting in the matrices $IM_R$ and $IM_R^{rotated}$. Second, the magnitudes and the arguments matrices of both transforms are obtained and just the first $N_F = 4$ rows and columns are retained. Equation (9) shows the two magnitudes matrices and Equation (10) the two arguments matrices. On the one hand, Equation (9) shows that both magnitudes matrices are identical, as expected according to Equation (2). On the other hand, the relationship between both arguments matrices meets Equation (7), as detailed in the next equation:
$$\Theta_R^{rotated} - \Theta_R = VRM \cdot 18 \qquad (11)$$
Taking all these facts into consideration, the next steps are followed in the height estimation application. First, from the reference image $im_R$, a set of rotated arguments matrices $\Theta_R^{rotated}$ is generated considering $S = [-180 + \Delta S, -180 + 2\Delta S, \dots, 180]$ deg. In the experiments, $\Delta S$ is given a value equal to 0.5 deg.
Second, when a new test image $im_T$ arrives, the arguments matrix of its 2D-DFT, $\Theta_T \in \mathbb{R}^{N_F \times N_F}$, is obtained and compared with the set of matrices $\Theta_R^{rotated}$ generated from the reference image. The coefficient $S$ that produces the best match (the minimum distance) is a topological measurement of the relative altitude between images.
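The search over $S$ can be sketched as follows. This is an illustration under assumptions rather than the implementation used in the experiments: it assumes that a circular shift of $S$ deg towards the top of the image adds $v \cdot S$ deg to the phase of every component in row $v$ of the transform (which follows from the shift theorem when rows are numbered from the top), it wraps the phase differences to $(-180, 180]$ deg before taking the norm (an addition of this sketch), and it uses a small random image. The function names are hypothetical.

```python
import numpy as np

def phase_deg(im, nf):
    """Arguments (in deg) of the first nf x nf components of the 2D-DFT."""
    return np.degrees(np.angle(np.fft.fft2(im)[:nf, :nf]))

def wrap_deg(a):
    """Wrap angles to the interval [-180, 180) deg."""
    return (a + 180.0) % 360.0 - 180.0

def vertical_phase_shift(im_ref, im_test, nf=4, step=0.5):
    """Shift S (deg) whose simulated phase change best explains the test image."""
    theta_r = phase_deg(im_ref, nf)
    theta_t = phase_deg(im_test, nf)
    # Row v of this matrix is filled with v (assumed form of the phase-update matrix).
    vrm = np.repeat(np.arange(nf, dtype=float).reshape(-1, 1), nf, axis=1)
    candidates = np.arange(-180.0 + step, 180.0 + step / 2, step)
    dists = [np.linalg.norm(wrap_deg(theta_r + vrm * s - theta_t)) for s in candidates]
    return float(candidates[int(np.argmin(dists))])

rng = np.random.default_rng(4)
M, N, n_r = 100, 200, 5                    # shifting 5 of 100 rows corresponds to 18 deg
im_ref = rng.random((M, N))
im_test = np.roll(im_ref, -n_r, axis=0)    # content displaced towards the top of the image
s_best = vertical_phase_shift(im_ref, im_test)
```

Here the shift is exactly circular, so the search returns 18 deg. With real image pairs the rows that leave and enter the field of view perturb both magnitudes and phases, and the minimum is only approximate.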

Method 3: Multiscale Analysis of the Orthographic View
In this method, a multiscale analysis is carried out to estimate the relative height. This analysis consists in carrying out several artificial zooms of the central area of the scenes and has been used previously to estimate the topological distance between the capture points of two scenes when the robot moves in the ground plane [43]. To obtain consistent results, the projection plane of the images must be perpendicular to the direction of the movement. Since we consider vertical movements in this work, an orthographic projection of the omnidirectional image onto a horizontal plane must be used.
The method consists in generating several orthographic projections of the reference image, considering different focal distances to the plane where the image is projected. This is equivalent to generating a set of orthographic projections with different zooms. Figure 8 shows a sample omnidirectional image and three of its orthographic views, assuming three different focal distances $fc$ for the projection plane. After that, the different projections are described using global appearance. This way, a set of descriptors is generated from $im_R$, each one with an associated focal distance:
$$\{ (\vec{d}_{R,i}, fc_i) \}, \quad i = 1, 2, \dots \qquad (12)$$
When a new omnidirectional test image arrives, the algorithm computes an orthographic view with a specific focal distance and calculates its descriptor, obtaining the pair $(\vec{d}_T, fc_T)$. Next, the descriptor of the test image $\vec{d}_T$ is compared with all the descriptors of the reference image and the best match (minimum distance) is retained.
We retain the focal distance of the reference image where the minimum is found, $fc_{i_0}$. The difference between the focal distance of the test image projection and the focal distance of the matched reference image projection, $fc_T - fc_{i_0}$, is a topological measurement of relative height. In the experiments, the set of focal distances for the reference image is generated in the range $fc_i \in [4, 11]$, while the focal distance of the test image is $fc_T = 7$.
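The structure of this search can be sketched without a calibrated sensor. In the code below a nearest-neighbour digital zoom about the image centre stands in for the orthographic reprojection at a different focal distance, and the zoom factor is assumed to be the ratio `fc / fc_test`; both are simplifications of this sketch, not the projection used in the paper. The descriptor (low-frequency 2D-DFT magnitudes), the function names and the integer grid of focal distances are also illustrative choices.

```python
import numpy as np

def zoom_center(im, z):
    """Nearest-neighbour zoom about the image centre (stand-in for reprojection)."""
    rows, cols = im.shape
    cr, cc = (rows - 1) / 2.0, (cols - 1) / 2.0
    src_r = np.clip(np.rint((np.arange(rows) - cr) / z + cr).astype(int), 0, rows - 1)
    src_c = np.clip(np.rint((np.arange(cols) - cc) / z + cc).astype(int), 0, cols - 1)
    return im[np.ix_(src_r, src_c)]

def descriptor(view, k=8):
    """Low-frequency magnitudes of the 2D-DFT, flattened into a vector."""
    return np.abs(np.fft.fft2(view))[:k, :k].ravel()

def multiscale_height_indicator(ref_view, test_view, fc_ref_values, fc_test):
    """Return fc_test - fc_i0, where fc_i0 gives the best-matching reference zoom."""
    d_test = descriptor(test_view)
    dists = [np.linalg.norm(descriptor(zoom_center(ref_view, fc / fc_test)) - d_test)
             for fc in fc_ref_values]
    fc_i0 = fc_ref_values[int(np.argmin(dists))]
    return fc_test - fc_i0, fc_i0

rng = np.random.default_rng(3)
scene = rng.random((64, 64))                       # stand-in for an orthographic view
fc_test = 7.0
fc_ref_values = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
test_view = zoom_center(scene, 9.0 / fc_test)      # synthetic test view: the scene magnified by 9/7
indicator, fc_i0 = multiscale_height_indicator(scene, test_view, fc_ref_values, fc_test)
```

In this synthetic case the best match is found at a reference focal distance of 9, so the indicator equals 7 - 9 = -2. How the sign and magnitude of the indicator relate to a physical height depends on the projection plane and on the scene, which is why the result is treated as a topological measurement.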

Method 4: Change of the Camera Reference System (CRS)
The fourth method consists in simulating an artificial movement of the camera and calculating the new coordinates of the pixels of the image after the movement. Some researchers (such as Valiente et al. [44]) have used this technique to simulate a displacement of the Camera Reference System (CRS) using the epipolar geometry.
To obtain the new image after the artificial displacement, the next steps are followed. First, each pixel of the original omnidirectional image is retroprojected to obtain its coordinates with respect to the WRS. Let $m \in \mathbb{R}^2$ be the coordinates of one pixel of the omnidirectional image (with respect to the IRS) and $a$ the distance from this pixel to the center of the omnidirectional image. The function $f(a)$ (an example is shown in Equation (15)), obtained from the calibration of the catadioptric system, permits calculating the point of the mirror $Q$ that projects onto this pixel $m$. This point can be retroprojected onto the unit sphere and, as a result, the coordinates of this projection $M \in \mathbb{R}^3$ with respect to the CRS can be obtained. After that, a movement of the camera is simulated through a change of the CRS, and the new coordinates $M'$ with respect to the new CRS will be:
$$M' = M + \rho \cdot T \qquad (14)$$
where $T$ is the unitary displacement vector and $\rho$ is the scale factor, which is proportional to the amount of displacement. In this work, to simulate a vertical displacement, $T = [0, 0, 1]$.
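The change of reference system itself is a one-line operation on the unit-sphere points. The sketch below assumes an additive displacement, $M' = M + \rho \cdot T$, followed by a renormalisation so that the moved points lie on the unit sphere again; the renormalisation step and the function name `shift_crs` are assumptions of this sketch. The calibration function and the mirror projection are not modelled.

```python
import numpy as np

def shift_crs(points, rho, t=(0.0, 0.0, 1.0)):
    """Apply M' = M + rho * T to unit-sphere points and renormalise them."""
    moved = np.asarray(points, dtype=float) + rho * np.asarray(t, dtype=float)
    return moved / np.linalg.norm(moved, axis=-1, keepdims=True)

# Two sample points on the unit sphere: one on the horizon and one on the vertical axis.
points = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
moved = shift_crs(points, rho=0.4)

# Elevation (deg) of the horizon point after the simulated vertical displacement.
elevation_deg = float(np.degrees(np.arcsin(moved[0, 2])))
```

With $\rho = 0.4$ the horizon point moves to an elevation of $\arctan(0.4)$, about 21.8 deg, while the point on the vertical axis does not change direction. Sweeping $\rho$ over a range of values and reprojecting the moved points generates the family of simulated images that the method compares with the test image.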
Using the new coordinates of the projection of the point onto the unit sphere M , the corresponding point on the mirror Q can be obtained and projected onto the new image plane, where the new coordinates respect the IRS will be m . Repeating this operation for all the pixels of the original image, the result will be the new omnidirectional image after the simulated movement. After this process, some pixels of the original image may lay out the new image plane and some pixels of the new image may be empty. In this case, the value of these pixels is estimated as the average value of its 8 nearest neighbours.
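The filling of the empty pixels can be sketched as shown below. This is a minimal sketch, not the implementation used in the experiments. It assumes a single-channel image and a boolean mask that marks the empty pixels. It also assumes that neighbours which are themselves empty are left out of the average and that border pixels use the neighbours that exist. The name `fill_empty_pixels` is hypothetical.

```python
import numpy as np

def fill_empty_pixels(image, empty_mask):
    """Replace each empty pixel with the mean of its valid 8-neighbours."""
    image = np.asarray(image, dtype=float)
    empty_mask = np.asarray(empty_mask, dtype=bool)
    filled = image.copy()
    rows, cols = image.shape
    for r, c in zip(*np.nonzero(empty_mask)):
        # 3x3 window around the pixel, clipped at the image border.
        r0, r1 = max(r - 1, 0), min(r + 2, rows)
        c0, c1 = max(c - 1, 0), min(c + 2, cols)
        window = image[r0:r1, c0:c1]
        valid = ~empty_mask[r0:r1, c0:c1]  # the empty centre is excluded here
        if valid.any():
            filled[r, c] = window[valid].mean()
    return filled

# Toy example: 3x3 image whose centre pixel is empty after the simulated movement.
img = np.arange(1, 10, dtype=float).reshape(3, 3)
mask = np.zeros((3, 3), dtype=bool)
mask[1, 1] = True
result = fill_empty_pixels(img, mask)
```

Averages are always taken from the original image, so the result does not depend on the order in which the empty pixels are visited.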
Once the new omnidirectional image after the artificial movement has been calculated, different projections can be obtained. Specifically, in this work, the orthographic view, the panoramic image and the unit sphere projection are considered. Figure 9 shows the simulated vertical movement of the catadioptric system. On the one hand, Figure 9a shows the projection of a world point P onto the original image plane, m (plane of projection 1), and onto the new image plane after the simulated vertical movement, m′ (plane of projection 2), obtained using epipolar geometry. F_1 and F_2 are the focal points of the hyperbolic mirror before and after the movement, and F′_1 and F′_2 are the focal points of the camera. On the other hand, Figure 9b shows two sample omnidirectional images and panoramic projections considering the original image plane (ρ = 0) and the new image plane after the simulated movement (ρ = 0.4).
In order to estimate the relative height between two scenes, the algorithm simulates several displacements of the reference image im R by giving different values to ρ, and compares each of them with the test image im T , considering ρ = 0, i.e., without CRS movement, and using global appearance. Finally, the algorithm selects the best match. The coefficient ρ associated with this match is a topological measurement of the vertical distance between images. In the experiments, the coefficient ρ of the reference image will be given values between −0.3 and 0.3.

Method 5: Matching of SURF Features
The last method makes use of local features extracted from the omnidirectional scenes to estimate topological height. SURF features are used with this aim [45]. This method has been introduced for comparative purposes, since using local features is a mature approach to solve the localisation problem.
The method starts by extracting and describing the SURF features of the reference and test omnidirectional scenes. After that, a matching process is carried out; the points of the test image are matched with the points of the reference image. Considering a purely vertical movement, the points of the test image will move along the radial direction, towards the centre of the image if the movement is upwards and towards the periphery if the movement is downwards.
Taking this fact into account, the matching process can be optimized by searching for the possible match along the radial line associated with each point in the test image. Figure 10a shows two sample omnidirectional images captured indoors and superimposed, with a relative vertical movement between them. The SURF points have been extracted from the reference image (red points) and from the test one (green crosses), described, and matched (yellow lines show the matches). This figure shows a number of outliers (those matches which are not produced in the radial direction). Figure 10b shows the same two images, but the matches of the test image SURF points have been searched along the radial line in the reference image. In these sample images, the local features of the test image tend to be closer to the centre of the image. This means that the robot has moved upwards to capture the test image with respect to the reference one.
If P^R = {p^R_1, p^R_2, ..., p^R_n} and P^T = {p^T_1, p^T_2, ..., p^T_n} are, respectively, the sets of SURF points extracted and matched from the reference and the test image (where the point p^R_j matches the point p^T_j), then the average distance between each pair of matched points (Figure 10b) can be considered as a topological measurement of the height difference between images:

$$d_1 = \frac{1}{n} \sum_{i=1}^{n} \mathrm{dist}\{p^T_i, p^R_i\} \qquad (14)$$

where dist{p^T_i, p^R_i} is the Euclidean distance between p^T_i and p^R_i. This distance is considered positive when p^T_i is closer to the centre than p^R_i and negative otherwise. This indicator is expected to provide a topological estimation of height. Furthermore, its linearity will depend on how linear the average displacement of the corresponding SURF points is when the omnidirectional vision system changes its altitude.
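The signed average distance described above can be sketched in a few lines. The sketch assumes that the matched points are already available as two arrays of pixel coordinates with the same ordering and that the centre of the omnidirectional image is known. The function name `average_radial_displacement` and the sample coordinates are hypothetical.

```python
import numpy as np

def average_radial_displacement(ref_points, test_points, centre):
    """Signed mean distance between matched points.

    A pair counts as positive when the test point is closer to the image
    centre than the reference point, and as negative otherwise.
    """
    ref_points = np.asarray(ref_points, dtype=float)
    test_points = np.asarray(test_points, dtype=float)
    centre = np.asarray(centre, dtype=float)
    pair_dist = np.linalg.norm(test_points - ref_points, axis=1)
    r_ref = np.linalg.norm(ref_points - centre, axis=1)
    r_test = np.linalg.norm(test_points - centre, axis=1)
    sign = np.where(r_test < r_ref, 1.0, -1.0)
    return float(np.mean(sign * pair_dist))

# Two hypothetical matches that moved 4 and 3 pixels towards the centre.
d1 = average_radial_displacement([(10, 0), (0, 20)], [(6, 0), (0, 17)], (0, 0))
```

In this toy case both test points are closer to the centre, so the indicator is positive, which corresponds to an upward movement according to the convention in the text.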
A popular alternative to obtain more robust results from the corresponding landmarks is the use of RANSAC (RANdom SAmple Consensus) [46]. An example of application in mobile robotics can be found in [47]. In this paper, we also propose using RANSAC to estimate the relative topological altitude. Initially, a random subset of matched points is used to obtain an estimation of the relative altitude, d_2, calculated using Equation (14). After that, the rest of the matched points are used to corroborate this estimation. A pair of matched points corroborates the estimation if the distance between them is equal to d_2 plus or minus a specific threshold. After repeating this process a number of times with different initial subsets of matched points, the estimation d_2 which is corroborated by the highest number of matched points is considered a measurement of the relative altitude.
During the experiments, both methods will be considered. First, the average distance between all the matched points will be calculated (the estimated height is d_1). Second, the RANSAC-based method will be considered (d_2). The linearity of both methods will be assessed using a variety of images captured both indoors and outdoors.
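The consensus procedure can be sketched as follows, starting from the signed distance of every matched pair. This is a simplified illustration and not the authors' implementation: the subset size, the threshold, the number of iterations and the function name `ransac_height` are all hypothetical choices.

```python
import numpy as np

def ransac_height(signed_distances, subset_size=3, threshold=1.0, iterations=200, seed=0):
    """Return the estimate supported by the largest number of matched pairs."""
    d = np.asarray(signed_distances, dtype=float)
    rng = np.random.default_rng(seed)
    best_estimate, best_support = None, -1
    for _ in range(iterations):
        # Random subset of matches gives a candidate estimate (mean distance).
        subset = rng.choice(len(d), size=subset_size, replace=False)
        estimate = d[subset].mean()
        # The remaining matches corroborate the estimate if they agree
        # with it up to the threshold.
        rest = np.ones(len(d), dtype=bool)
        rest[subset] = False
        support = int(np.sum(np.abs(d[rest] - estimate) <= threshold))
        if support > best_support:
            best_estimate, best_support = float(estimate), support
    return best_estimate, best_support

# Six consistent matches (about 5 pixels) and two hypothetical outliers.
distances = [5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 30.0, -20.0]
d2, support = ransac_height(distances)
```

A plain average of these eight values would be pulled away from 5 pixels by the two outliers, whereas the consensus estimate stays close to the value shared by the consistent matches.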

Sets of Images
Several complete sets of images captured both indoors and outdoors have been used to test the five methods proposed in the previous section. We captured them inside and in the surroundings of the Innova building at Miguel Hernandez University (Spain), and they are available from [48]. In this section, the main features of these sets of images are presented.
The catadioptric vision sensor used consists of an Imaging Source DFK 21BF04 color camera with a resolution of 1280 × 960 pixels and a hyperbolic mirror, model Eizoh Wide 70. Table 1 shows the main specifications of the mirror. Additional information on the mirror can be found in [49]. The camera has been adapted to a tripod that permits capturing images with a range of 165 cm along the z-axis of the WRS. Also, the mirror has been mounted above the camera, with their axes aligned. The distance between the focal points of the mirror and the camera is equal to 65 mm. Figure 11 shows the equipment used. The calibration of the camera has provided the function f(a) (Equation (15)). As stated in Section 2, this function permits obtaining the coordinates of the retroprojection of each pixel of the image onto the hyperbolic mirror with respect to the CRS. Here, a represents the distance in pixels between the pixel considered and the center of the omnidirectional image.
Table 2 shows the z-coordinate of each capture point and the number of images captured from each z-coordinate, considering both the outdoor and the indoor database. Figure 12 shows the ground positions above which each set of images has been captured. For each omnidirectional image, a cylindrical and an orthographic projection have been calculated, with 256 × 1024 and 256 × 256 pixels respectively.
Regarding the choice of the capture positions, on the one hand, the outdoor images were captured both close to and far from buildings, in a parking area and gardens. Some sample omnidirectional images captured outdoors, from different positions, are shown in Figure 13. On the other hand, indoor scenes have been captured in several rooms of the building, including a laboratory, the hall and common areas. These rooms present very different visual appearances. Figure 14 presents 3 scenes of different indoor locations.
The images have been captured at different times of the day in order to include different lighting conditions. Also, although the coordinates (x, y) of each set and the orientation of the system around the z-axis are considered to be constant, the nature of the acquisition system has introduced small position and orientation changes between capture points. There are two main phenomena that produce these changes. On the one hand, the camera may suffer a small swing angle, due to the bending of the tripod. On the other hand, when changing the height of the tripod, the camera may experience a small change of orientation around the z-axis. In our experiments, the maximum value of both angles is equal to 3 deg. This way, it constitutes a challenging database that permits testing the algorithms under real working conditions. The whole set of images and more information on it are available from [48]. To see the effect that a height change has on the images, Figure 15 shows three images captured above the same position but with different z-coordinates. As the height increases, most of the visual information corresponds to the ceiling.

Experiments and Results
This section presents the results of the comparative analysis of the methods we propose to estimate relative altitude. First, the configuration of the experiments is detailed. Second, the results are shown and analysed.

Configuration of the Experiments
In the previous section, five methods are proposed to estimate the altitude using visual information. On the one hand, methods 1 and 2 make use of the panoramic image, method 3 employs the orthographic view, method 4 can use any of the three projections (panoramic, orthographic or unit sphere projection) and, finally, method 5 uses the omnidirectional scene. On the other hand, different appearance descriptors can be used to describe each kind of image projection. Taking this into account, a total of 12 combinations of method, image projection and descriptor are considered during the experiments, to test their feasibility. Table 3 shows the configuration of each combination, specifying the height estimation method, the image projection, the description method and the final measurement of topological relative height obtained.

Table 3. Combinations of height estimation methods, kind of image projection and description method considered to carry out the experiments. The final topological height measurement is also shown.

| Height Estimation Method | Image Projection | Descriptor | Height Indicator |
|---|---|---|---|
| Central Cell Correlation | Panoramic Image | FS | d (pixels) |
| Central Cell Correlation | Panoramic Image | 2D-DFT | d (pixels) |
| Multiscale Analysis | Orthographic | | |
| Matching Local Features | Omnidirectional Scene | SURF | d_1 (pixels) |
| Matching Local Features | Omnidirectional Scene | SURF-RANSAC | d_2 (pixels) |

As far as the choice of the reference and test images is concerned, the following three conditions have been considered in order to broaden the scope of the experiments:
(c1) The bottom image of each set (h = 1 in Table 2) is considered to be the reference image, and the rest of the images of each set are considered as test images. Since each test image presents a different altitude with respect to the reference image in each set, this situation allows us to analyse the linearity of the estimated relative altitude versus the actual relative altitude.
(c2) The image captured at h = 5 (intermediate position, equivalent to z = 185 cm according to Table 2) is considered to be the reference image, and the rest of the images of each set are considered as test images. This permits studying the behaviour of the methods when estimating both positive and negative relative altitudes and analysing the symmetry of this behaviour.
(c3) Different reference images and altitude gaps are considered. This permits assessing the behaviour of the algorithms independently of the image chosen as reference and of the altitude gap. For each set of images, we carry out as many comparisons as possible considering different images as reference. For example, considering a gap ∆h = 2, equivalent to 30 cm, we compare the first image with the third, the second with the fourth, and so on, until all the experiments that the range of heights permits have been carried out. Table 4 shows the number of experiments for each height gap and data set in this condition. All these experiments are carried out with both positive and negative relative heights.
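The enumeration of image pairs in condition (c3) can be illustrated with a short sketch. The number of capture heights per set is given in Table 2 and is not reproduced here, so the value used below is purely hypothetical, as is the function name `pairs_for_gap`.

```python
def pairs_for_gap(n_heights, gap):
    """List the (reference, test) height indices separated by `gap` levels.

    Heights are indexed from 1, as in the text (h = 1 is the bottom image).
    Swapping each pair gives the experiments with negative relative height.
    """
    return [(h, h + gap) for h in range(1, n_heights - gap + 1)]

# Hypothetical set with 5 capture heights and a gap of 2 levels (30 cm).
pairs = pairs_for_gap(5, 2)                  # first with third, second with fourth, ...
negative_pairs = [(t, r) for r, t in pairs]  # same comparisons, opposite sign
```

The number of available comparisons decreases as the gap grows, because fewer pairs fit inside the range of heights of each set.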

Results
This subsection presents the results of the experiments. First, the computational cost of the altitude estimation process is shown. Second, the accuracy of the methods to estimate the altitude is analysed, according to the experimental configurations explained above.
To study the computational cost, the experiments have been carried out using Matlab running on a 2.4 GHz Intel Core i5 processor. Table 5 shows the results. We analyse separately the time needed to describe a reference image (column t_Ref in the table) and the time needed to describe a test image, compare it with the reference image and estimate the relative height (column t_Test in the table). In general, in the case of global-appearance methods, t_Ref tends to be higher than t_Test because describing the reference image implies using different cells in method 1 (Central Cell Correlation), several scales in method 3 (Multiscale Analysis) and a number of artificial movements in method 4 (CRS Movement). On the one hand, methods 1 and 2 are the computationally lightest methods. Both of them need less than 0.08 s to describe a reference image and to estimate the relative height of a test image. On the other hand, methods 3 and 4 are more expensive when describing each reference image. Comparing both methods, t_Test is lower in the case of method 4, except when using the SFT to describe the images. The combination of method 4 with the SFT is, computationally, the most expensive global-appearance choice. Compared with the local-features method (method 5), all the global-appearance methods present a lower t_Test, except method 4 with the SFT.
In typical applications, a set of images is usually available initially to create a model of the environment. Using this model, the height of the robot can be estimated subsequently. This way, the description of the reference image is a process that can be carried out offline, during the creation of the model. Also, in this work we have given priority to precision over computational cost when describing the reference image. In a real application, the number of scales, cells or artificial movements could be reduced depending on the accuracy required. Once the model is available, during the localization process, the time needed to estimate the height is t_Test. This time is critical so that the robot can navigate in real time. Table 5 shows that all the algorithms proposed present a reasonable t_Test to be implemented in a real application. Most of this time is used to describe the test image. This implies that, once the test image has been described, comparing it with a number of reference images would be a very quick process.
Table 6 shows the size of the descriptor for every configuration analysed. The table shows separately the memory needed to store the descriptor of a reference image (Mem_Ref) and the descriptor of a test image (Mem_Test). In general, the size of the FS descriptor is similar to that of the SFT, and the 2D-DFT is the most compact descriptor. Global-appearance methods tend to produce more compact descriptors for the test image, compared to the local-features method.

Table 5. Computational cost to describe a reference image, t_Ref, and to describe a test image and estimate its relative height, t_Test.

| Height Estimation Method | Image Projection | Descriptor | t_Ref (s) | t_Test (s) |
|---|---|---|---|---|
| 2D-DFT Vertical Phase | Panoramic Image | 2D-DFT | 0.0032 | 0.0662 |

Table 6. Necessary memory to store all the necessary information from a reference image, Mem_Ref, and to store the information from a test image, Mem_Test.

| Height Estimation Method | Image Projection | Descriptor | Mem_Ref (KB) | Mem_Test (KB) |
|---|---|---|---|---|
| 2D-DFT Vertical Phase | Panoramic Image | 2D-DFT | 8 | 8 |

After studying the computational cost, the accuracy in height estimation is analysed in the next paragraphs. The results are represented graphically using the average value of the estimations versus the actual height of the test image h_test, according to Table 2. Standard deviation bars are also shown on each data point.
Figure 16 presents the results obtained outdoors in conditions (c1) and (c2) (considering h = 1 and h = 5 as reference images, blue continuous and green dashed curves respectively). In all cases, the height measurement increases monotonically as the actual height of the test image rises. Moreover, when the height of the test image is below the reference image (in the case of h_ref = 5), the indicators are negative. When the actual height difference is lower than ∆h = 3 (∆z = 45 cm), the standard deviation in all cases is low enough to determine the relative height between images unequivocally. In general, the variance of the results is high for test images that were captured far from the reference image. This effect is especially pronounced when using panoramic images, both with the central cell correlation (Figure 16a,b) and with the CRS movement methods (Figure 16f,g). A high variance in the results means that the indicator is not reliable in those height gap ranges. By contrast, the configurations that use orthographic projections tend to present a relatively low variance.

The results of the 2D-DFT Vertical Phase method (Figure 16c) present a high variance for those test images which are distant from the reference image. It is important to highlight that, as presented in Section 3.2, this method is based on the DFT shift theorem, which assumes that a pure circular shift of rows occurs. However, when the visual system moves vertically, new information is introduced and some existing information disappears. This produces a difference in the 2D-DFT coefficients that implies an intrinsic error in the vertical phase estimation. As the height gap between images increases, this error tends to be more significant.
As far as description methods are concerned, in the case of panoramic and orthographic projections, the results do not present remarkable differences between the FS and the 2D-DFT, although the variance is slightly lower using the first descriptor. Regarding the unit sphere projection, no comparison can be carried out, as the only method to describe this projection is the SFT. Its results show a clear linear tendency, whereas the deviation of the results is high for vertical gaps higher than 60 cm.
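The assumption behind the DFT shift theorem can be illustrated with a small numerical sketch. It is not the implementation described in Section 3.2: it only shows that, for a pure circular shift of rows, the shift can be recovered from the phase of the first vertical frequency. The random test image and the function name `estimate_row_shift` are hypothetical.

```python
import numpy as np

def estimate_row_shift(reference, shifted):
    """Estimate a circular row shift from the phase of the 2D-DFT.

    For shifted[x, y] = reference[(x - s) mod N, y], the shift theorem gives
    G[u, v] = F[u, v] * exp(-2j * pi * u * s / N), so the phase difference at
    u = 1 is -2 * pi * s / N.
    """
    F = np.fft.fft2(reference)
    G = np.fft.fft2(shifted)
    n_rows = reference.shape[0]
    # Accumulate the cross-spectrum over all horizontal frequencies at u = 1.
    cross = np.sum(G[1, :] * np.conj(F[1, :]))
    return -np.angle(cross) * n_rows / (2 * np.pi)

rng = np.random.default_rng(1)
panorama = rng.random((64, 256))          # stand-in for a panoramic image
moved = np.roll(panorama, 5, axis=0)      # pure circular shift of 5 rows
shift = estimate_row_shift(panorama, moved)
```

With a real vertical movement the rows that leave the image are not the ones that enter it, so the relation above only holds approximately, which is consistent with the intrinsic error discussed in the text.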
In conclusion, the techniques that use the orthographic view tend to present a quite linear behaviour with a relatively low variance in outdoor environments. Also, no relevant differences can be found in the linearity of global-appearance methods compared with local-features methods. However, some global-appearance configurations present a lower deviation.
Figure 17 shows the results obtained indoors in conditions (c1) and (c2) (considering h = 1 and h = 5 as reference images, blue continuous and green dashed curves respectively). To allow a homogeneous comparison, the same scale is used in the equivalent subfigures in Figures 16 and 17. Compared to the outdoor results, the sensitivity of the height indicators is higher, as the slope of the curves is greater, especially when using the panoramic projection. The main reason for this behaviour is the relative distance of the elements with respect to the catadioptric vision sensor. Indoors, the elements of the environment are generally closer to the catadioptric system than outdoors. For this reason, when the height of the visual system changes, the distribution of the elements in the image suffers a greater variation, since the angle of incidence of the rays that represent the objects also varies more. Figure 18 presents the variation of the angle of incidence of two world points P_1 and P_2 whose distance to the vision system is different, when the height of this system changes. The figure shows that α_1, which represents the change of the angle of projection of the closest point (P_1), is higher than the change for P_2 (α_2). Moreover, the objects in an indoor scene generally present a greater range of distances with respect to the visual system. For that reason, when the catadioptric system moves vertically, the objects contained in the scenes experience movements of different magnitude depending on their distance to the catadioptric system.
Figure 19 shows the panoramic view of two images captured from the same ground point but with different heights. We can observe that the element highlighted in red suffers a greater height variation in the scene (h′_1 − h_1) compared to the object highlighted in green (h′_2 − h_2), which is farther from the sensor. Also, when the camera is near the ceiling, there is a loss of visual information, as the ceiling is more present in the scene.
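The dependence of the angular change on the distance to the object can be checked with a simple geometric sketch. It ignores the mirror and treats the sensor as a single viewpoint, which is a simplification made only for illustration. The distances and heights are hypothetical values in metres, and `elevation_change` is a hypothetical name.

```python
import math

def elevation_change(distance, point_height, cam_height_1, cam_height_2):
    """Change in the elevation angle of a world point when the camera
    moves vertically from cam_height_1 to cam_height_2 (angles in radians)."""
    a1 = math.atan2(point_height - cam_height_1, distance)
    a2 = math.atan2(point_height - cam_height_2, distance)
    return abs(a2 - a1)

# Same point height and same 30 cm vertical movement, two horizontal distances.
alpha_near = elevation_change(1.0, 1.50, 1.25, 1.55)  # object 1 m away
alpha_far = elevation_change(5.0, 1.50, 1.25, 1.55)   # object 5 m away
```

For the same vertical movement, the closer point yields the larger angular change, which agrees with the behaviour that Figure 18 illustrates for P_1 and P_2.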
Therefore, the results obtained indoors (Figure 17) show that the linear trend tends to degrade and the standard deviation tends to be higher for the greater height gaps when using the panoramic image. This behaviour is especially noticeable when the 2D-DFT Vertical Phase method is used (Figure 17c). The techniques based on the orthographic projection again present a quite linear behaviour with a relatively low deviation, since they gather elements that are located at a similar distance from the camera (mainly the floor plane). The use of local features presents the same general problems (lack of linearity and relatively high deviation for large altitude gaps).
In condition (c3), with the experiments listed in Table 4, the results again show a quite linear behaviour, which is similar, in several cases, to the behaviour of local-features methods, and the ability of the algorithms to distinguish between positive and negative displacements. It is worth highlighting the results of the orthographic projection because they present, in general, a quite linear behaviour with a relatively low standard deviation.
In general, outdoor experiments present a more linear tendency than indoor ones. Also, the height indicators present clearly higher absolute values when working with indoor images, except for methods based on the orthographic projection. Multiscale analysis techniques and the CRS applied to the orthographic projection are the techniques that present the smallest difference between indoor and outdoor images, and the standard deviation of their results is relatively low.

Conclusions
In this work, five methods to estimate the height of a mobile platform have been proposed and a comparative evaluation has been carried out. All the methods use only omnidirectional images captured by a catadioptric vision sensor mounted on the platform. Four of them are based on global appearance, and an additional method based on local features has been included for comparative purposes. An exhaustive set of experiments has been carried out to test the validity of each approach, using challenging sets of images captured both indoors and outdoors.
Next, we enumerate the principal conclusions obtained from the results:
• All the methods proposed are able to detect the relative height between the capture points of two images captured along a vertical line, dealing successfully with small displacements in the floor plane and small changes in the orientation of the visual system produced during the capture.
• Some of the indicators present a quite linear tendency. In general, this linear tendency is clearer when using images captured outdoors.
• The sign of the indicators provides information about the direction of the vertical movement. A negative sign indicates that the test image is below the reference image.
• In some cases, the results present a relatively high standard deviation, mainly when the height gap between the reference and the test images increases. In general, this effect is more clearly noticeable indoors.
• Techniques based on the orthographic projection of the omnidirectional images present the most linear behaviour and the lowest deviation, especially with the method based on the Camera Reference System (CRS) movement. This way, a larger working range can be obtained with this method.
• The different techniques rely on the movement of the scene objects to estimate the relative height. Since this movement is quantitatively higher indoors, the indicators obtained with this database present, in general, higher absolute values. As the orthographic projection mainly gathers the floor information, the methods based on this projection present less difference between indoor and outdoor scenes. Therefore, the magnitude of the indicators based on this projection is less dependent on the capture environment. This is an additional advantage of this kind of projection, especially when using the CRS method along with the FS.
• In the indoor environment, the slope of some indicators tends to decrease as the height increases. This happens mainly in the methods based on multiscale analysis and on CRS movement. Also, the effect is more pronounced when the reference image is h_ref = 1, which means estimating higher height gaps. The effect shown in Figure 19 may have an influence on this behaviour: the objects in the scene experience movements of different magnitude as the height of the camera changes, and this effect will be more pronounced for the higher height gaps, leading to a loss of linearity in these cases.
• When compared to methods based on local features, only the global-appearance methods that make use of the panoramic image have shown relatively worse results (as they present a higher standard deviation in most cases). The other global-appearance methods prove to be an efficient alternative to local features, considering both their computational cost and the linearity and standard deviation of the results.
In the light of the above, this work has demonstrated the possibility of using descriptors based on global appearance to carry out height estimation with accuracy. The results would permit the development of integrated visual navigation systems in the future. If a robot has 4 DOF ((x, y, z) and a change of orientation θ around the z-axis), first, an estimation of the coordinates x, y, and θ could be carried out using the principles presented in [26]. Once this position and orientation are known, the methods presented in this paper could be used to estimate topologically the relative height z. The methods based on the orthographic projection, using both the Multiscale Analysis and the CRS Movement, have presented the best performance. It is interesting to highlight that, although the height estimators calculated cannot be considered metric estimators, they go beyond the classical concept of topological distance, because they contain not only connection or neighborhood relations but also give an idea of closeness to or farness from the reference image. This is an interesting conclusion, as it would permit using these estimators to build hybrid maps of a large environment, including height information, using the methods explained in [34].
Future work will focus on this mapping line, since having complete maps of an environment would be very useful in many applications in which the robot has to estimate its position with more than 3 DOF. Also, these methods may be complemented to estimate the pose of the mobile platform when it has 6 DOF, using a single initial model that contains information on all the necessary DOF. For this purpose, the SFT seems to be the most suitable descriptor, thanks to its invariance against rotations around any axis, and a deeper experimentation with this method could be carried out.