#### 4.1.1. Human Eye Fixation Estimation

The saliency maps used in this work are generated by a GAN [47]. The generated saliency maps derive from human eye fixation points and thus make the significance of a region in a scene more intuitive. Such information can be exploited for the obstacle detection procedure and, at the same time, enhance the intuitiveness of the methodology. Additionally, the machine learning aspect enables the extensibility of the methodology, since it can be trained with additional eye fixation data collected from individuals during their navigation through rough terrains. An example of the saliency maps estimated from a given image can be seen in Figure 4. Since the model is trained on human eye-fixation data, it identifies as salient those regions of the image on which the attention of a human would be focused. As can be observed in Figure 4, in the first image the most salient region corresponds to the fire extinguisher cabinet; in the second image, to the people on the left side; and in the last image, to the elevated ground and the tree branch.

The GAN training utilizes two different CNN models, namely, a discriminator and a generator. During training, the generator learns to produce imagery related to a task, while the discriminator assists in optimizing the resemblance of the generated imagery to the target images. In our case, the target data are composed of visual saliency maps based on human eye tracking data.

The generator architecture is a VGG-16 [40] encoder-decoder model. The encoder follows an architecture identical to that of VGG-16 without the fully connected layers and is used to create a latent representation of the input image. The encoder weights are initialized by training the model on the ImageNet dataset [48]. During training, the encoder weights were not updated, with the exception of the last two convolutional blocks.

The decoder has the same architectural structure as the encoder network, except that the layers are placed in reverse order and the max pooling layers are replaced with up-sampling layers. To generate the saliency map, the decoder has an additional 1 × 1 convolutional layer at the output, with sigmoidal activation. The decoder weights were initialized randomly. The generator accepts an RGB image I_{RGB} as stimulus and generates a saliency map that resembles the human eye fixation on that I_{RGB}.

The discriminator of the GAN has a simpler architecture. The discriminator model consists of 3 × 3 convolutional layers, combined with 3 max pooling layers, followed by 3 Fully Connected (FC) layers. The Rectified Linear Unit (ReLU) and hyperbolic tangent (tanh) functions are deployed as activation functions for the convolutional and FC layers, respectively. The only exception is the last FC layer, where the sigmoid activation function is used. The architecture of the GAN generator network is illustrated in Figure 5.
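The encoder-decoder layout described above can be sketched in PyTorch. The `TinySaliencyGenerator` below is a reduced, illustrative stand-in with assumed channel widths and only two convolutional blocks, not the paper's full VGG-16 configuration or its trained weights:

```python
import torch
import torch.nn as nn

# Hedged sketch: a VGG-style encoder (convolutions + max pooling, no FC
# layers) mirrored by a decoder in which pooling is replaced by up-sampling,
# ending in a 1x1 convolution with sigmoid activation that emits the
# saliency map. Depth and channel widths are illustrative assumptions.
class TinySaliencyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid(),  # 1x1 conv + sigmoid output
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

saliency = TinySaliencyGenerator()(torch.zeros(1, 3, 64, 64))
```

The sigmoid keeps every output pixel in [0, 1], so the map can later be treated directly as fuzzy membership values.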

#### 4.1.2. Uncertainty-Aware Obstacle Detection

In general, an object that interferes with the safe navigation of a person can be perceived as salient. Considering this, the location of an obstacle is likely to be in regions of a saliency map that indicate high importance, i.e., regions with high intensities. A saliency map produced by the model described in Section 4.1.1 can be treated as a weighted region of interest in which an obstacle may be located. High-intensity regions of such a saliency map indicate a high probability of the presence of an object of interest. Among all the salient regions in the saliency map, we need to identify those regions that may pose a threat to the person navigating in the scenery depicted in I_{RGB}. Thus, we follow an approach where both a saliency map and a depth map deriving from an RGB-D sensor are used for the risk assessment. The combination of the saliency and depth maps is achieved with the utilization of Fuzzy Sets [49].

For assessing the risk, it can be easily deduced that objects/areas that are close to the VCP navigating in an area and are salient with regard to the human gaze may pose a certain degree of threat to the VCP. Therefore, as a first step, the regions that are within a certain range of the navigating person need to be extracted, so that they can be flagged as threatening. Hence, we consider a set of 3 fuzzy sets, namely, R_{1}, R_{2}, and R_{3}, describing three different risk levels, which can be expressed with the linguistic values of high, medium, and low risk, respectively. The fuzzy sets R_{1}, R_{2}, and R_{3} each represent a different degree of risk, and their universe of discourse is the range of depth values of a depth map. Regarding the fuzzy aspect of these sets, and taking into consideration the uncertainty in the risk assessment, there is an overlap between the fuzzy sets describing low and medium risk, as well as between those describing medium and high risk. The fuzzy sets R_{1}, R_{2}, and R_{3} are described by the membership functions r_{i}(z), i = 1, 2, 3, where z ∈ [0, ∞). The membership functions are illustrated in Figure 6c.
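The paper shows r_{i} only graphically (Figure 6c), so the sketch below uses assumed trapezoidal shapes with assumed breakpoints in meters; it is meant only to illustrate the overlap between adjacent risk sets, not the calibrated functions:

```python
import numpy as np

# Hedged sketch: trapezoidal membership functions for the risk fuzzy sets
# R1 (high), R2 (medium), R3 (low) over depth z in meters. The breakpoints
# (1, 2, 3, 4 m) are illustrative assumptions, not the paper's values.
def r1(z):  # high risk: full membership up to 1 m, fading out by 2 m
    return np.interp(z, [1.0, 2.0], [1.0, 0.0])

def r2(z):  # medium risk: overlaps r1 on [1, 2] and r3 on [3, 4]
    return np.interp(z, [1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 0.0])

def r3(z):  # low risk: rises from 3 m, full membership beyond 4 m
    return np.interp(z, [3.0, 4.0], [0.0, 1.0])

z = 1.5  # a depth inside the high/medium overlap region
print(r1(z), r2(z), r3(z))
```

A depth in an overlap region belongs partially to two sets at once, which is exactly how the uncertainty of the risk boundary is encoded.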

A major aspect of an obstacle detection methodology is the localization of obstacles and the description of their position in a manner that can be communicated to, and easily perceived by, the user. In our system, the description of the spatial location of an object is performed using linguistic expressions. We propose an approach based on fuzzy logic to interpret the obstacle position using linguistic expressions (linguistic values) represented by fuzzy sets. Spatial localization of an obstacle in an image can be achieved by defining 8 additional fuzzy sets. More specifically, we define 5 fuzzy sets for the localization along the horizontal axis of the image, namely, H_{1}, H_{2}, H_{3}, H_{4}, and H_{5}, corresponding to the far left, left, central, right, and far right portions of the image. Additionally, to express the location of the obstacle along the vertical axis of the image, we define 3 fuzzy sets, namely, V_{1}, V_{2}, and V_{3}, denoting the upper, central, and bottom portions of the image. The respective membership functions of these fuzzy sets are h_{j}(x), j = 1, 2, 3, 4, 5 and v_{i}(y), i = 1, 2, 3, where x, y ∈ [0, 1] are normalized image coordinates. An illustration of these membership functions can be seen in Figure 6.

Some obstacles, such as tree branches, may be in close proximity to the individual with respect to depth, but at a height at which safe passage would not be affected. Thus, a personalization step was introduced to the methodology to eliminate such false alarms. The personalization aspect and the minimization of false positive obstacle detection instances are implemented through an additional fuzzy set P, addressing the risk an obstacle poses to a person with respect to height. For the description of this fuzzy set P, we define a two-dimensional membership function p(h_{o}, h_{u}), where h_{o} and h_{u} are the heights of the obstacle and the user, respectively. The personalization methodology is described in Section 4.1.3.

For the risk assessment, once the membership functions describing each fuzzy set have been defined, the next step is the creation of 3 risk maps, ${R}_{M}^{i}$. The risk maps ${R}_{M}^{i}$ derive from the responses of a membership function, r_{i}(z), and are formally expressed as:

$${R}_{M}^{i}(x,y)={r}_{i}\left(D(x,y)\right),\quad i=1,2,3 \quad (1)$$

where D is a depth map that corresponds to an RGB image I_{RGB}. Using all the risk assessment membership functions, namely, r_{1}, r_{2}, and r_{3}, 3 different risk maps, ${R}_{M}^{1}$, ${R}_{M}^{2}$, and ${R}_{M}^{3}$, are derived. Each of these risk maps depicts regions that may pose a different degree of risk to the VCP navigating in the area. In detail, risk map ${R}_{M}^{1}$ represents regions that may pose a high degree of risk, ${R}_{M}^{2}$ a medium degree of risk, and, finally, ${R}_{M}^{3}$ a low degree of risk. A visual representation of these maps can be seen in Figure 7.

Figure 7b,c illustrates the risk maps derived from the responses of the r_{1}, r_{2}, and r_{3} membership functions on the depth map of Figure 7a. Brighter pixel intensities represent higher participation in the respective fuzzy set, while darker pixel intensities represent lower participation.
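Building the risk maps amounts to applying each membership function pixel-wise to the depth map. A minimal numpy sketch, reusing assumed (illustrative, not calibrated) breakpoints in meters:

```python
import numpy as np

# Hedged sketch: each risk map is the pixel-wise response of a membership
# function on the depth map D. The breakpoints are illustrative assumptions.
def risk_maps(D):
    R1 = np.interp(D, [1.0, 2.0], [1.0, 0.0])                      # high risk
    R2 = np.interp(D, [1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 0.0])  # medium risk
    R3 = np.interp(D, [3.0, 4.0], [0.0, 1.0])                      # low risk
    return R1, R2, R3

D = np.array([[0.5, 2.5],
              [3.5, 6.0]])  # toy 2x2 depth map in meters
R1, R2, R3 = risk_maps(D)
```

Because `np.interp` broadcasts over the array, the three maps keep the depth map's shape and can be displayed directly as the grayscale images of Figure 7.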

In the proposed methodology, the obstacle detection is a combination of the risk assessed from the depth maps and the degree of saliency obtained from the GAN described in the previous subsection. The saliency map S_{M} that is produced from a given I_{RGB} is aggregated with each risk map ${R}_{M}^{i}$, where i = 1, 2, 3, using the fuzzy AND (∧) operator (Gödel t-norm) [50], formally expressed as:

$$\left({F}_{1}\wedge {F}_{2}\right)(x,y)=\mathrm{min}\left({F}_{1}(x,y),{F}_{2}(x,y)\right) \quad (2)$$

In Equation (2), F_{1} and F_{2} denote two generic 2D fuzzy maps with values within the [0, 1] interval, and x, y are the coordinates of each value of the 2D fuzzy map. The risk maps ${R}_{M}^{i}$ are, by definition, fuzzy 2D maps, since they derive from the responses of the membership functions r_{i} on a depth map. The saliency map S_{M} can be considered a fuzzy map whose values represent the degree of participation of a given pixel in the salient domain. Therefore, they can be combined with the fuzzy AND operator to produce a new fuzzy 2D map ${O}_{M}^{i}$ as follows:

$${O}_{M}^{i}(x,y)=\left({S}_{M}\wedge {R}_{M}^{i}\right)(x,y),\quad i=1,2,3 \quad (3)$$

The non-zero values of the 2D fuzzy map ${O}_{M}^{i}$ (obstacle map) at each coordinate (x, y) indicate the location of an obstacle and express the degree of participation in the risk domain of the respective ${R}_{M}^{i}$. Figure 8d illustrates the respective ${O}_{M}^{i}$ produced using the fuzzy AND operator with the three ${R}_{M}^{i}$. Higher pixel values of the ${O}_{M}^{i}$ portray higher participation in the respective risk category and a higher probability of an obstacle being located there.
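Since the Gödel t-norm of Equation (2) is simply the pixel-wise minimum, the aggregation can be sketched in a few lines of numpy with toy maps (the values below are illustrative, not taken from the paper):

```python
import numpy as np

# Fuzzy AND (Goedel t-norm) of two 2D fuzzy maps: the pixel-wise minimum.
def fuzzy_and(F1, F2):
    return np.minimum(F1, F2)

# Toy saliency map S_M and high-risk map R_M1, both valued in [0, 1].
S_M  = np.array([[0.9, 0.1],
                 [0.8, 0.0]])
R_M1 = np.array([[1.0, 1.0],
                 [0.2, 0.7]])

# Obstacle map: a pixel is kept only where it is both salient and risky.
O_M1 = fuzzy_and(S_M, R_M1)
print(O_M1)
```

A pixel survives the aggregation only to the degree it belongs to *both* the salient domain and the risk domain, which is the intended conjunction semantics.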

Theoretically, the ${O}_{M}^{i}$ can be directly used to detect obstacles posing different degrees of risk to the VCP navigating in the area. However, if the orientation of the camera is towards the ground, the ground plane can often be falsely perceived as an obstacle. Consequently, a refinement step is needed to optimize the obstacle detection results and reduce the occurrence of false alarms. Therefore, a simple but effective approach for ground plane extraction is adopted.

The ground plane has a distinctive gradient representation along the Y axis in depth maps, which can be exploited in order to remove it from the ${O}_{M}^{i}$. As a first step, the gradient of the depth map D along the y direction is estimated by:

$$\frac{\partial D}{\partial y}(x,y)=D(x,y+1)-D(x,y) \quad (4)$$

A visual representation of a normalized difference map $\frac{\partial D}{\partial y}$ in the [0, 255] interval can be seen in Figure 9. As can be seen, the regions corresponding to the ground have smaller differences than the rest of the depth map. In the next step, a basic morphological gradient g [51] is applied on $\frac{\partial D}{\partial y}$. The basic morphological gradient is the difference between the dilation and the erosion of $\frac{\partial D}{\partial y}$ given an all-ones kernel k_{5×5}:

$$g\left(\frac{\partial D}{\partial y}\right)={\delta}_{{k}_{5\times 5}}\left(\frac{\partial D}{\partial y}\right)-{\varepsilon}_{{k}_{5\times 5}}\left(\frac{\partial D}{\partial y}\right) \quad (5)$$

where δ and ε denote the operations of dilation and erosion, and their subscripts indicate the kernel used. In contrast to the usual gradient of an image, the basic morphological gradient g corresponds to the maximum variation in an elementary neighborhood rather than a local slope. The morphological gradient is followed by consecutive operations of erosion and dilation with a kernel k_{5×5}. As can be noticed in Figure 9c, the basic morphological filter g gives higher responses on non-ground regions, and thus the subsequent operations of erosion and dilation are able to eliminate the ground regions quite effectively. The product of these consecutive operations is a ground removal mask G_{M}, which is then multiplied with ${O}_{M}^{i}$, setting the values corresponding to the ground to zero. This ground removal approach has been experimentally proven sufficient (Section 5) to eliminate the false identification of the ground as an obstacle. A visual representation of the ground mask creation and the ground removal can be seen in Figure 9 and Figure 10, respectively.
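The ground-removal steps can be sketched in pure numpy: vertical depth differences, then a morphological gradient (dilation minus erosion with an all-ones kernel). The fixed threshold below is an illustrative assumption standing in for the paper's erosion/dilation clean-up:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Greyscale dilation/erosion with a k x k all-ones kernel (edge padding).
def dilate(a, k):
    w = sliding_window_view(np.pad(a, k // 2, mode="edge"), (k, k))
    return w.max(axis=(2, 3))

def erode(a, k):
    w = sliding_window_view(np.pad(a, k // 2, mode="edge"), (k, k))
    return w.min(axis=(2, 3))

# Hedged sketch: vertical differences, morphological gradient, then a
# binary mask that is 1 on non-ground-like regions. The threshold value
# is an illustrative assumption, not the paper's clean-up procedure.
def ground_mask(D, k=5, thresh=0.05):
    dDdy = np.diff(D, axis=0)              # row-to-row depth differences
    g = dilate(dDdy, k) - erode(dDdy, k)   # morphological gradient
    return (g > thresh).astype(float)      # 1 where the scene is not ground

# Toy depth map: a smooth ground ramp with a constant row-to-row difference,
# so the morphological gradient vanishes and nothing is flagged.
D = np.tile(np.linspace(3.0, 1.0, 8)[:, None], (1, 8))
G_M = ground_mask(D)
```

Multiplying each ${O}_{M}^{i}$ by such a mask zeroes the pixels that behave like a smooth ground plane while keeping abrupt depth structures.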

Once the obstacle map of the depicted scene has been estimated following the process described above, the next step is the spatial localization of the obstacles in linguistic values. This step is crucial for the communication of the surroundings to a VCP. For this purpose, Fuzzy Sets are utilized in this work. As presented earlier in this section, 5 membership functions are used to determine the location of an obstacle along the horizontal axis (x-axis) and 3 along the vertical axis (y-axis).

Initially, the boundaries of the obstacles depicted in the obstacle maps need to be determined. For the obstacle detection task, the ${O}_{M}^{1}$ obstacle map, through which the high-risk obstacles are represented, is chosen. Then, the boundaries b_{l}, where l = 1, 2, 3, …, of the obstacles are calculated using the border-following methodology presented in [52]. Once the boundaries of each probable obstacle depicted in ${O}_{M}^{1}$ are acquired, their centers c_{l} = (c_{x}, c_{y}), l = 1, 2, 3, … are derived by exploiting the properties of the image moments [53] of the boundaries b_{l}. The centers c_{l} can be defined using the raw moments m_{00}, m_{10}, and m_{01} of b_{l} as follows:

$${c}_{l}=\left(\frac{{m}_{10}}{{m}_{00}},\frac{{m}_{01}}{{m}_{00}}\right) \quad (6)$$

$${m}_{qk}=\sum_{(x,y)\in {b}_{l}}{x}^{q}{y}^{k} \quad (7)$$

where q = 0, 1, 2, …, k = 0, 1, 2, …, and x, y denote image coordinates along the x-axis and y-axis, respectively. An example of the obstacle boundary detection can be seen in Figure 11, where the boundaries of the obstacles are illustrated with green lines (Figure 11b) and the centers of the obstacles are marked with red circles (Figure 11c).

Once the centers have been calculated, their location can be determined and described with linguistic values using the horizontal and vertical membership functions, h_{j}, where j = 1, 2, 3, 4, 5, and v_{i}, where i = 1, 2, 3. If the response of h_{j}(c_{x}) and v_{i}(c_{y}) is greater than 0.65, then the respective obstacle with boundary center c_{l} = (c_{x}, c_{y}) is described with the linguistic values that these h_{j} and v_{i} represent. Additionally, the distance between the object and the person is estimated using the depth value of the depth map D at the location D(c_{x}, c_{y}). Using this information, the VCP can be warned regarding the location and distance of the obstacle and, by extension, be assisted in avoiding it.
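The center computation from raw boundary moments and the 0.65-threshold labeling rule can be sketched as follows. The triangular membership shapes and their peak positions are illustrative assumptions (the paper's exact h_{j}, v_{i} are given only graphically in its Figure 6):

```python
import numpy as np

# Center of a boundary b (array of (x, y) points) from raw moments:
# m00 counts the points, m10/m01 sum the x/y coordinates.
def center_from_boundary(b):
    m00 = len(b)
    m10, m01 = b[:, 0].sum(), b[:, 1].sum()
    return m10 / m00, m01 / m00

H_LABELS = ["far left", "left", "central", "right", "far right"]
H_PEAKS = [0.0, 0.25, 0.5, 0.75, 1.0]   # assumed peaks on normalized x
V_LABELS = ["upper", "central", "bottom"]
V_PEAKS = [0.0, 0.5, 1.0]               # assumed peaks on normalized y

# Assumed triangular memberships; a label is emitted only above 0.65.
def linguistic(c, peaks, labels, width=0.25, thresh=0.65):
    resp = [max(0.0, 1.0 - abs(c - m) / width) for m in peaks]
    best = int(np.argmax(resp))
    return labels[best] if resp[best] > thresh else None

# Square boundary centered at (0.5, 0.5) in normalized image coordinates.
b = np.array([[0.4, 0.4], [0.6, 0.4], [0.6, 0.6], [0.4, 0.6]])
cx, cy = center_from_boundary(b)
print(linguistic(cx, H_PEAKS, H_LABELS), linguistic(cy, V_PEAKS, V_LABELS))
```

Returning `None` below the threshold mirrors the paper's rule that only confident memberships are verbalized to the user.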

#### 4.1.3. Personalized Obstacle Detection Refinement

The obstacle map depicts probable obstacles that are salient for humans and within a certain range. However, this can lead to false positive indications, since some obstacles, such as tree branches, can be within a range considered threatening, but at a height greater than that of the user, thus not affecting his/her navigation. False positive indications of this nature can be avoided using the membership function p(h_{o}, h_{u}). To use this membership function, the 3D points of the scene need to be determined by exploiting the intrinsic parameters of the camera and the provided depth map.

To project 2D points into 3D space in the metric system (meters), we need to know the corresponding depth value z for each 2D point. Based on the pinhole model, which describes the geometric properties of our camera [54], the projection of a 3D point onto the 2D image plane is described as follows:

$$\tilde{u}=f\frac{X}{z},\quad \tilde{v}=f\frac{Y}{z} \quad (8)$$

where f is the effective focal length of the camera, and ${(X,Y,z)}^{T}$ is the 3D point corresponding to a 2D point ${\left(\tilde{u},\tilde{v}\right)}^{T}$ on the image plane. Once the projected point ${\left(\tilde{u},\tilde{v}\right)}^{T}$ is acquired, the transition to pixel coordinates ${(x,y)}^{T}$ is described by the following equation:

$$x={s}_{u}{D}_{u}\tilde{u}+{x}_{0},\quad y={D}_{v}\tilde{v}+{y}_{0} \quad (9)$$

where s_{u} denotes a scale factor; D_{u}, D_{v} are coefficients needed for the transition from metric units to pixels, and ${({x}_{0},{y}_{0})}^{T}$ is the principal point of the camera. With the combination of Equations (8) and (9), the projection describing the transition from 3D space to the 2D image pixel coordinate system can be expressed as:

$$x=f{D}_{u}{s}_{u}\frac{X}{z}+{x}_{0},\quad y=f{D}_{v}\frac{Y}{z}+{y}_{0} \quad (10)$$

The 3D projection of a 2D point with pixel coordinates (x, y), for which the depth value z is known, can be performed by solving Equation (10) for X and Y, formally expressed below [55]:

$$X=\frac{z\left(x-{x}_{0}\right)}{{f}_{x}},\quad Y=\frac{z\left(y-{y}_{0}\right)}{{f}_{y}} \quad (11)$$

where f_{x} = fD_{u}s_{u} and f_{y} = fD_{v}. Equation (11) is applied to all the 2D points of I_{RGB} with known depth values z. After the 3D points have been calculated, the Y coordinates are used to create a 2D height map H_{M} of the scene, where each value is a Y coordinate indicating the height of the object at the corresponding pixel coordinate in I_{RGB}. Given the height h_{u} of the user, we apply the p membership function on the height map H_{M} to assess the risk with respect to the height of the user. The responses of p on H_{M} create a 2D fuzzy map P_{M}, as shown below:

$${P}_{M}(x,y)=p\left({H}_{M}(x,y),{h}_{u}\right) \quad (12)$$
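The back-projection of Equation (11) and the resulting height map can be sketched with numpy; the intrinsic values (f_{x}, f_{y}, principal point) below are illustrative assumptions, not a real camera calibration:

```python
import numpy as np

# Hedged sketch: back-project every pixel with a known depth to 3D and keep
# the Y coordinate as a height map H_M. Intrinsics are assumed values.
def height_map(D, fx, fy, x0, y0):
    h, w = D.shape
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    X = D * (x - x0) / fx   # X component of the back-projection
    Y = D * (y - y0) / fy   # Y component: the per-pixel height coordinate
    return Y

# Toy 4x4 depth map at a constant 2 m, with assumed intrinsics.
D = np.full((4, 4), 2.0)
H_M = height_map(D, fx=500.0, fy=500.0, x0=2.0, y0=2.0)
```

Note that only the Y component is needed for the height map; the X component is computed solely to show the full inverse of the pinhole projection.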

Finally, the fuzzy AND operator is used to combine ${O}_{M}^{i}$ with P_{M}, resulting in a final personalized obstacle map ${O}_{P}^{i}$:

$${O}_{P}^{i}(x,y)=\left({O}_{M}^{i}\wedge {P}_{M}\right)(x,y),\quad i=1,2,3 \quad (13)$$

Non-zero values of ${O}_{P}^{i}$ represent the final location of a probable obstacle with respect to the height of the user and the degree of participation in the respective risk degree; i.e., the fuzzy AND operation between ${O}_{M}^{1}$ and P_{M} describes the high-risk obstacles in the scenery.
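The personalization membership p and the final fuzzy AND can be sketched as follows. The specific shape of p (full risk at or below the user's height, fading out over a 0.3 m margin above it) is an illustrative assumption, since the paper defines p only abstractly:

```python
import numpy as np

# Hedged sketch: p keeps full risk for obstacle heights h_o at or below the
# user's height h_u and fades to zero over an assumed 0.3 m margin above it.
def p(h_o, h_u, margin=0.3):
    return np.clip(1.0 - (h_o - h_u) / margin, 0.0, 1.0)

# Toy height map (meters above ground) and high-risk obstacle map.
H_M  = np.array([[2.2, 1.0],
                 [0.5, 1.9]])
O_M1 = np.array([[0.9, 0.8],
                 [0.7, 0.6]])

P_M  = p(H_M, h_u=1.8)           # personalization fuzzy map
O_P1 = np.minimum(O_M1, P_M)     # fuzzy AND: personalized obstacle map
print(O_P1)
```

The branch at 2.2 m, well above the 1.8 m user, is suppressed entirely, while the obstacle at 1.9 m is only attenuated, reflecting the graded uncertainty of the height threshold.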