Using Deep Learning for Visual Navigation of Drone with Respect to 3D Ground Objects

Abstract: In the paper, visual navigation of a drone is considered. The drone navigation problem consists of two parts. The first part is finding the real position and orientation of the drone. The second part is finding the difference between the desirable and real position and orientation of the drone and creating the corresponding control signal to decrease this difference. For the first part of the drone navigation problem, the paper presents a method for determining the coordinates of the drone camera with respect to known three-dimensional (3D) ground objects using deep learning. The algorithm has two stages, which makes it easy for an artificial neural network (ANN) to interpret and consequently increases its accuracy. At the first stage, we use the first ANN to find the coordinates of the object origin projection. At the second stage, we use the second ANN to find the drone camera position and orientation. The algorithm has high accuracy (the errors were found for the validation set of images as differences between the positions and orientations obtained from a pretrained artificial neural network and the known positions and orientations), and it is not sensitive to interference associated with changes in lighting, the appearance of external moving objects, and other phenomena where other methods of visual navigation are not effective. For the second part of the drone navigation problem, the paper presents a method for stabilization of drone flight controlled by an autopilot with a time delay. Indeed, image processing for navigation demands a lot of time and results in a time delay. However, the proposed method allows us to obtain stable control in the presence of this time delay.


Introduction
The drone navigation problem [1][2][3][4][5][6][7][8][9][10][11] appeared during the development of systems for autonomous navigation of drones. The question arose about the possibility of determining the coordinates of the drone based on the image of the runway or the airfield control tower, or any characteristic building along the route of the drone.
However, the task of determining the coordinates of the camera with respect to three-dimensional (3D) object images can also be used for other conditions: indoor navigation or inside city navigation. Such navigation programs can be used both in tourist guide devices and in control of autonomous robots.
Usually, GPS (Global Positioning System) [1] and inertial navigation systems [2] are used to solve navigation problems. However, each of these navigation methods has its own drawbacks. GPS signals can be unavailable or unreliable, for example near tall buildings, indoors, or under jamming. Inertial systems tend to accumulate errors in the coordinates and angles of a drone during its flight.
In the original AlexNet, there are only eight layers (five convolutional and three fully connected). It is possible to load a version of the network pretrained on more than a million images from the ImageNet database [12][13][14][15]. The pretrained network can classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 227 × 227 pixels.

Materials and Methods
There is a series of interesting papers [5,6] where the authors used deep learning and convolutional neural networks to determine camera position and orientation. These papers are pioneering work in the application of deep learning techniques to navigation.
Unfortunately, this technology has been used only for ground cameras, which can be mounted on ground robots and ground vehicles or carried by pedestrians. More specifically, in their work [5,6], the authors determined the coordinates and angles of the camera using images of buildings from the Cambridge Landmarks dataset, filmed with a smartphone camera by pedestrians.
Unlike References [5,6], where the authors used only a ground camera, the current paper considers a flying camera. Therefore, we are able to analyze the applicability of deep learning technology for a much wider range of angles and coordinates.
In References [5][6][7][8][9], Cartesian coordinates and quaternions (logarithms of unit quaternions in Reference [10]) were used to describe the camera position and orientation. We used a description of the camera position and orientation that is more natural for images. It allows us to get better precision for the camera position and orientation, since such a description provides data which is easier and more natural for the artificial neural network to understand.
In more detail, we considered the problem in which the camera can be located at a great distance (up to 1000 m) from the ground object landmark. Therefore, our method involves two stages: determining the position of the object on the image, and then determining the position and orientation of the camera using only a narrow neighborhood of the object.
This position and orientation of the camera is described (i) by two angles describing the camera optical axis rotation with respect to the ray (connecting the object with the camera optical center), (ii) by the distance between the camera optical center and the object, and (iii) by quaternions, describing two angles of the ray and one angle of the camera rotation around the ray.
For the two stages, as described above, we used the two artificial neural networks, which are built on the basis of AlexNet [12][13][14][15] with small changes in structure and training method.

Deep Learning for Localization: Training and Validation of Dataset Formation
We considered the problem of determining three coordinates and three angles of rotation of the camera using the artificial neural network.
In the game application Unity, a model scene was created, containing an airport control tower in the middle of an endless green field (Figure 2). In real life, any object is surrounded by dozens of others (houses, trees, roads, etc.), which give us additional information for orientation. We decided to make the task minimally dependent on additional landmarks, so that the neural network would only use the control tower building for navigation. Obviously, such a solution of a minimalistic task can be easily transferred to an environment where the landmark is surrounded by many other objects.
The base of the landmark building (referred to as the object in the remainder of the paper) was located at the origin of the ground coordinate system, and the virtual camera was randomly placed above ground level at a distance of 200 to 1000 m from the object origin. The camera was always oriented so that the landmark building was completely included in the frame. In total, 6000 images were produced at a pixel size of 2000 × 2000: 5000 for training and 1000 for validation (Figure 3). All information about the coordinates of the camera in space and the location of the reference object on the image was automatically collected into a table, which was then used for training and validation of the neural network.
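The random camera placement described above can be sketched as follows. This is our own reconstruction, not the authors' Unity code: the function name, the axis convention (z up, camera OY along the image height), and the zenith-angle bounds are assumptions.

```python
import numpy as np

def sample_camera_pose(r_min=200.0, r_max=1000.0, rng=np.random):
    """Sample a camera position above ground, 200-1000 m from the object
    origin, together with an orientation that looks at the origin."""
    r = rng.uniform(r_min, r_max)            # distance to the object origin
    phi = rng.uniform(0.0, 2.0 * np.pi)      # azimuthal angle
    theta = rng.uniform(0.05, 0.5 * np.pi)   # zenith angle, kept above ground
    # Position in Cartesian ground coordinates (z is up).
    pos = np.array([r * np.sin(theta) * np.cos(phi),
                    r * np.sin(theta) * np.sin(phi),
                    r * np.cos(theta)])
    # Look-at orientation: the camera OZ axis points from the camera to the origin.
    forward = -pos / np.linalg.norm(pos)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))  # camera OX axis
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)          # camera OY axis (image height)
    # Rows are the camera OX, OY, OZ axes expressed in ground coordinates.
    R = np.stack([right, down, forward])
    return pos, R
```

In camera coordinates, the object origin then lies exactly on the optical axis, i.e., R @ (origin - pos) points along the camera OZ axis.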

Two Stages and Two Artificial Neural Networks (ANN) for Finding the Drone Camera Position and Orientation
Since the obtained images have pixel size 2000 × 2000, which is too large for processing by a convolutional neural network, we split the task into two stages.
At the first stage, we use the first ANN to find coordinates u and v, describing the position of the object origin projection, on the image with reduced pixel size 227 × 227.
At the second stage, we use the second ANN to find the drone camera position and orientation (r, the distance between the camera optical center and the object origin; W_q, X_q, Y_q, Z_q, the quaternion components describing the camera orientation; and u_gom, v_gom, the coordinates of the object origin projection on the modified image) using the new modified image with pixel size 227 × 227, where the object origin projection is in the image center.

The First Stage of Drone Camera Localization
Network Architecture and Its Training at the First Stage

Let us describe the first stage here. We solve the task of finding the horizontal and vertical coordinates u and v of the object origin projection in a large image (Figure 4). These coordinates are equal to the pixel coordinates x_pix and y_pix of the object origin projection, taken relative to the center of the image and divided by the focal length of the camera:

u = x_pix / F, v = y_pix / F,

where F is the focal length of the camera in pixels, calculated by the formula (from the right-angled triangle ABC in Figure 4b)

F = (Width / 2) · ctg(α / 2),

where Width is the width of the image in pixels and α is the camera field of view in the x and y directions. For the calculation of u and v, we used an artificial neural network built on the basis of AlexNet [12][13][14][15] with small changes in the structure and training method.
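The projection-coordinate computation above can be sketched numerically as follows. This is our own illustration: the 2000-pixel image size is from the paper, but the 40° field of view and the function names are assumed values.

```python
import math

def focal_length_pixels(width_px, fov_rad):
    """F = (Width / 2) * ctg(alpha / 2), from the right-angled triangle
    formed by the optical center and the image half-width."""
    return (width_px / 2.0) / math.tan(fov_rad / 2.0)

def normalized_projection(x_px, y_px, width_px, height_px, fov_rad):
    """Pixel coordinates of the object origin projection, taken relative to
    the image center and divided by the focal length in pixels."""
    F = focal_length_pixels(width_px, fov_rad)
    u = (x_px - width_px / 2.0) / F
    v = (y_px - height_px / 2.0) / F
    return u, v

# Example: a 2000 x 2000 image with an assumed 40-degree field of view.
F = focal_length_pixels(2000, math.radians(40.0))
u, v = normalized_projection(1200, 950, 2000, 2000, math.radians(40.0))
```

The image center maps to (u, v) = (0, 0), so u and v directly encode the viewing direction toward the object origin.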
For our tasks, we removed the last two fully connected layers and replaced them with one fully connected layer with 2 outputs (the u and v coordinates) and one regression layer. That is, we used the pretrained AlexNet network to solve a regression problem: determining the two coordinates of the camera (Figure 5).
We reduced all images to pixel size 227 × 227 and used them for neural network training. For the training solver, we used "adam" with 200 epochs, batch size 60, and an initial learning rate of 0.001, halving the learning rate after every 20 epochs.
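The head replacement described above can be sketched in PyTorch (the paper does not name its framework, so this is our own sketch; the convolutional backbone is omitted, and only the AlexNet-shaped fully connected head is shown):

```python
import torch
import torch.nn as nn

# A stand-in for the pretrained AlexNet fully connected head; the flattened
# output of the AlexNet convolutional backbone has 9216 features.
original_head = nn.Sequential(
    nn.Linear(9216, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),             # original 1000-class output
)

# Remove the last two fully connected layers and attach one fully connected
# layer with 2 outputs (the u and v coordinates); the MSE loss plays the
# role of the regression layer.
regression_head = nn.Sequential(
    *list(original_head.children())[:2],   # keep the first FC layer + ReLU
    nn.Linear(4096, 2),
)
loss_fn = nn.MSELoss()

features = torch.randn(4, 9216)            # a dummy batch of CNN features
pred = regression_head(features)           # shape: (4, 2)
loss = loss_fn(pred, torch.zeros(4, 2))
```

In practice the kept layers would carry the pretrained ImageNet weights, and only the new 2-output layer would start from random initialization.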
We have the training set (used for the artificial neural network (ANN) training) and the validation set for the verification of the pretrained ANN.
Firstly, all errors were found for the training set of images as differences between the two angles (φ_u and φ_v) obtained from the pretrained artificial neural network and the two known angles used for creation of the images by the Unity program. The root mean square of these errors is shown in the first row of Table 1. Secondly, the same errors were found for the validation set of images; their root mean square is shown in the second row of Table 1.
The error for finding the object origin projection coordinates u and v can be found in Table 1. These errors correspond to the errors of the two angles (φ_u and φ_v), as demonstrated in Figure 6.
In this way, we achieved the purpose of the first-stage program: to find the position of the object origin projection on the large image.
Homography Transformation for Translation of the Object Origin Projection to the Image Center

After finding coordinates u and v, we can accomplish a homography transformation, corresponding to camera rotation by angle φ_u in the horizontal direction and camera rotation by angle φ_v in the vertical direction. After this transformation, the object origin projection moves to the image center (Figure 6).
Assuming that the focal length is equal to 1, we find the following values for φ_u and φ_v:

tg φ_v = v, tg φ_u = u / √(1 + v²).

Finally, we get two rotation matrices, R_u and R_v, which describe the homographic transformation H (i.e., the linear transformation moving points with homogeneous coordinates (u_0, v_0, 1) to points (u_1, v_1, 1)) in such a way that the object origin projection point (u, v, 1) moves to the point (0, 0, 1) (Figure 7):

H = R_u R_v,

where R_v is the rotation around the OX axis by angle φ_v and R_u is the rotation around the OY axis by angle φ_u.
Indeed, H is a rotation matrix in the coordinate system of the camera (the center of coordinates O is at the optical center of the camera, the OZ axis is the direction of the optical axis of the camera, the OX axis is directed along the width of the image, and the OY axis is directed along the height of the image). The rotation H is composed of two rotations: the first rotation, R_v, is defined around the OX axis by angle φ_v, and the second rotation, R_u, is defined around the OY axis by angle φ_u (see Figure 6).
Let us consider a neighborhood with pixel size 227 × 227 of the point (0, 0, 1), because after the camera rotation, the object origin projection will be at this point. By applying the homography to any point of this neighborhood (i.e., by multiplying the point's homogeneous coordinates by R_u^T R_v^T), we can find its corresponding point on the initial image and the color values at this point.
In this way, we get a new set of images, where the optical axis is directed to the object origin and the projection of this origin is in the image center (Figure 7).
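The centering rotation described above can be sketched as follows. The sign conventions and exact angle formulas here are our own choice (the paper does not reproduce the matrices); what matters is that the composed rotation moves the viewing direction (u, v, 1) onto the optical axis.

```python
import numpy as np

def rot_x(phi):
    # Rotation around the OX axis (the image height direction).
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0,   c,  -s],
                     [0.0,   s,   c]])

def rot_y(phi):
    # Rotation around the OY axis (the image width direction).
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[  c, 0.0,  -s],
                     [0.0, 1.0, 0.0],
                     [  s, 0.0,   c]])

def center_homography(u, v):
    """Rotation H that moves the homogeneous point (u, v, 1) to (0, 0, 1),
    i.e., brings the object origin projection to the image center."""
    phi_v = np.arctan2(v, 1.0)               # first rotation, around OX
    phi_u = np.arctan2(u, np.hypot(1.0, v))  # second rotation, around OY
    return rot_y(phi_u) @ rot_x(phi_v)
```

To resample the modified image, each pixel of the 227 × 227 neighborhood of (0, 0, 1) is mapped back through the inverse rotation (the transpose of H) onto the initial image, where the color values are read.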

Finding Coordinates of the Object Origin Projection on the New Modified Images
Since u and v were found by the artificial neural network with some error, we can again find the real coordinates u_gom and v_gom of the object origin projection on the new modified image (obtained from the homography), which will be used for training the artificial neural network at the second stage.

Transformation of Camera Coordinates to Quaternion Form
After several trials using different coordinate systems (Cartesian, cylindrical, spherical) and different angle representations, we chose quaternions for describing angle values, which does not result in the appearance of special points (where the same camera position and orientation correspond to several different representations).
In the case when the camera is directed at the object origin, its position and orientation can be described by the distance to the object origin r and by two rotation angles of the vector connecting the object origin and the optical center of the camera, the azimuthal angle ϕ and the zenith angle θ, together with the camera rotation angle with respect to this vector, ω (Figure 8).

Respectively, we can find rotation matrices R_ϕ, R_θ, and R_ω, and the composite rotation matrix R_ϕθω = R_ω R_θ R_ϕ. By transforming this matrix to quaternion form and further normalizing it, we get four coordinates, W_q, X_q, Y_q, Z_q (the index q is added to avoid mistaking X_q, Y_q, and Z_q for the Cartesian coordinates of the camera). These quaternions give a full and unambiguous description of the camera orientation in space.
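The matrix-to-quaternion step can be sketched as follows. The rotation axes chosen for ϕ, θ, and ω here are our own assumption (the paper does not list them); the conversion itself is the standard trace-based formula, valid when the rotation angle is below 180°.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def matrix_to_quaternion(R):
    """Standard rotation-matrix-to-quaternion conversion (w, x, y, z),
    assuming trace(R) > -1; the result is normalized."""
    w = 0.5 * np.sqrt(max(0.0, 1.0 + R[0, 0] + R[1, 1] + R[2, 2]))
    x = (R[2, 1] - R[1, 2]) / (4.0 * w)
    y = (R[0, 2] - R[2, 0]) / (4.0 * w)
    z = (R[1, 0] - R[0, 1]) / (4.0 * w)
    q = np.array([w, x, y, z])
    return q / np.linalg.norm(q)

# Composite rotation: azimuth phi around OZ, zenith theta around OY, and
# camera roll omega around the resulting ray (assumed axes, for illustration).
phi, theta, omega = 0.4, 0.9, 0.1
R = rot_z(omega) @ rot_y(theta) @ rot_z(phi)
Wq, Xq, Yq, Zq = matrix_to_quaternion(R)
```

Normalizing the quaternion keeps the four network targets on a comparable scale and removes the overall sign/magnitude ambiguity of the raw conversion.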


The Artificial Neural Network for the Second Stage
At the second stage, we also used AlexNet with similar modifications, i.e., we removed the last two fully connected layers and replaced them with one fully connected layer with 7 outputs (r, the distance between the camera optical center and the object origin; W_q, X_q, Y_q, Z_q, the quaternion components describing the camera orientation; and u_gom, v_gom, the coordinates of the object origin projection on the modified image) and one regression layer.
Since the seven outputs of the fully connected layer have different scales (r changes from 200 to 1000 m, while the rest of the values change between 0 and 1), we multiplied W_q, X_q, Y_q, Z_q, u_gom, and v_gom by the scale coefficient µ = 1000 for better training.
For the training solver, we used "adam" with 200 epochs, batch size 60, and an initial learning rate of 0.001, halving the learning rate after every 20 epochs.
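The training settings above can be expressed, for instance, in PyTorch (a sketch: the paper does not name its framework, and `model` here is a placeholder for the modified AlexNet):

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 7)      # placeholder for the modified 7-output AlexNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Halve the learning rate after every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(200):
    # ... iterate over mini-batches of size 60 and apply the MSE loss here ...
    scheduler.step()
```

After 200 epochs the learning rate has been halved ten times, ending at 0.001 / 1024.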

Training Results
We have the training set (used for the ANN training) and the validation set for the verification of the pretrained ANN.
Firstly, all errors were found for the training set of images as differences between the positions and orientations obtained from the pretrained artificial neural network and the known positions and orientations used for creation of the images by the Unity program. The root mean square of these errors is shown in the first row of Table 2.
Secondly, the same errors were found for the validation set of images; their root mean square is shown in the second row of Table 2, which lists the errors of the coordinates and rotation angles for the drone camera.
As a result, we get the camera coordinates in space with a precision of up to 4 m and the camera orientation with a precision of up to 1°.

Automatic Control
Let us define the following variables and parameters used in the equations of motion for the drone (see Figure 9): V is the flight velocity tangent to the trajectory (with respect to air); H is the height of the drone flight above mean sea level; ϑ is the pitch angle, i.e., the angle between the longitudinal drone axis and the horizontal plane; α is the angle of attack, i.e., the angle between the longitudinal axis of the drone and the projection of the drone velocity on the symmetry plane of the drone. The drone flight is described by a system of nonlinear equations [16].
Indeed, image processing for visual navigation demands a lot of time and results in a time delay. We can see from Figure 10 that the time delay exists in the measurement block because of the large amount of time necessary for image processing in visual navigation. As a result, we also know the deviation with a time delay. However, the proposed method allows us to obtain stable control in the presence of this time delay. In Reference [16], we demonstrate that it is possible, even under such conditions with a time delay, to obtain a controlling signal providing a stable path.
The final solutions, and the calculation of the corresponding autopilot parameters, are given in Reference [16]. We will consider the steady-state solution as drone flight at constant velocity and height.
Let us consider small perturbations with respect to the steady state. The linear equations for the perturbations (given in Reference [16]) use the following notation: v(t) is the velocity perturbation, h(t) is the height perturbation, α(t) is the attack angle perturbation, ϑ(t) is the pitch angle perturbation, t is the time normalized by τ_a as defined above, δ_p(t − τ) and δ_B(t − τ) are the two autopilot control parameters, and τ is the time delay necessary for measurement of the perturbations and calculation of the control parameters. All derivatives d/dt are taken with respect to the normalized time. All n_ij and τ_a are constant numerical coefficients, whose numerical values are described below.
For the case when stationary parameters cannot provide stability of the desirable stationary trajectory themselves, we need to use an autopilot (Figure 10). The autopilot sets the control parameters δ_p(t − τ) and δ_B(t − τ) to be functions of the output controlled parameters (υ(t); h(t); α(t); ϑ(t)), which are the perturbations with respect to the desirable stationary trajectory. The autopilot can get the output parameters from measurements of different navigation devices: satellite navigation, inertial navigation, vision-based navigation, and so on. Using these measurements, the autopilot can create control signals to decrease undesirable perturbations. However, any navigation measurement has some time delay τ in obtaining the values of the output parameters controlled by the autopilot. As a result, a problem arises: some of the information necessary for control is not available in time.
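The effect of a measurement time delay on a feedback loop can be illustrated with a toy scalar analogue. This is our own illustration, not the drone model or the control law of Reference [16], and all numerical values are assumed.

```python
import numpy as np

# Toy delayed-feedback control: an unstable plant x' = a*x + u stabilized by
# the delayed feedback u(t) = -k * x(t - tau), simulated with the Euler method.
a, k = 0.5, 2.0            # plant instability and feedback gain (assumed values)
dt, tau = 0.01, 0.2        # time step and measurement time delay
delay_steps = int(round(tau / dt))

x = 1.0                                    # initial perturbation
buffer = [x] * delay_steps                 # measurements still "in the pipeline"
history = [x]
for _ in range(5000):
    u = -k * buffer.pop(0)                 # control uses a delayed measurement
    x = x + dt * (a * x + u)
    buffer.append(x)                       # the new measurement arrives later
    history.append(x)
```

For this choice of gain and delay the perturbation decays to zero, while increasing τ beyond a critical value would make the same loop unstable; this is why the time delay must be accounted for explicitly in the autopilot design.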
Figure 10. Automatic control: flying drones have output parameters (output of block 1) describing their position, orientation, and velocity. These parameters are measured and calculated by a measurement system with some time delay (output of block 2). These measured parameters and their desirable values, calculated from the steady-state trajectory (inputs of block 3), can be compared, and the deviation from the steady-state solution can be calculated (output of block 3). The autopilot gets these deviations and calculates control parameters for the drone to decrease these deviations. Then, the drone changes its output parameters (output of block 1). This cycle repeats during the whole flight.


Conclusions
The coordinates found (r, the distance between the camera optical center and the object origin; W_q, X_q, Y_q, Z_q, the quaternion components describing the camera orientation; u_gom and v_gom, the coordinates of the object origin projection on the new modified image; and u and v, the coordinates of the object origin projection on the initial image with reduced pixel size) unambiguously define the drone position and orientation (six degrees of freedom).
As a result, we have described a program which allows us to find the drone camera position and orientation with respect to the ground object landmark unambiguously and with high precision.
We can improve this result by increasing the number of images in the training set, by using an ANN with a more complex structure, or by considering a more complex and more natural environment of the ground object landmark.
The errors can also be improved by changing the network configuration (like in References [7-10], for example) or by using a set of images instead of a single image [8,10].
Errors (completely describing the camera position and orientation errors) were given in Tables 1 and 2. We do not need a separate object identification error here.
We chose our ground object and background to be strongly different, so the recognition quality was almost ideal. This is not because of the high quality of the deep learning network, but because we chose an object which was very different from the background. Indeed, such a clearly distinguishable object is needed for robust navigation.
Image processing for visual navigation demands a lot of time and results in a time delay. We can see from Figure 10 that the time delay exists in measurement blocks because of a large amount of time, which is necessary for image processing of visual navigation. As a result, we know the deviation also with time delay. However, the proposed method allowed us to get stable control in the presence of this time delay (smaller than 1.703 s).
We have not developed a real-time realization yet, so we could not provide in this paper a description of the hardware necessary for real-time work. It is a topic for future research.

Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.