Distance Measurement of Unmanned Aerial Vehicles Using Vision-Based Systems in Unknown Environments

Abstract: Localization of indoor aerial robots remains a challenging issue because global positioning system (GPS) signals often cannot reach inside buildings. In previous studies, navigation of mobile robots without GPS required building maps to be registered beforehand. This paper proposes a novel framework for indoor positioning of unmanned aerial vehicles (UAVs) in unknown environments using a camera. First, the UAV attitude is estimated to determine whether the robot is moving forward. Then, the camera position is estimated based on optical flow and the Kalman filter. Semantic segmentation using deep learning is carried out to find the position of the wall in front of the robot. The UAV distance is measured by comparing the image size ratio based on corresponding feature points between the current and reference wall images. The UAV is equipped with ultrasonic sensors to measure its distance from the surrounding walls. The ground station receives information from the UAV to show the obstacles around the UAV and its current location. The algorithm is verified by capturing images with distance information and comparing them with the current image and UAV position. The experimental results show that the proposed method achieves an accuracy of 91.7% and a processing rate of 8 frames per second (fps).


Introduction
Unmanned aerial vehicles (UAVs) are increasingly used in various fields for indoor [1] and outdoor [2,3] surveillance and investigation. Aerial surveillance [4] has the advantage of avoiding obstacles and uneven surfaces on the ground, and the positioning of the UAV mostly relies on the global positioning system (GPS) [2]. However, GPS signals can be easily disturbed and cannot reach some places, such as urban areas, mountains, forests, and buildings [5]. As a result, localizing UAVs without GPS remains a challenging task.
In simultaneous localization and mapping (SLAM) [6], the mapping system is a crucial component for UAV localization and navigation, such as point cloud maps [7] and occupancy maps [8]. Point cloud maps can be obtained by combining point measurements. However, this type of map is only suitable for high-precision sensors in static environments, because in the new environment, object mapping cannot be accessed and modified. The main limitation of occupancy maps is the fixed-size voxel grid that requires a map size that is known in advance and cannot be changed dynamically. Thus, these mapping methods cannot be used in a new and unknown environment. So, one of the more precise ways to find out the current position of the robot in a building is to calculate its current distance from the entrance.
Several methods have been proposed to determine the location of mobile robots by measuring distances using radio frequency (RF) and sensors. Ni et al. [9] and Guerrieri et al. [10] presented an indoor localization system using radio frequency identification (RFID). RFID-based localization uses RF tags placed on buildings as navigation waypoints and tracking tags that are attached to moving objects, so readers can track the objects in different locations. However, these methods require many expensive RFID readers, and the detection capability of each tag only works for about 6 m. A localization method based on the laser finder was proposed by Subramanian et al. [11] and Barawid et al. [12], mounted on a vehicle as a navigation sensor. The laser finder was used to obtain distance information to explore the surrounding environment and avoid obstacles. However, these methods have high hardware costs and a heavy load that is not suitable for UAVs. A Zigbee-based system for obtaining the location of the user and tracking them inside a building was introduced by Lin et al. [13]. Zigbee devices are set up beforehand in the building, and the target movement is assumed to be constant. Cheok et al. [14] developed a method for indoor positioning and navigation using light sensors. Still, this method requires installing fluorescent lamps in buildings and hardware for use by users or moving objects. Nakahira et al. [15] proposed the concept of distance measurement using the ultrasonic system to determine the position and orientation of mobile robots in a room. This method measures distance by processing the signal from the returning echoes of the acoustic pulse emitted into space. It is not suitable for long-distance measurement and is only used to avoid obstacles near the robots.
Previous studies on indoor positioning used a vision-based system to solve the localization problem of mobile robots. By using visual sensors, environmental information in the form of color, texture, and other visual information can be obtained more easily and accurately compared to the GPS, laser flashes, ultrasonic sensors, and other traditional sensors. In addition, visual sensors are also cheaper and easier to use, so vision-based navigation is one of the techniques that has continued to be developed recently. Kim et al. [16] used an augmented reality technology to provide location information in indoor environments. However, this method only recognizes a particular location by a marker that has a characteristic pattern. Li et al. [17] presented a localization method by distance measurement using a webcam placed inside a building. Due to the low-quality lens and brightness changes in the structure, the images taken had low quality. So, the camera needed calibration to avoid image distortion problems [18]. In [17], a mobile robot was first detected to determine its coordinates on the image. Then, the location of the robot was estimated based on the distance between the camera and the position of the wall near the robot. Shim et al.'s [19] approach used coordinate mapping to recognize robots in a building using multiple cameras at the same time. Lan et al. [20] conducted research on vision-based navigation schemes for UAVs based on mapped landmarks, where absolute positions of the landmark points are known.
The main purpose of this paper is to create a new framework to determine the location of a UAV based on distance measurements using a single camera. In this study, the camera mounted on the UAV transmits the image wirelessly to the ground station. This method is proposed as a solution for positioning and navigation of UAVs in indoor environments where the location map is not registered and without devices installed beforehand. The distance is measured by a size comparison between reference and current images. Then, the UAV movement is mapped in the user interface.
The remainder of this paper is organized as follows. Section 2 provides an explanation of the proposed material and the main algorithms. Section 3 provides performance results using videos captured from the UAV, supplemented by a discussion. Next, Section 4 summarizes the conclusions.

Materials
The experiments were carried out using Visual Studio on a 3.40 GHz CPU with 8 GB RAM. Aerial image sequences with a resolution of 720 × 480 were taken to implement the proposed method. Figure 1 shows a system overview of the UAV and the ground station. The UAV used in this system is a quadrotor with four rotors, and its radio receiver receives flight commands. The speed of each rotor is controlled via an electronic speed controller (ESC) that receives a signal from the processor. The XBee module on the onboard system sends UAV flight data to the ground station. An ultrasonic sensor is installed on each side of the quadrotor frame to detect obstacles during flight, and one sensor is mounted with the camera.

The Proposed Methods
The algorithm consists of several steps: UAV attitude estimation, camera position correction, semantic segmentation, and distance measurement. First, when the UAV has moved around 3 m from the starting point, the reference image I is captured. The distance of the UAV from the starting point is measured using ultrasonic sensors mounted with the camera. The UAV attitude is estimated to determine the current pitch angle of the UAV; a positive pitch angle means that the UAV is moving forward. Figure 2 shows an overview of the system proposed in this paper. Then, every 100 ms, the current image is captured and aligned based on the camera position estimation. Semantic segmentation is performed to locate the wall in front of the UAV in the current image. The first wall frame is saved as the reference frame. Then, the feature points in the reference and current images are found to obtain an affine transformation of the current image, from which a size comparison between the reference and current images can be obtained. If the reference image size is less than 55% of the current image, it is assumed that the UAV has moved 1 m forward, and the reference image is updated every 1 m. In this work, the UAV moves at a speed of around 1-1.5 m/s, and we assume that the UAV moves at a constant altitude and speed.

UAV Attitude Estimation
Figure 3 shows that the front and rear rotors rotate clockwise, and the others rotate counter-clockwise. A force $F_i$ is generated by each rotor $i$ and is used to calculate the Euler angles: roll $\phi$, pitch $\theta$, and yaw $\psi$. The main thrust $F_N$ and the control input, which depend on the rotor profile, are calculated as in [21]. $F_N$ is applied to the airframe, where $C_{th}$ is the thrust coefficient of each rotor, $\omega_i$ is the angular velocity of rotor $i$, and $C_d$ is the drag coefficient.
The gyroscope and accelerometer measure three angular rates and two angular positions ($\phi$ and $\theta$), while the magnetometer measures $\psi$. A nonlinear complementary filter on SO(3) [22] is used on each axis of the accelerometer and gyroscope to estimate the UAV attitude. Figure 4 shows the position and orientation of the UAV, where {E} is an arbitrary point of the space with a fixed inertial frame with x, y, and z axes; $l$ is the arm length of the UAV; $m$ is its mass; and $g$ is the gravitational acceleration. The dynamic model of the UAV follows [23,24], where $J_x$, $J_y$, and $J_z$ indicate the moments of inertia about the x, y, and z axes, respectively.
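The attitude step can be illustrated with a much simpler filter than the one used in the paper. The sketch below is a linear, single-axis complementary filter, not the nonlinear SO(3) filter of [22]. The function names, the blending weight `alpha`, and the sign convention for pitch are assumptions made for illustration only.

```python
import math


def accel_pitch(ax, ay, az):
    """Pitch angle (rad) implied by the gravity direction in an accelerometer reading.

    The sign convention (positive pitch for negative ax) is an assumption.
    """
    return math.atan2(-ax, math.sqrt(ay * ay + az * az))


def complementary_pitch(pitch_prev, gyro_rate, accel, dt, alpha=0.98):
    """Blend the integrated gyro rate with the accelerometer pitch.

    alpha weights the gyro path; (1 - alpha) weights the accelerometer path.
    """
    gyro_pitch = pitch_prev + gyro_rate * dt  # short-term estimate from the gyro
    return alpha * gyro_pitch + (1.0 - alpha) * accel_pitch(*accel)


def is_moving_forward(pitch):
    """Rule stated in the text: a positive pitch angle means forward motion."""
    return pitch > 0.0


pitch = complementary_pitch(0.0, gyro_rate=0.1, accel=(0.0, 0.0, 9.81), dt=0.1)
print(pitch, is_moving_forward(pitch))
```

The high weight on the gyro path keeps the estimate smooth over short intervals, while the small accelerometer weight slowly removes gyro drift.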

Camera Position Correction
This step estimates the camera position to align the current image. Figure 5 illustrates how the UAV movement affects the camera motion. The image motion corresponds to camera motion on the yaw, pitch, and roll axes of the UAV movement. An affine transformation is used to handle the rotation and translation of the images.
The optical flow method in [26] is used in this step to calculate the motion vectors of two consecutive frames. For each 10 × 10 sub-window, the flow of each pixel in the window is estimated by a polynomial in the local coordinate system (LCS) at I (Equation (4)), where $p$ is a vector, $A$ is a symmetric matrix, $b$ is a vector, and $c$ is a scalar. The LCS at I(t) is defined by Equation (5). Based on Equations (4) and (5), a new signal can be built at I(t) by a global displacement, which gives the relation between the LCS of the two input images (Equation (6)). Equating the coefficients of $b$ in Equations (5) and (6) (Equation (7)) yields the total displacement of the motion vectors in I(t) (Equation (8)). The displacement value in Equation (8) is the translation of the motion vectors, consisting of an x-axis component $\Delta_x(t)$ and a y-axis component $\Delta_y(t)$, from which its angular value is calculated (Equation (9)). Then, the translation $T_{x,y}(t)$ and rotation $\theta(t)$ of I(t) are obtained as the most frequent value of the motion vectors (Equations (10) and (11)), where $l$ is the lower motion vector value, $\tau$ is the size of the motion vector class interval, $f_1$ is the frequency of the modal class, $f_0$ is the frequency of the class preceding the modal class, and $f_2$ is the frequency of the class succeeding the modal class.
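The symbols l, τ, f0, f1, and f2 match the textbook formula for the mode of grouped data, mode = l + τ (f1 − f0) / (2 f1 − f0 − f2). Assuming that this is the intended estimator, a minimal sketch looks like this. The function name and the way the class intervals are laid out are hypothetical.

```python
import numpy as np


def grouped_mode(values, bin_width):
    """Mode of grouped data: l + tau * (f1 - f0) / (2*f1 - f0 - f2).

    values: non-empty 1-D collection of motion-vector components.
    bin_width: class interval size (tau in the text).
    """
    values = np.asarray(values, dtype=float)
    # Align class boundaries to multiples of bin_width (an illustrative choice).
    lo = np.floor(values.min() / bin_width) * bin_width
    n_bins = int(np.floor((values.max() - lo) / bin_width)) + 1
    edges = lo + bin_width * np.arange(n_bins + 1)
    counts, _ = np.histogram(values, bins=edges)

    k = int(np.argmax(counts))                    # index of the modal class
    f1 = counts[k]
    f0 = counts[k - 1] if k > 0 else 0            # class before the modal class
    f2 = counts[k + 1] if k < n_bins - 1 else 0   # class after the modal class
    lower = edges[k]                              # l: lower bound of the modal class
    return float(lower + bin_width * (f1 - f0) / (2 * f1 - f0 - f2))


dx = [0.1, 0.2, 1.1, 1.2, 1.3, 1.4, 2.5]
print(grouped_mode(dx, bin_width=1.0))
```

Using the mode rather than the mean keeps a few outlying motion vectors (for example, from moving objects) from pulling the estimated camera translation away from the dominant motion.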
In the next step, the translation and rotation obtained are compensated using a Kalman filter consisting of prediction and measurement parts. The initial state in the prediction step is defined by s(0) = [0, 0, 0], and the predicted state of the trajectory $\hat{s}(t)$ at I(t), with components for $T_x(t)$, $T_y(t)$, and $\theta(t)$, is estimated by Equation (12). The initial error covariance in the prediction step is defined by e(0) = [1, 1, 1], and the predicted error covariance is computed by
$$\hat{e}(t) = e(t-1) + Q_p \tag{13}$$
where $Q_p$ is the process noise covariance, set to 0.004. The Kalman gain is calculated by Equation (14), where $Q_m$ is the measurement noise covariance, set to 0.25. The error covariance is then compensated by Equation (15), and the trajectory is compensated in the new state $s(t)$, with the same three components, where the trajectory state at I(t) is calculated by Equation (16). The accumulation of the trajectory from each frame is measured by Equation (17), so a new trajectory is obtained by Equation (18), where the difference in x, y, and $\theta$ is denoted d. Finally, a new image plane is produced by applying the new trajectory in Equation (18) to align I(t) with the resulting transformation (Equation (19)).
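A scalar version of this filter, applied independently to each trajectory component, might look as follows. This is a sketch under assumptions: the gain and update steps use the standard scalar Kalman forms, which is an assumption about the paper's equations, and the function name is hypothetical. Only the initial values (state 0, error covariance 1) and the noise settings Q_p = 0.004 and Q_m = 0.25 come from the text.

```python
def kalman_smooth(measurements, q_p=0.004, q_m=0.25):
    """Smooth one trajectory component (x, y, or theta) with a scalar Kalman filter.

    q_p: process noise covariance (0.004 in the text).
    q_m: measurement noise covariance (0.25 in the text).
    """
    s, e = 0.0, 1.0            # initial state and error covariance from the text
    smoothed = []
    for z in measurements:
        # Prediction: constant-state model (assumed), covariance as in Equation (13).
        s_pred = s
        e_pred = e + q_p
        # Measurement update: standard scalar Kalman gain (assumed form).
        k = e_pred / (e_pred + q_m)
        s = s_pred + k * (z - s_pred)
        e = (1.0 - k) * e_pred
        smoothed.append(s)
    return smoothed


raw_x = [1.0, 1.2, 0.9, 1.1, 5.0, 1.0]   # accumulated x trajectory with one spike
print(kalman_smooth(raw_x))
```

With these noise settings the gain settles at a small value, so the filter follows slow drift in the trajectory while damping frame-to-frame jitter.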

Semantic Segmentation
We implemented a deep convolutional neural network (DCNN) [27] with ResNet-101 [28] for segmenting indoor scenes, trained on the ADE20K dataset. The model used is shown in Figure 6. To double the spatial density of the feature responses computed in the ResNet-101 network, the last pooling or convolutional layer that lowers the resolution is located. To determine the areas of floors, walls, roofs, and furniture in the room, we created labels for 27 classes so that obstacles and paths in front of the robot can be classified easily. Fully connected layers are transformed into convolutional layers with increased feature resolution so that the feature response can be computed every 8 pixels. Bilinear interpolation is performed to resize the score map to the original image resolution. Then, the input image is forwarded to a fully connected CRF [29] to refine the segmentation results.
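Once a per-pixel label map is available, keeping only the wall region is a simple masking step. The sketch below assumes the network output has already been reduced to an integer label map. The class id `WALL_ID` and the function name are hypothetical, since the text does not list the 27 class ids.

```python
import numpy as np

WALL_ID = 0  # hypothetical class id for "wall"


def wall_region(label_map, wall_id=WALL_ID):
    """Return a boolean wall mask and its bounding box (top, left, bottom, right).

    The bounding box uses exclusive bottom/right bounds; it is None if no wall
    pixel is present.
    """
    mask = np.asarray(label_map) == wall_id
    if not mask.any():
        return mask, None
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing wall pixels
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing wall pixels
    bbox = (int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1)
    return mask, bbox


labels = np.array([[1, 1, 1, 1],
                   [1, 0, 0, 1],
                   [1, 0, 0, 1],
                   [2, 2, 2, 2]])
mask, bbox = wall_region(labels)
print(bbox)
```

Restricting later feature extraction to the masked region is one way to keep plain floor pixels out of the matching step.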

Distance Measurement
This step estimates the position and relative size of I in I(t) based on an affine transformation [25,30]. The positions of features that are similar between I and I(t) are found using the scale-invariant feature transform (SIFT) [31]. SIFT is used as the feature extractor and descriptor in this method because it provides more invariance to illumination changes than SURF [32,33].
First, the image is converted to grayscale, and a median filter [34] is applied. Then, interest points are approximated using the Laplacian of Gaussian (LoG) in the scale-space images. The scale space is calculated as the convolution of the image with the Gaussian function as follows:
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \tag{20}$$
where $G(x, y, \sigma)$ is a scale-variable Gaussian defined as
$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)} \tag{21}$$
where $(x, y)$ are the spatial coordinates and $\sigma$ is the scale space factor.
The key points are found as the maxima and minima of the difference of Gaussians (DoG) between two images, which makes them scale-invariant. This is done by comparing each pixel with its eight neighbors in the current scale and the nine corresponding neighbors in each of the neighboring scales. Two such extrema images are generated, which requires four DoG images and five Gaussian-blurred images, hence the five blur levels in each octave. The DoG function with adjacent scales separated by the factor $k$ is computed by
$$D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma) \tag{22}$$
Bad key points on edges and in low-contrast regions are rejected using the second-order Taylor expansion of $D(x, y, \sigma)$ at the sample point $X$:
$$D(X) = D + \frac{\partial D^T}{\partial X} X + \frac{1}{2} X^T \frac{\partial^2 D}{\partial X^2} X \tag{23}$$
The location of the extreme point is calculated by taking the derivative of Equation (23) with respect to $X$ and setting it to zero:
$$\hat{X} = -\left(\frac{\partial^2 D}{\partial X^2}\right)^{-1} \frac{\partial D}{\partial X} \tag{24}$$
Then, the low-contrast key points can be obtained by
$$D(\hat{X}) = D + \frac{1}{2} \frac{\partial D^T}{\partial X} \hat{X} \tag{25}$$
Key points with $|D(\hat{X})| < D_0$ are eliminated, which makes the algorithm efficient and robust. Then, the key points along edges are filtered out using the 2 × 2 Hessian matrix of $D$:
$$\begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix} \tag{26}$$
The magnitude and orientation of each key point are calculated to cancel out the effect of orientation, which makes it rotation-invariant:
$$M_D(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2} \tag{27}$$
and
$$\theta_D(x, y) = \tan^{-1} \frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)} \tag{28}$$
SIFT features are thus generated with scale and rotation invariance in place. A SIFT descriptor characterizes a key point by a spatial histogram of the image gradients. The gradient at each pixel consists of the location of the pixel and the orientation of the gradient. The orientation is quantized into eight bins for each of the 16 cells in a 16 × 16 window. Then, the histograms (16 cells × 8 orientations) are stacked into a single 128-dimensional vector.
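The gradient magnitude and orientation in Equations (27) and (28) can be computed for a whole blurred image at once with central differences. The sketch below is an illustration rather than the authors' implementation. It assumes that x indexes columns and y indexes rows, and it uses the quadrant-aware `arctan2` in place of a plain inverse tangent. The function name is hypothetical.

```python
import numpy as np


def gradient_mag_ori(L):
    """Central-difference gradient magnitude and orientation of a blurred image L.

    Returns arrays for the interior pixels only (the one-pixel border is dropped).
    Assumes x runs along columns and y along rows.
    """
    L = np.asarray(L, dtype=float)
    dx = L[1:-1, 2:] - L[1:-1, :-2]   # L(x+1, y) - L(x-1, y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]   # L(x, y+1) - L(x, y-1)
    magnitude = np.hypot(dx, dy)      # sqrt(dx^2 + dy^2)
    orientation = np.arctan2(dy, dx)  # angle in radians, in (-pi, pi]
    return magnitude, orientation


ramp = np.tile(2.0 * np.arange(5.0), (5, 1))  # intensity grows by 2 per column
mag, ori = gradient_mag_ori(ramp)
print(mag[0, 0], ori[0, 0])
```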
The feature point pairs between I and I(t) are selected using the Fast Library for Approximate Nearest Neighbors (FLANN) [35]. The distance of each pair is calculated as the Euclidean distance, and a pair is classified as a match if the distance is less than 0.6. Four feature points near the boundary in I(t) that are similar to feature points in I are selected and used to estimate the image size comparison. If there are fewer than four matching feature points, the feature points of the previous I(t) are used.
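The matching rule can be illustrated with an exact brute-force search, which stands in here for the approximate FLANN search used in the paper. The sketch assumes the descriptors are scaled so that a Euclidean threshold of 0.6 is meaningful (for example, L2-normalized vectors). That scaling and the function name are assumptions.

```python
import numpy as np


def match_descriptors(ref_desc, cur_desc, max_dist=0.6):
    """Pair each reference descriptor with its nearest current descriptor.

    A pair is kept only if the Euclidean distance is below max_dist.
    Returns a list of (reference_index, current_index) tuples.
    """
    ref = np.asarray(ref_desc, dtype=float)
    cur = np.asarray(cur_desc, dtype=float)
    # Pairwise Euclidean distances, shape (n_ref, n_cur).
    dist = np.linalg.norm(ref[:, None, :] - cur[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nearest) if dist[i, j] < max_dist]


ref = [[1.0, 0.0], [0.0, 1.0]]
cur = [[0.0, 1.0], [1.0, 0.1], [5.0, 5.0]]
print(match_descriptors(ref, cur))
```

Brute force is quadratic in the number of descriptors, which is the cost that an approximate nearest-neighbor index such as FLANN is meant to avoid.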
In homogeneous coordinates, the relationship between the four matching feature points in I and I(t) is described by the homogeneous affine matrix H, where $a_{ij}$ are the parameters of the rotation angle $\theta_h$, and $Th_x$ and $Th_y$ are the translations along the x and y axes of the image plane, respectively. A least-squares problem is then solved to obtain the affine matrix. In addition, the Random Sample Consensus (RANSAC) algorithm [36] is used to filter out outliers and find the correct affine transformation. From the affine matrix, we obtain the comparison of I in I(t) as the scale factor $\Omega(t)$.
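The least-squares fit and the size comparison can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formulation: RANSAC outlier rejection is omitted, the scale is taken as the square root of the determinant of the 2 × 2 linear part, and the 55% rule is interpreted as a ratio of linear sizes. All function names are hypothetical.

```python
import numpy as np


def fit_affine(ref_pts, cur_pts):
    """Least-squares affine matrix H (3x3) mapping reference points to current points."""
    ref = np.asarray(ref_pts, dtype=float)
    cur = np.asarray(cur_pts, dtype=float)
    X = np.hstack([ref, np.ones((len(ref), 1))])       # N x 3 homogeneous points
    params, *_ = np.linalg.lstsq(X, cur, rcond=None)   # 3 x 2 solution
    H = np.eye(3)
    H[:2, :] = params.T                                # rows: [a11 a12 Thx], [a21 a22 Thy]
    return H


def scale_factor(H):
    """Linear magnification from reference to current image (assumed definition)."""
    return float(np.sqrt(abs(np.linalg.det(H[:2, :2]))))


def moved_one_metre(H, threshold=0.55):
    """True if the reference appears smaller than `threshold` of the current image."""
    return 1.0 / scale_factor(H) < threshold


ref = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
cur = 2.0 * ref + np.array([3.0, -1.0])   # wall appears twice as large, shifted
H = fit_affine(ref, cur)
print(scale_factor(H), moved_one_metre(H))
```

In a full pipeline, each time `moved_one_metre` returns true the travelled distance would be increased by 1 m and the current frame would become the new reference, as described in the overview of the method.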

Sensor Specifications
An ultrasonic sensor is used to provide information about the distance to nearby objects. The ultrasonic sensor used is the HC-SR04 [37], shown in Figure 7a, which includes a transmitter, a receiver, and a control circuit with Vcc, trigger, echo, and GND pins. The HC-SR04 is triggered by a 10 µs high-level pulse from the processor; it then sends eight cycles of a 40 kHz signal and detects the returning pulse. If the signal returns, the echo pin outputs a high level, and the distance (in cm) is calculated from the duration of this high-level signal. Five ultrasonic sensors are used in this system, connected to a pin connector on an additional board. The trigger and echo pins are connected to analog pins on the processor and are switched alternately using a ULN2803A [38].
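The conversion from echo time to distance is a time-of-flight calculation: the pulse travels to the obstacle and back, so the one-way distance is half the echo time multiplied by the speed of sound. The sketch below assumes a speed of sound of about 343 m/s (air at roughly 20 °C). The constant and the function name are illustrative and are not taken from the paper.

```python
SPEED_OF_SOUND_CM_PER_US = 0.0343  # assumed: about 343 m/s in air at roughly 20 degrees C


def echo_to_distance_cm(echo_high_us):
    """Convert the echo pin's high-level duration (microseconds) to distance (cm).

    The pulse covers the distance twice (out and back), hence the division by 2.
    """
    if echo_high_us < 0:
        raise ValueError("echo duration must be non-negative")
    return echo_high_us * SPEED_OF_SOUND_CM_PER_US / 2.0


print(echo_to_distance_cm(1000.0))  # a 1 ms echo pulse
```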
The XBee pro s2b module [39], shown in Figure 7b, provides a UART interface whose transmit (TX) and receive (RX) lines are connected to the UART pins on the mainboard. The XBee module operates using the Zigbee protocol, a low-power wireless sensor network protocol that requires minimal power and provides reliable data transmission between remote devices.
The video transmitter, shown in Figure 7c, is a 2.4 GHz color CMOS camera that sends aerial images to the radio receiver at the ground station. The camera requires a 9-12 V DC power supply, obtained directly from the battery. Table 1 summarizes the specifications of the components used to support the surveillance system.

The User Interface
Visual Basic .NET 2017 is used for user interface (UI) software programming, as shown in Figure 8. The proposed user interface is used to save and display aerial images, estimate the attitude of the UAV, receive obstacle information, and map the UAV's estimated distance. Arduino is used for mainboard software programming. First, the COM port and baud rate used for serial communication are selected. The UI displays the attitude data of the UAV, i.e., yaw, pitch, and roll, and simulates the current position of the UAV. The UAV is equipped with a small motor to move the camera forward and down. In the UI, there are settings for turning the camera on or off, as well as for moving the camera up or down. The UI also displays a simulation of the current travel path of the UAV and the number of obstacles around the UAV detected by the ultrasonic sensors.

Frame Size Comparison
We collected images over a distance of around 2 m from the starting point in several locations. The first frame served as the reference frame, and every 50 ms the size of the reference frame was compared with the size of the current frame. Figure 9 shows the pre-processing steps used to obtain the average size of the reference in the current frame. The current frame is enhanced using a median filter and aligned based on the correction of the camera position, as described in Section 2.4. Features in the reference and current frames are extracted and described to find matching features, as explained in Section 2.6.

Figure 10 shows the results of the frame size comparison over around 2 m. In our experiments, the average size of the reference frame in the current frame at a distance of 1 m is about 55%. Distance measurement using the estimated frame size ratio works best for every 1 m of forward travel. Figure 10 shows that after the 25th frame, where the distance exceeds 1 m, the frame size ratio no longer changes significantly. This result may be caused by our image quality.
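As a rough illustration of the frame size comparison, the sketch below estimates a size ratio from matched feature points and checks it against the roughly 55% value reported for 1 m. It rests on assumptions: the ratio is taken as the spread of the matched points in the reference frame divided by their spread in the current frame, and the names `spread`, `frame_size_ratio`, and `reached_one_metre` are hypothetical. The exact definition of frame size used in the experiments may differ.

```python
import math


def spread(points):
    """RMS distance of 2D points from their centroid."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return math.sqrt(sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in points) / n)


def frame_size_ratio(ref_pts, cur_pts):
    """Ratio of the reference spread to the current spread.

    ref_pts[i] and cur_pts[i] must be the same wall feature seen in the
    reference and current frames. Moving toward the wall spreads the
    matched points apart, so the ratio drops below 1.
    """
    if len(ref_pts) != len(cur_pts) or len(ref_pts) < 2:
        raise ValueError("need at least two matched point pairs")
    cur_spread = spread(cur_pts)
    if cur_spread == 0.0:
        raise ValueError("current points are degenerate")
    return spread(ref_pts) / cur_spread


def reached_one_metre(ratio, ratio_at_1m=0.55):
    """Assumed rule: a ratio at or below the 1 m value means 1 m travelled."""
    return ratio <= ratio_at_1m


if __name__ == "__main__":
    # Illustrative points only: the current frame is the reference scaled by 2.
    ref = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
    cur = [(2.0 * x + 10.0, 2.0 * y - 5.0) for x, y in ref]
    ratio = frame_size_ratio(ref, cur)
    print(ratio, reached_one_metre(ratio))
```

Using the spread about the centroid makes the ratio insensitive to image translation, so a small residual offset after alignment does not change the estimate.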
Figure 10. Frame size ratio in the n-th frame.

Segmentation
Figure 11 shows the semantic segmentation results for the indoor environment. The walls, floors, roofs, and furniture or other objects in the room are displayed in peach, green, gray, and blue, respectively. The wall area is taken as the area other than the floor, which is shown in green. Selecting only the wall area improves the accuracy of feature detection and recognition in the next step. Because the floor area is too plain for feature detection, including it can cause feature recognition errors and increase computation time.

Figure 12 shows the UAV used in this experiment, which is equipped with a camera and ultrasonic sensors. The distance measured using the proposed algorithm is compared with the actual distance, as shown in Figure 13. The distance measurement results indicate that the proposed algorithm achieves an accuracy rate of 91.7%. Because the current frame is saved every 100 ms, the processing rate is around 8 fps. Although our image quality is low, the proposed algorithm is good enough to measure the UAV distance from the starting point, so the UAV's current location in the building can be estimated. The proposed algorithm gives its best results for distances of less than 12 m. Beyond 12 m, the starting point is too far away and the frame size ratio becomes less accurate. This may be due to the low quality of our images. We believe that the proposed algorithm can be used for longer distances with a higher-quality camera.
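The text does not give the formula behind the reported accuracy rate. A common choice, assumed here purely for illustration, is one minus the mean relative error between the estimated and actual distances. The function name `accuracy_percent` and the sample values are hypothetical and are not the experimental data.

```python
def accuracy_percent(estimated_m, actual_m):
    """Accuracy as 100 * (1 - mean relative error).

    Assumed definition for illustration; actual distances must be non-zero.
    """
    if len(estimated_m) != len(actual_m) or not actual_m:
        raise ValueError("inputs must be non-empty and of equal length")
    relative_errors = [abs(e - a) / a for e, a in zip(estimated_m, actual_m)]
    return 100.0 * (1.0 - sum(relative_errors) / len(relative_errors))


if __name__ == "__main__":
    # Made-up sample values, each about 10% away from the actual distance.
    print(accuracy_percent([0.9, 2.2], [1.0, 2.0]))
```

Under this definition, measurements that are each about 10% off give an accuracy of about 90%.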

Conclusions
This work presents a new method for UAV distance measurement using a vision-based system in an unknown environment. The main contribution of the proposed method is measuring the current UAV distance from the starting point to estimate its location in an indoor environment where GPS cannot be used. The UAV is equipped with several ultrasonic sensors to avoid obstacles. Unwanted motion in the aerial images is handled using an image stabilization method. Semantic segmentation based on deep learning is used to obtain the position of the wall in front of the UAV. The first wall frame is saved as a reference frame, and its size ratio in the current frame is used to determine the current distance of the UAV. The reference frame is updated when the UAV is detected to have moved 1 m forward. The comparison with actual distances shows that the proposed method can be used to determine the location of mobile robots, especially UAVs, based on distance measurements in buildings or places where GPS cannot be used, without prior registration of the place. The proposed method provides more than 90% accurate results for short-distance UAV travel measurements in a building. Additional physical sensors and more accurate feature detection could be added to improve detection over longer UAV travel distances.