An Aerial Mixed-Reality Environment for First-Person-View Drone Flying

Abstract: A drone must be able to fly without colliding with obstacles to protect both its surroundings and itself. In addition, it should incorporate numerous features of interest for drone users. In this paper, an aerial mixed-reality environment for first-person-view drone flying is proposed to provide an immersive experience and a safe environment for drone users by creating additional virtual obstacles when flying a drone in an open area. The proposed system is effective in perceiving the depth of obstacles, and it enables bidirectional interaction between the real and virtual worlds using a drone equipped with a stereo camera based on human binocular vision. In addition, it synchronizes the parameters of the real and virtual cameras to effectively and naturally create virtual objects in a real space. Based on user studies that included both general and expert users, we confirm that the proposed system successfully creates a mixed-reality environment using a flying drone by quickly recognizing real objects and stably combining them with virtual objects.


Introduction
As cameras mounted on drones can easily capture photographs of places that are inaccessible to people, drones have recently been applied in various fields, such as aerial photography [1], search and rescue [2], disaster management [3], and entertainment [4]. Commercial drones are controlled in the third-person or first-person view (FPV), i.e., the view of the unmanned aircraft. In the third-person view, a pilot can view the entire drone in relation to its surroundings during a flight. However, it is difficult to estimate the position and direction of a drone relative to nearby objects to avoid obstacles when flying from a third-person view, because the range of human stereoscopic vision is limited to approximately 90 m and the effective range is considerably smaller. More advanced drone control systems allow pilots to visualize that they are onboard their drones. FPV drones, such as FatShark [5] and Skyzone [6], are remotely controlled using an onboard camera and FPV goggles. These drones send monocular video feeds via analog signals from the camera to the pilot's goggles, with a limited field of view of 35°–43° and resolutions of a video graphics array (640 × 480) or wide video graphics array (800 × 600). Professional drone pilots can safely fly an FPV drone by estimating distances between the drone and obstacles based on their experience; however, it is not easy for novice pilots to fly an FPV drone while avoiding collisions with obstacles due to the low transmission rate, poor reception, high flight speeds, and unexpected obstacles.
According to a remote robot control experiment [7], FPV control using a stereo camera yields fewer errors than using mono images. Hence, the first-person experience could be improved by utilizing more advanced techniques, including stereo views and wide fields of view. Moreover, safer and more exciting environments could be provided to prevent drone crashes and reduce collisions with obstacles, thereby protecting drones and their surroundings. Our research was motivated by the need to provide such safer and more immersive flight environments. The main contributions of this study are as follows:

• We propose a new aerial MR environment using an FPV drone with a stereo camera to provide users with a safe and engaging environment for flight training. The stereo camera improves the accuracy of the pilot's distance estimation through binocular disparities, and a head-mounted display (HMD) with a wide field of view increases immersion.
• Different MR environments for drone flying can be created easily by placing several drone flags in the real world. The flags are recognized by machine learning techniques, and virtual objects are synthesized with the real scene when the flags are detected during flight.
• The proposed system is evaluated through user studies comparing the experience of using our FPV-MR drone system with that of a VR drone flight simulator.
The remainder of this paper is organized as follows. Section 2 presents a review of related works. Section 3 outlines the proposed system and describes the composition of the proposed system. Section 4 summarizes the user studies conducted to evaluate the proposed system. Finally, Section 5 presents the conclusions and future research.

Uses of Drones with Cameras
Drones are operated with a mono or stereo camera mounted, depending on the purpose of use. Methods to accurately track people or objects using drones with a mono camera have been investigated [8,9]. In previous research [8,9], object tracking and control methods for various objects located in the suburbs were proposed; the systems aimed to track objects specified by a user while maintaining a certain distance from them. To capture panoramic views of landmarks, Xie et al. [10] proposed a design tool for drone trajectories that creates a drone flight plan in advance using quadrotor drones equipped with a high-quality mono camera. However, the tool cannot cope with unexpected situations during flight because it uses a rough 2.5D model for the landmarks visited by the drone as well as for additional obstacles. By contrast, Nägeli et al. [11] and Galvane et al. [12] proposed methods for automatically planning the path of a quadrotor in real time in a dynamic 3D environment while satisfying cinematographic constraints and the physical constraints of multiple quadrotors. In their studies, control functions for cinematographic drones were provided to perform specific tasks with dynamic targets. In the entertainment field, racing drones are garnering attention. Drone racing [13] is a sporting event in which operators fly drones quickly through a complex environment, passing them through various predesigned obstacles. The operators can feel as if they are actually flying because of real-time FPV video feeds. Autonomous drone racing competitions [14] have also been held, in which vision-based autonomous drones are flown around a track comprising predesigned gates. In autonomous drone racing, a drone is required to fly through the gates quickly without any collision. To identify drone racing gates, Jung et al. [15] presented a reliable method using deep learning on images acquired through a mono camera onboard a drone.
A stereo camera is mainly used to capture photographs of a 3D space because it can capture the color and depth information of all of the pixels of two separate scenes. Because drones are being spotlighted as a method to access spaces that are inaccessible to people, the use of a stereo camera with drones has been proposed in some recent studies to enhance the ground control and autonomous driving abilities of drones, or to reproduce panoramic outdoor views or buildings in three dimensions. Stereo cameras have been investigated in relation to autonomous drone systems for disaster rescue [16,17], and a reliable ground control system for autonomous system failure has been proposed [18]. To improve the ground control experience of drones, Nikolai et al. [18] attached a stereo camera to a drone and proposed a complete immersive setup that provided an FPV in three dimensions through a VR HMD. In that study, an expert experiment demonstrated that users controlling a drone successfully performed take-off, flight, and landing while wearing a VR headset; moreover, the incidence of simulator sickness for general users was investigated. This approach enabled accurate remote control by providing higher perceptual accuracy during flight. Using stereo images and global positioning system (GPS)/inertial navigation system data, Akbarzadeh et al. [19] proposed a 3D reconstruction method for urban scenes; however, it could not operate in real time because of limited processing speed. Exploiting improved hardware, Geiger et al. [20] suggested a pipeline for building a 3D map from a high-resolution stereo sequence in real time. Research on 3D reconstruction from stereo video streams has been extended using drones. Zhang et al. [21] proposed the use of drones in archeology.
In their study, drones with a stereo camera produced 3D models of a large cave and entered spaces that are not easily accessible by humans to find ruins and provide new insights into archeological records. Furthermore, Deris et al. [22] suggested 3D scanning and reproduction methods for outdoor scenes and buildings using drones with a stereo camera. However, although drones with cameras are being actively researched in the fields of film, disaster response, 3D reconstruction, and autonomous driving, studies on applications that create an MR environment in the air using a stereo camera and a drone are rare.

Virtual and Augmented Reality for Drones
Creating virtual and augmented reality (AR) environments for drones is a recent challenge. Drone manufacturers, including DJI and Parrot, have released low-priced easy-to-control drones, and low-priced VR/AR headsets have recently been launched. Hence, many interesting applications combining drones and VR/AR technologies have been developed. Using drones with cameras and HMDs, a user's view can be expanded to places that are difficult or impossible to capture. Erat et al. [23] suggested using a see-through display for flight control in a narrow or limited environment. This system simulates X-ray vision in a walled environment to enable users to naturally view and explore hidden spaces. Ai et al. [24] proposed a system that allows users to interactively control a drone and visualize the depth data of photographed parts in real time using a point cloud, which can be viewed on a VR HMD.
Research has also been conducted on mixing real scenes from drone cameras with virtual objects. For large-scale AR, Okura et al. [25] proposed a hardware configuration and virtual object overlay methods for aerial photography using a multidirectional camera and a large airship. Aleotti et al. [26] presented a haptic augmented interface to detect nuclear sources outdoors and verified their locations using a fixed ground camera. Zollmann et al. [27,28] proposed a drone navigation method for 3D reconstruction of an area of interest using a drone equipped with a high-resolution camera and AR. Their AR system augments the supervisor's view with information relevant to the autonomous flight plan and real-time flight feedback. Sun et al. [29] developed a video augmentation system using differential global positioning system (DGPS) sensors for remote disaster monitoring. In AR research on construction environment monitoring, a system for the midair 3D reconstruction of a construction site has been presented, where data of interest are captured and then overlaid with those of the actual site on a mobile device.
Appl. Sci. 2020, 10, 5436
However, in the abovementioned studies, the locations of augmented objects are occasionally mismatched with their locations in the physical environment, which can confuse users. This is because virtual object mixing using a mono camera or GPS can easily introduce sensor errors. Furthermore, their main objective is not to provide an immersive experience for users but to provide information through a two-dimensional (2D) screen interface or 3D point-cloud-based VR. As previously mentioned in Section 2.1, drones with stereo cameras have been investigated for autonomous driving [16–18] and 3D reconstruction [19–22], and Nikolai et al. [18] proposed a similar immersive setup using a VR HMD, but only for ground control purposes. In addition, the recent studies [25–28] discussed in Section 2.2 combined advanced AR technologies with drones to superimpose valuable graphic data over the images that come from the drones' cameras. MR is a hybrid of the physical and virtual worlds, encompassing both AR and augmented virtuality via immersive technology. Stereo cameras are appropriate for generating an immersive MR environment; however, no previous study exists in which stereo cameras mounted on drones are used to create an aerial MR environment and expand the ground control experience of drones using a VR HMD.

Object Recognition and Immersion Measurement in Mixed Reality
To create an MR environment based on the real world, it is crucial to accurately match the locations of virtual objects to be merged with those of real objects. In other words, the real objects, which are the criteria for positioning the virtual objects to be rendered, must be recognized accurately. As an example, virtual objects, such as trees and obstacles, can be dynamically created in the MR environment when a drone recognizes flags or water bottles in real scenes. Furthermore, a stable MR system must interact with users in real time to eliminate any sense of discrepancy with the real world. In fact, users are more sensitive to visual mismatches than to virtual objects not moving in the same manner as real objects, because the matching errors of virtual objects result in an unnatural MR. Moreover, an MR environment for FPV drone flight requires faster and more accurate identification when deciding the locations of virtual objects to be mixed with real scenes, because a flying drone changes its position and orientation more dynamically from moment to moment than human vision does.
Object recognition has developed rapidly in the field of computer vision because it can be used in major applications, such as intelligent vehicle control, monitoring, and advanced robots. The technology for object recognition has been investigated mainly for pedestrian detection [30] and face recognition [31]. This technology extracts the features of objects using the scale-invariant feature transform [32] and histogram of oriented gradients [33] methods, and uses them for identification. Methods for solving such classification problems have developed steadily, and their performance has improved through the use of ensembles and refinements of earlier studies [34,35]. Recently, Krizhevsky et al. [36] achieved a significant performance improvement in the ImageNet Large-Scale Visual Recognition Challenge using AlexNet, which is based on a convolutional neural network (CNN). Subsequently, CNNs for object recognition have been investigated actively. Region-based CNNs (R-CNNs) using selective search have been proposed, and fast R-CNN [37] and faster R-CNN [38], with higher speeds, have been released. Fast R-CNN cannot perform real-time object recognition because its region proposal method operates outside the CNN and is slow. The faster R-CNN, however, improves both accuracy and speed by running the region proposal algorithm inside the CNN, thereby enabling real-time recognition with high accuracy. In this study, the faster R-CNN proposed by Ren et al. [38], which can perform accurate object recognition in real time, was used as the object recognition method.
Previous studies have not precisely defined user experience components or methods to measure them in an immersive virtual environment; however, they have been studied indirectly in flight simulators [39] and games [40]. Simulator sickness has been investigated mainly among aircraft pilots. The simulator sickness questionnaire (SSQ) is the current standard for measuring simulator sickness; it is recognized as a sound single measure because it provides individual indicators of overall symptoms. The game engagement questionnaire (GEQ) has been widely used as a measure of user engagement in games. Based on these, a more accurate method of measuring immersion in a virtual environment [41] was proposed, in which the reliability of the questionnaire was improved using 10 rating scales and 87 questions. Questions corresponding to immersion, presence, emotion, flow, judgment, usability, and experience were posed and answered on 10 scales. In this study, user immersion was measured with a simple questionnaire that combined the GEQ [42] and SSQ [39] to verify the efficiency of the proposed framework.

System Overview
In this paper, an immersive MR environment for drone flying is proposed to expand the ground control experience of drone operators. It synthesizes virtual environments such as forests and cities with a wide empty space in which a drone can be flown safely. Figure 1 shows the software architecture of the proposed system. The stereo camera mounted on the drone recognizes predefined objects from the input image. The 2D coordinates of the recognized objects are mapped to three dimensions and mixed with virtual 3D objects using depth information from the stereo camera. Because some virtual objects in this MR environment are generated dynamically based on real objects, the locations of the virtual objects can be changed by moving the locations of actual objects, such as flags or bottles, in the real world. Object recognition and the mixing of virtual and real scenes are performed using a wearable VR backpack PC. The mixed scene is stereo rendered and is shown to the user through an HMD connected to the user's VR backpack.
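The per-frame flow described above (stereo capture, object recognition, 2D-to-3D mapping, and virtual object placement) can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: all class names and the simplified back-projection are hypothetical placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Detection:
    label: str
    center: tuple      # (u, v) pixel coordinates of the recognized object
    depth_m: float     # depth at the center, from the stereo camera

@dataclass
class MixedScene:
    virtual_objects: list = field(default_factory=list)

    def back_project(self, center, depth_m):
        # Placeholder mapping from 2D pixel + depth to a 3D point; a real
        # system would apply the camera intrinsics here.
        u, v = center
        return (u * depth_m, v * depth_m, depth_m)

    def place_virtual_object(self, label, point3d):
        self.virtual_objects.append((label, point3d))

def mr_frame_pipeline(detections, scene):
    """One frame of the flow described above: each recognized real object
    is mapped to 3D, and a virtual object is spawned at that position."""
    for det in detections:
        scene.place_virtual_object(det.label,
                                   scene.back_project(det.center, det.depth_m))
    return scene

# A recognized flag at pixel (320, 240), 4 m away, spawns a virtual object.
scene = mr_frame_pipeline([Detection("flag", (320, 240), 4.0)], MixedScene())
```

In the actual system, the composited scene would then be stereo-rendered to the HMD; the sketch stops at object placement.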

Aerial Mixed Reality Using a Drone with a Stereo Camera
The proposed system uses a stereo camera (StereoLabs ZED Mini, San Francisco, CA, USA [43]) with a baseline distance of 64 mm, which is the binocular distance of an average adult. As shown in Figure 2, the ZED stereo camera mounted on the drone is used for passive 3D depth estimation and provides depths of 0.7–20 m with an accuracy of 0.4–1 mm. For outdoor applications, the ZED camera is significantly better than a Microsoft Kinect, which is an active 3D depth estimator using an infrared laser. In this study, a ZED camera was mounted at the front of a DJI F550 hexacopter (Shenzhen, China) to observe the flight path of the drone from the FPV. The drone has a maximum payload of approximately 600 g and is appropriate for aerial photography. In the current system, the ZED camera is connected to a mobile PC through a USB 3.0 interface; video streams are sent over a special 10 m USB 3.0 cable, known as a repeater cable, which the drone can lift while capturing videos with excellent air stability. In an environment where wireless transmission is possible and the weather is fine and not windy, the camera could be connected to a smaller drone. In this drone system, the stereo camera simulates human binocular vision of actual scenes. It provides FPV flight scenes through an Oculus Rift CV1 headset connected to the user's VR backpack weighing 4.6 kg, comprising an Intel Core i7-7820HK, an NVIDIA GeForce GTX 1070, and 16 GB of RAM.
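Passive stereo depth estimation with the 64 mm baseline mentioned above follows the standard pinhole triangulation relation depth = f · B / d. The sketch below illustrates this relation only; the focal length value in the example is an assumption for illustration, not a ZED specification, and the actual ZED SDK computes depth internally.

```python
def stereo_depth_m(disparity_px, focal_px, baseline_m=0.064):
    """Pinhole-stereo triangulation: depth = f * B / d.

    baseline_m = 0.064 matches the 64 mm baseline quoted above; focal_px
    is camera-specific (the value used in the example is illustrative).
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# With an assumed focal length of 700 px, a disparity of 10 px between
# the left and right images corresponds to a depth of about 4.5 m.
depth = stereo_depth_m(disparity_px=10, focal_px=700)
```

The inverse relation also explains why depth accuracy degrades with range: at large distances the disparity shrinks toward the sub-pixel level, consistent with the bounded 0.7–20 m working range quoted above.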

To avoid drone crashes, inexperienced drone users should fly a drone in a location with fewer obstacles, such as a playground, as shown in Figure 3a. In our system, users can select one of the virtual scenes, including a forest, grassland, and city, and can fly around from the viewpoint of the drone in the desired location. In fact, users can fly the drone by mixing all or part of the virtual location they desire with the real world without being present in that location.
The real scenes obtained from the stereo camera mounted on the drone are synthesized with virtual scenes on the user's backpack PC. Finally, the user can feel as if he/she is flying a drone in the desired location while looking at the synthesized scenes via an HMD in FPV mode. Figure 3b–d shows the pilot's view while flying along a virtual path passing through a forest. The head-up display, including the altitude and three-axis information required for drone flight, is displayed on the left of the screen. Figure 4a,b shows the pilot's view while flying between buildings in a virtual city. Hence, the user can experience a sense of flying in an engaging, immersive manner by mixing the desired virtual environment with the real world.

Flight in a Mixed Reality Environment
To generate virtual objects dynamically using real objects, the locations of the real objects must be determined from the images obtained by the stereo camera. In this study, predefined objects were recognized through transfer learning of a faster R-CNN, as described in Section 2.3. Transfer learning is a popular method in computer vision because it can build accurate models in a timesaving manner [44].
Transfer learning refers to the use of a model pre-trained on a large benchmark dataset for a new problem. Because of the computational cost of training, it is typical to use models from the published literature (e.g., Visual Geometry Group, Inception, MobileNet). A comprehensive review of the performance of pre-trained models on data from the ImageNet challenge [45] was presented by Canziani et al. [46]. For transfer learning in this study, the Google Inception V2 module [47], one of the most commonly used CNN backbones since the release of TensorFlow, was used as the backbone network. Subsequently, a faster R-CNN model pre-trained on the COCO dataset [48] was used. Data for the objects to be recognized were collected from images captured during actual FPV drone flights, and the parameters of the model were adjusted for this system using the captured images as training data. After training, flags were identified from real-time video frames with an average processing time of 300 ms and an average accuracy of 97%. Here, the accuracy was calculated as the ratio of the number of correctly identified images to the total number of test images with flags acquired from an experimental flight lasting around 1 min.
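The detector's raw output must be reduced to confident bounding-box centers before 3D mapping, and the accuracy figure above is a simple ratio. Both steps can be sketched as below; the dict-based detection format and the 0.5 score threshold are assumptions for illustration, not the network's actual output structure or the authors' settings.

```python
def detection_centers(detections, score_threshold=0.5):
    """Keep confident detections and return each bounding-box center.

    Each detection is assumed to be a dict with a 'score' and a 'box'
    given as (x_min, y_min, x_max, y_max); this structure is illustrative.
    """
    centers = []
    for det in detections:
        if det["score"] >= score_threshold:
            x0, y0, x1, y1 = det["box"]
            centers.append(((x0 + x1) / 2, (y0 + y1) / 2))
    return centers

def detection_accuracy(num_correct, num_test_images):
    """Accuracy as defined above: correctly identified images divided by
    the total number of test images containing flags."""
    return num_correct / num_test_images

# One confident flag detection is kept; the low-score detection is dropped.
centers = detection_centers([
    {"score": 0.97, "box": (10, 20, 30, 60)},
    {"score": 0.30, "box": (0, 0, 2, 2)},
])
```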
Virtual objects can be created interactively using physical objects. For example, large virtual rings can be generated next to drone racing flags, as shown in Figure 5, which can be passed through to practice drone flying. As another example, trees in a forest can be generated from water bottles for practicing obstacle dodging, as shown in Figure 6.

Target objects such as flags or bottles are identified by the faster R-CNN, and the center point is then calculated using the start and end points of the 2D coordinates of each object identified in the 2D image. The ZED camera calculates the distance between a specific pixel and the camera using triangulation [49]. After obtaining the distance z between the center point of the identified object and the camera, the point is mapped to a 3D virtual space. The 3D virtual space is first defined by the viewpoint vector of the camera: the viewpoint of the camera is the z-axis, the width is the x-axis, and the height is the y-axis. The locations of virtual objects are mapped in the Unity3D engine in the following manner.
The center point calculated for 3D mapping must be adjusted according to the size of the screen rendered to the user, as given by Equation (1), where (x, y) is the center of the detected object, p is the size of the rendered plane, and h is the size of the image obtained from the camera. The Unity3D engine then computes the virtual world coordinates by reverse-projecting from the current projection plane into 3D space, using x' and y' from Equation (1) and the depth value z from the ZED camera. Finally, the virtual objects are offset in the x- and y-axis directions so that the real and virtual objects do not overlap at the 3D coordinates of the converted target object; the offset distance is determined by the size of the virtual object to be generated. Each of these tasks is performed in the background, because performance can degrade if the tasks are executed separately for each of the multiple objects on the screen. When detection is completed, the 3D coordinates converted by reverse projection are updated to V using Equation (2), where V denotes the 3D coordinates. The 3D coordinates x', y', and z in the current frame are corrected to F(V_f) by linear interpolation, as shown in Equation (2), where δ is a weight based on the linear and angular velocities; in Equation (3), fps denotes frames per second and ω the angular velocity of the drone. In Equation (2), V_{f-1} is the 3D coordinates of the virtual object in the previous frame, and V_f is the 3D coordinates converted by reverse projection in the current frame.
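The adjustment and smoothing steps above can be sketched as follows. The exact forms of Equations (1)–(3) are not reproduced in the text, so the formulas below are plausible reconstructions from the surrounding prose (proportional rescaling for Eq. (1), linear interpolation toward the previous frame for Eq. (2), and a velocity-dependent weight for Eq. (3)); treat them as assumptions, not the paper's equations.

```python
# Hedged sketch of the coordinate adjustment and smoothing described above.
# All three formulas are assumed reconstructions of Equations (1)-(3).

def adjust_to_plane(x, y, p, h):
    """Assumed Eq. (1): rescale the detected center from image size h
    to rendered-plane size p, (x', y') = (x, y) * p / h."""
    s = p / h
    return (x * s, y * s)

def smooth(v_prev, v_curr, delta):
    """Assumed Eq. (2): F(V_f) = V_{f-1} + delta * (V_f - V_{f-1}),
    linear interpolation between the previous and current 3D coordinates."""
    return tuple(a + delta * (b - a) for a, b in zip(v_prev, v_curr))

def weight(omega, fps, k=1.0):
    """Assumed Eq. (3): delta grows with the drone's angular velocity omega
    relative to the frame rate, clamped to [0, 1]."""
    return min(1.0, k * omega / fps)

x_adj, y_adj = adjust_to_plane(640.0, 360.0, p=1.0, h=720.0)
delta = weight(omega=15.0, fps=30.0)
v = smooth((0.0, 0.0, 5.0), (0.2, 0.0, 5.2), delta)
```

The interpolation damps frame-to-frame jitter in the mapped position: a fast-rotating camera (large ω) pushes δ toward 1 so the anchor follows the new detection quickly, while slow motion keeps δ small and favors the stable previous position.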
One caution in creating an MR environment is that virtual objects appearing suddenly due to camera rotation can harm the user's sense of reality. To prevent this, generated virtual objects are continuously tracked in virtual world coordinates. This prevents the unnatural remapping of virtual objects that reappear after leaving the screen, and enables a stable MR system that does not depend heavily on the object identification speed.
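The persistence idea above amounts to anchoring each virtual object in world coordinates once it has been mapped, so it re-renders at the same position even while it is off-screen or the detector misses a frame. A minimal sketch, with names that are illustrative rather than the paper's implementation:

```python
class VirtualObjectRegistry:
    """Keeps virtual objects anchored in world coordinates across frames."""

    def __init__(self):
        self._anchors = {}  # object id -> last known world-space position

    def on_detection(self, obj_id, world_pos):
        # Refresh the anchor whenever the detector re-identifies the object.
        self._anchors[obj_id] = world_pos

    def position(self, obj_id):
        # Missed detections (or off-screen objects) fall back to the stored
        # anchor, so an object reappears where it was, not where it is
        # re-detected; returns None for objects never seen.
        return self._anchors.get(obj_id)

reg = VirtualObjectRegistry()
reg.on_detection("ring_1", (1.0, 0.5, 6.0))  # first detection anchors it
# ...frames pass with no detection while the object is out of view...
print(reg.position("ring_1"))                # still the anchored position
```

Because rendering reads from the registry rather than from per-frame detections, the MR scene stays stable even when the recognizer runs slower than the display frame rate.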
The generated virtual obstacles can be designed as new shapes of obstacles in the flight track planning stage, and the types of physical objects that determine the locations of virtual objects can be selected freely. In addition, virtual obstacles can be added, moved, or removed during flight, so the user's MR environment can change dynamically to reflect changes in the real environment. Figure 7a shows a drone racing track for generating an MR environment; Figure 7b-d shows the view of the pilot flying in an MR environment created using this system. Users can fly a drone in MR with a latency below 20 ms when flying at approximately 3-4 m/s. Finally, when the MR scene is rendered, the virtual objects must be post-rendered in accordance with the characteristics of the image obtained from the stereo camera. In images acquired from a stereo camera, motion blur arises from drone movement or rotation; it does not occur in the virtual objects, which can create a perceptible mismatch. Therefore, the parameters of the real and virtual cameras must be matched. In this system, the camera parameters of Unity3D were set to match the ZED mini camera: an aperture of f/2.0, a focal length of 28 mm, a viewing angle of 110°, and an exposure of 50% of the camera frame rate. The focal length can vary with the camera resolution and calibration; in the ZED mini camera, the focal length was 1400 pixels with a pixel size of 0.002 mm. For system efficiency, the viewing angle was set to 110°. The exposure was set to 50% of the camera frame rate, and the degree of motion blur was determined in real time.
Hence, the differences between images rendered by the cameras of the real and virtual worlds were minimized by synchronizing the aperture value, focal length, viewing angle, and exposure value.
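The conversion between the pixel-unit focal length reported by calibration and a millimeter value is a single multiplication by the pixel pitch. Note that with the stated values (1400 px, 0.002 mm/px) the physical focal length works out to 2.8 mm, so the 28 mm entered in Unity3D presumably corresponds to a 35 mm-equivalent figure; the snippet below only illustrates the unit conversion itself.

```python
def focal_px_to_mm(focal_px, pixel_pitch_mm):
    """Physical focal length from its pixel-unit value: f_mm = f_px * pitch."""
    return focal_px * pixel_pitch_mm

# Values stated in the text: 1400 px focal length, 0.002 mm pixel size.
print(focal_px_to_mm(1400.0, 0.002))  # physical focal length in mm
```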

Experimental Environment
To conduct user studies of an FPV drone flight in MR, we constructed an experimental drone racing environment in a drone airfield (8 m in length, 20 m in width, and 5 m in height), as shown in Figure 8. Three checkpoint flags and one destination were set along the drone flight path. While an FPV drone was flown, virtual 3D obstacles were created at each checkpoint and mixed with the video feed from the drone camera. The flight time to the destination was set to approximately 1 min so that the flight could be completed within a short time. To adjust the difficulty, more checkpoints can be added or the drone can be flown in a larger area. First, a user study was conducted to investigate the differences in immersion in terms of stereoscopic sense and presence, in which an MR-based FPV flight was compared with a VR-based FPV flight in a simulated manner. The experimental data were collected after obtaining the written consent of each participant, in accordance with the Declaration of Helsinki and our institutional regulations (Research Ethics and Integrity Committee, 20200662, Sejong University). Furthermore, the participants received an explanation and video training from the researcher about each environment, and were notified before the experiment about the system components of each environment (viewing angle, monocular or binocular vision, resolution, etc.). A second user study was conducted to investigate the differences in FPV flights when actual pilots controlled drones in real-world, VR, and MR environments.

Immersion in FPV Flights
To evaluate the usability of the proposed MR-based FPV flight, its flying experience was compared with that of a drone flight simulator. For this comparison, 16 participants (average age: 25.44 years; range: 22-31 years) wore a VR HMD and viewed flying scenes controlled by an experienced drone pilot in each environment. The experience of VR simulators is generally evaluated by active task completion. However, in the first user study, we compared the cinematic VR experiences in both environments so that differences were not caused by the operation difficulty between a real FPV drone and a simulator. These experiences are passive, but the viewer is still engaging in multiple cognitive activities. The evaluation techniques for cinematic VR experiences were proposed by MacQuarrie et al. [50] and a similar evaluation approach to ours was used in [51]. Figure 9 shows the flying scenes of each environment and a participant in the indoor experiment. The top images are the viewpoint of the drone flying in a virtual environment, which was created similarly to the experimental environment shown in Figure 8 by modifying a virtual space provided by Microsoft AirSim, a popular drone flight simulator. The bottom images are the view of the pilot flying in our MR environment. The flying scenes from a stereo camera were recorded for 1 min in the same experimental environment. The drone pilot flew at a speed below 4 m/s to the destination in each environment without hitting the obstacles. In both environments, a viewing angle of 110° and a resolution of 2160 × 1200 were used. All participants were experienced in playing 3D video games using a VR HMD, and no participant felt strong motion sickness during the experiment. The participants engaged in the experiment for up to 20 min including the warm-up time, and they experienced drone flights at a stable speed below 4 m/s because a higher drone speed could induce simulator sickness [41]. 
Subsequently, the participants responded to the questionnaire shown in Table 1 to measure their overall immersion after experiencing each environment. The simplified questionnaire for measuring immersion [39] comprised 24 questions. The 18 questions measuring immersion in the configuration of each environment were answered on a five-point agreement scale: four questions regarding emotion, three regarding flow, three regarding immersion, three regarding judgment, and five regarding presence. Furthermore, because the degree of simulator sickness [52] affects the user's immersion, six questions regarding experience consequences were posed to determine it. Positive questions were marked "P" and negative ones "N". The reactions of each participant were analyzed by scoring their replies to each question on a five-point Likert scale (strongly agree = 5, agree = 4, average = 3, disagree = 2, and strongly disagree = 1); for negative questions, the scores were reversed. Each question's result was calculated as the average score over all participants. Figure 10 shows the average score and standard deviation of immersion (the 18 immersion questions plus the six simulator-sickness questions affecting immersion) for each environment. Both environments showed positive results for immersion, with scores of 3 points (average immersion) or higher. Furthermore, the flight in the MR environment showed an overall immersion approximately 0.41 points higher than the flight in the VR environment.
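The scoring scheme above (5-to-1 Likert scale, reverse-scored negative items, per-question means) can be sketched as follows; the response data are invented for illustration.

```python
# Sketch of the Likert scoring described above: five-point responses are
# scored 5..1, negative ("N") questions are reverse-scored, and each
# question is summarized as the mean across participants.

def score(response, polarity):
    """Map a 1-5 Likert response to a score; reverse it for negative items."""
    return response if polarity == "P" else 6 - response

def question_mean(responses, polarity):
    """Average score of one question over all participants."""
    scores = [score(r, polarity) for r in responses]
    return sum(scores) / len(scores)

# e.g., a negative simulator-sickness item answered (2, 1, 3) by three users
print(question_mean([2, 1, 3], "N"))   # (4 + 5 + 3) / 3 = 4.0
```

Reverse-scoring makes higher always mean "better for immersion", so positive and negative items can be averaged together as in Figure 10.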
We performed a two-sample t-test to determine whether there was a statistically significant difference between the means of the two groups, since the data followed a normal distribution and the variances of the two groups were equal. The resulting two-sided p-value was 0.011; therefore, there was a significant difference between the VR and MR environments regarding user immersion at the 95% level (p < 0.05). This suggests that the immersive elements added by the stereoscopic view of the real world obtained from the stereo camera contributed positively to the overall evaluation measure.
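The test above is the pooled-variance (equal-variance) two-sample t-test, which is appropriate given the stated normality and equal variances. A minimal stdlib implementation, with invented scores rather than the study's data:

```python
# Pooled-variance two-sample t statistic, as used in the analysis above.
import math

def pooled_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for the equal-variance
    two-sample t-test."""
    na, nb = len(sample_a), len(sample_b)
    ma, mb = sum(sample_a) / na, sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    t = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

t, dof = pooled_t([1.0, 2.0, 3.0], [3.0, 4.0, 5.0])  # illustrative samples
```

The two-sided p-value is then obtained from the t distribution with `dof` degrees of freedom (e.g., via `scipy.stats.t.sf(abs(t), dof) * 2` if SciPy is available).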

Actual Flight Experiences
As a formative assessment, the second user study examined the effectiveness of the proposed MR environment. To investigate the differences in FPV flights when actual users controlled the drone in real-world, VR, and MR environments, four drone pilots (average age: 32.75 years; range: 25-36 years) were recruited for a qualitative user study. Two were expert pilots belonging to the FPV racing drone community, and the other two were experienced in FPV drone flights. Pilots skilled in drone control were recruited because they were required to compare this MR system with an actual FPV drone system and to be skilled at playing 3D video drone games using a VR HMD; the group satisfying these conditions was therefore small. In the experiment, no participant experienced simulator sickness. Figure 11 shows a user controlling a drone using the proposed MR system and their viewpoints. To compare the immersion of controlling the drone in each environment, we conducted a qualitative user study after the expert pilots completed a real FPV drone flight, an MR-based FPV drone flight, and a flight of a virtual FPV drone in the AirSim virtual simulator. Figure 7 shows the images recorded during this experiment. The flight in the proposed MR environment had a wider viewing angle and higher resolution than the real FPV drone flight. Furthermore, the camera mounted on the proposed MR drone was a stereo camera, which provided a greater stereoscopic sense than the images received from a drone with a mono camera. Finally, the generated virtual obstacles were compared with real obstacles. All of the participants were skilled drone pilots with experience playing 3D video games using a VR HMD. However, two of the pilots had previously flown only third-person-view drones, and participated in the experiment after becoming familiar with FPV drone flights through several practice sessions.
The drone experts flew in the MR environment after completing flights with a real FPV drone and in the VR environment. Tables 2 and 3 list the differences between the environments.
Compared with the real FPV drone flight, the positive differences of the proposed MR environment are a wide viewing angle, high resolution, a strong sense of depth, and no required adaptation time. The participants reported that their concentration on manipulation increased as the viewing angle widened during flight. When flying a real FPV drone, they must become acquainted with the flight path before starting a flight because the current moving direction cannot be determined easily from the limited viewing angle. This is consistent with the findings of Bowman et al. [53] and Lin et al. [54]: when the viewing angle widens, the pilot can concentrate on understanding the space because distractions to spatial understanding and spatial memory are reduced. All participants felt the presence of the virtual objects created by the MR system as if they were real obstacles. However, they reported slight simulator sickness caused by the generated virtual 3D objects, particularly when a crash with a virtual object occurred. The weak simulator sickness felt in this case may be attributed to two factors. The first is that the sense of reality collapsed because the drone passed through the obstacles without falling or shaking, as it would in a collision with a real object. The second is that extreme differences occurred in the images rendered to the two eyes because only one of the two eyes of the stereo camera overlapped with the inside of a virtual object. The expert pilots reported that appropriate effects during a collision with virtual objects may help alleviate simulator sickness.
Figure 11. A pilot controlling a drone using the proposed system and the scene from the user's point of view.
Compared with FPV drone flights in a virtual simulator, the proposed MR environment offered the advantages of natural drone movement, significantly less simulator sickness, and a strong sense of depth. The participants reported that the MR flight environment provided a better sense of space than the virtual flight simulator. Furthermore, the participants felt considerably less simulator sickness than during the virtual FPV drone flight, because the proposed MR environment, in which virtual obstacles are mixed into the real world through a stereo camera, was more realistic and stereoscopic than the virtual simulator environment.
Through experiments with experts and non-experts, we discovered that the simulator sickness felt in the virtual environment differed from that felt in the MR FPV flight. In fact, the expert group who manipulated the drone for flights did not experience simulator sickness during an MR flight; however, 26% of the non-expert participants who experienced prerecorded MR felt simulator sickness.
Based on the findings of Padrao et al. [55], we assume that, in the comparison of the VR and MR environments, the ability to immediately observe both the intended behavior (manipulation) and its result significantly affected the real-time condition. This suggests that the proposed MR flight environment can provide a higher degree of immersion than the VR flight simulator under real-time conditions.

Conclusions and Future Work
We proposed a flight environment that enables easy flight control through a wide viewing angle and a stereoscopic view, and that is safe from drone collisions by using virtual obstacles mixed with the real world instead of actual obstacles. Drone racing flags used to configure the drone flight environment were learned and identified using a faster R-CNN. To mix the 2D coordinates of identified objects with a virtual environment, they were mapped three-dimensionally to virtual world coordinates, and an appropriate method of placing virtual objects was investigated. For a qualitative evaluation, test flights were performed by expert pilots using this method, and the results confirmed that the FPV MR drone flight did not cause simulator sickness. User immersion was surveyed by comparing cinematic experiences with existing real FPV and virtual simulator FPV drone flights, and the results showed that the immersion of the proposed method improved. A two-sample t-test showed a statistically significant difference between the two groups regarding user immersion at the 95% level.
The proposed FPV MR flight provided drone pilots with an experience that improves immersion in the flight environment, and provided positive assistance to the FPV drone flights of non-experts. Furthermore, the bidirectional MR interactions between the real and virtual worlds can be useful for cutting-edge unmanned aerial vehicles for entertainment. In the experiments conducted in this study, the streaming video from the drone maintained 30 fps, and the overall speed of the framework was maintained at 50-70 fps. However, object recognition in every frame was limited because, in addition to performing object recognition on the user's VR backpack, the mixed reality must be created and rendered in the VR HMD. Object recognition was therefore performed in the background of Unity3D, because its processing time exceeded the target latency of 20 ms. Furthermore, each frame for which object recognition was not performed was interpolated to minimize tremor or incorrect matching of virtual objects.
In the future, we will study and apply more efficient object recognition and tracking techniques to increase real-time object recognition, and construct an efficient MR environment. Furthermore, we will conduct more comprehensive research focused on user experience, and extend our study to a fully mobile MR environment using wireless communication.