Path Generator with Unpaired Samples Employing Generative Adversarial Networks

Interactive technologies such as augmented reality have grown in popularity, but perceiving and analyzing the environment in real time to deliver an immersive experience requires specialized sensors and high computational power. Such implementations, however, have high costs. On the other hand, machine learning has helped create lower-cost alternative solutions, but these are limited to particular problems because the creation of datasets is complicated. To address this problem, this work suggests an alternative strategy for dealing with limited information: unpaired samples from known and unknown environments are used to generate a path on embedded devices, such as smartphones, in real time. This strategy generates a path for virtual elements that avoids the physical objects in the scene. The authors suggest an architecture for creating a path using imperfect knowledge. Additionally, an augmented reality experience is used to present the generated path, and several users tested the proposal to evaluate its performance. Finally, the primary contribution is the approximation of a path produced from a known environment by using an unpaired dataset.


Introduction
Different academic domains, as well as daily life, have been profoundly influenced by interactive multimedia systems [1]. For instance, they provide capabilities for robotic exploration in order to locate obstacles in a physical area [2]. In the realm of e-learning, materials are displayed to help students understand abstract concepts [3]. The use of virtual elements in systems for training in the healthcare industry was described in [4,5]. Therefore, the primary contribution of an interactive multimedia system is the enhancement of real-world settings with virtual components. Such technologies are described by the concepts of augmented reality (AR) and mixed reality (MR). AR technology allows computer-generated virtual visuals to precisely overlay real-world objects in real time [6]. MR, on the other hand, enables concurrent work in both the physical and virtual domains, hence minimizing domain-transition costs [7]. The primary distinction between MR and AR is that MR provides a more immersive experience.
Since MR requires a comprehension of the surroundings, specific equipment containing sensors to perceive the environment is required [8]. In order to provide a real-time user-interaction experience, the system must handle a great volume of data, which generates numerous important opportunities and difficulties [9]. Consequently, the current equipment is intended for a limited audience, due primarily to its technological requirements, which limit the gradual extension of this experience on a large scale [10]. Nevertheless, unlike MR, AR does not require a fully immersive experience, which reduces the cost and the feature requirements of the equipment. For instance, mobile devices, such as smartphones, are an integral part of people's everyday lives and are widely used for a variety of tasks. In 2020, there were at least 50 billion such gadgets [11]. These devices have provided diverse interactive experiences, including AR experiences, mostly through markers and telemetry [12].
Existing technologies that enable interactive AR experiences, such as the ARCore and ARKit frameworks, make it possible to comprehend interactions with the environment in real time [13]. These frameworks concentrate on gathering information from flat surfaces to display virtual elements, but they disregard the fact that objects may cause an accident between the user and the surrounding environment [14,15]. Therefore, current solutions lack an approach to a fundamental obstacle in providing this type of application: the safety of the environment. The authors of [16] defined collision avoidance as finding a path or trajectory between two configurations embedded in the cost field while motion restrictions are considered. Similarly, free space allows agents to interact freely with the environment and avoid accidents.
For this reason, environmental comprehension is essential for real-time processing of the environment. In order to apply solutions that enhance the features of constrained equipment, data processing that employs machine learning (ML) techniques is required [17]. For instance, the authors of [18] emulated the behavior of a sophisticated sensor by using machine learning. Deep convolutional neural networks (DCNNs) reduce the number of operations and can operate on devices with limited computational resources [19]. In addition, this technique was supplemented by generative adversarial networks (GANs) to create depth-image samples [20], enabling the production of a depth image via competition between two networks.
The creation of a specific dataset is a significant constraint of existing methods. Such datasets comprise paired samples, since the predicted sample and the source image have linked attributes, which is a disadvantage for this type of dataset in terms of data collection. For example, RGB-D data are required in order to perceive the scene, but this strategy may be less effective in undetermined scenes, since depth information is needed to estimate samples. When the predicted images are unconnected to the source images, the problem involves unpaired samples [21]. However, the concept of neural style transfer, which involves learning the features from one sample to another, is an alternative method for merging sample features [22]. Cycle-GAN and pix2pix are two approaches for combining the properties of one domain with those of another [23]. With style transfer, an unknown domain can be connected to a known environment without prior information.
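The unpaired-translation idea above can be illustrated with the cycle-consistency objective popularized by Cycle-GAN: translating a sample to the other domain and back should recover the original, which is what lets training proceed without paired samples. The sketch below is purely illustrative and uses toy linear maps as stand-ins for the generator networks (the names G, F, W_g, and W_f are assumptions, not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generators": linear maps standing in for the Cycle-GAN networks.
# G translates domain X -> Y, F translates Y -> X (illustrative only).
W_g = rng.normal(size=(3, 3))
W_f = np.linalg.inv(W_g)  # exact inverse, so the cycle loss should vanish

def G(x):
    return x @ W_g

def F(y):
    return y @ W_f

def cycle_consistency_loss(x_batch, y_batch):
    """L1 cycle loss: |F(G(x)) - x| + |G(F(y)) - y|, averaged over the batch."""
    forward = np.abs(F(G(x_batch)) - x_batch).mean()
    backward = np.abs(G(F(y_batch)) - y_batch).mean()
    return forward + backward

x = rng.normal(size=(8, 3))  # unpaired samples from the known domain
y = rng.normal(size=(8, 3))  # unpaired samples from the unknown domain
loss = cycle_consistency_loss(x, y)
```

In a real Cycle-GAN, G and F are convolutional networks trained jointly with adversarial discriminators, and this cycle term is one component of the total loss.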
As a result of the above, this proposal offers a path-planner generator for avoiding collisions by employing GANs. In addition, the architecture is intended as a substitute for interacting with numerous settings, in order to minimize design time and enable the use of smartphones and other devices with few resources. Consequently, the following hypothesis is investigated in this study: since the GAN permits the transfer of samples from one domain to another, it could infer the properties between domains to connect samples from a known situation with paths through an unfamiliar environment.
This work investigates and explains the influence of inferring features from a known scenario to an unknown scenario in order to approximate a generic solution for creating pathways in real time on low-cost devices.
The remaining sections of this manuscript are structured as follows. Section 2 describes the context and research gaps. The suggested work is introduced in Section 3. Section 4 illustrates the outcomes and analyses of the experiments. The conclusion is then stated in Section 5.

Background and Research Gaps
In the research field of autonomous intelligent systems, path planning is the problem of determining the optimum collision-free path between two points. Usually, the implemented architecture incorporates the perception and planning modules separately, forming a complex architecture to offer real-time performance [24,25]. Nevertheless, real-world performance is inefficient in terms of time and resources [26]. As a result, virtual simulators expedite the design process. Simulators such as AirSim help evaluate agents' dual systems [27]. Therefore, simulators generate samples of possible scenarios for training algorithms that require real-time perception based on ML techniques [28].
The perception of the environment requires specialized sensors, but current tools require previous interactions to be effective. For instance, ARCore requires the user to move the device to detect key features and build a 3D map of the environment, following the ORB-SLAM approach [29].
Once the system scans the scene, flat surfaces are detected, offering real-time interaction with the scenery; however, the system may position virtual elements in places that are difficult to access or that may cause accidents. For example, Figure 1 shows a virtual element on a flat surface, offering a limited experience in terms of interaction with the environment: the selection of the surface is random, generating some inconvenience in the experience. Moreover, such processing demands high computational performance, but mobile devices are limited in computing power and battery life. Therefore, AR and MR experiences require understanding the environment to determine the best place to display the virtual elements. Since specialized sensors are required and prior interaction with the environment should be avoided, different proposals have offered alternative solutions, such as replacing the sensor with ML algorithms [30]. Evidence of ML replacing specialized smartphone sensors is described in [31], where a depth sensor is accurately replaced using an image taken from the device. Likewise, another alternative is to employ simulators and ML approaches to connect a sample-limited environment with a known environment [32]. Thus, ML techniques provide novel features to current systems.
On the other hand, the computing demands on an embedded device are high in both time and resources, limiting the roll-out on generic devices. Therefore, the end-to-end approach reduces external elements such as sensors, since the perception of the scenario is considered within a simulation [33]. This approach reduces the conventional architecture and the additional sensors needed to generate the information required to perceive the environment.
Although considerable efforts have been made to create alternatives through ML techniques, a severe issue remains: current solutions focus on particular problems. In other words, a dataset is required for each environment. Since current solutions implement datasets based on specific training environments, we have observed that GANs have been used to perform feature blending between unknown domains through images [34,35]. Therefore, GANs can offer novel solutions by exploring domain change through style transfer between two domains.
The current work contributes an analysis of feature inference between an image acquired from an unknown domain and its translation to a controlled scenario, in order to generate a safe path, evaluate its behavior, and determine the reliability of deploying virtual elements in real-time augmented reality applications. Furthermore, the conventional architecture described in Figure 2a is composed of two main modules, perception and planning. For deployment on embedded devices, this conventional architecture is reduced to a single module, as Figure 2b illustrates.

Proposed Work
The present proposal takes the principal features of the cycle-GAN and pix2pix models, since these approaches generate a new sample according to the features of an unknown domain. Based on this approach, the first step is to define the available domain to generate the dataset with known paths connected to the unknown environment. The reference environment is known because it includes the depth data for each sample. In contrast, the depth value in the second scenario is unknown because the depth data are unavailable. As depicted in Figure 3, the virtual simulator generates paired samples, whereas the samples of the unknown scenario are unpaired because the relation between the data is unknown. In order to describe the performance of the experiment, four environments were defined. An environment represents a domain, since a domain contains elements with similar features. Therefore, each environment has different features, such as illumination, color distribution, and textures, as shown in Figure 4. The following analysis describes the performance for each domain according to features such as the average color and the dominant color obtained with the k-means approach [36]. These outcomes are described in Table 1. It is challenging to determine whether transferable and regular features exist in relation to the known environment, because unknown environment A behaves poorly: its dominant color tends to be black. Consequently, the interpretation of the data is limited to determining the degree of approximation to the expected environment. In order to describe the data, each image is plotted in a three-dimensional space; that is, each sample is converted into a 3D point. This analysis employs a deep convolutional neural network (DCNN) model to plot a sample in 3D space; Figure 5 describes its characteristics. For this case, the model initializes its variables arbitrarily.
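The average-color and dominant-color features used in the analysis above can be computed as follows. This is a minimal sketch with a small hand-rolled k-means over the pixels (the paper's exact k-means configuration is not specified; the toy image, `k=3`, and the iteration count are assumptions):

```python
import numpy as np

def average_color(image):
    """Mean RGB value over all pixels (image: H x W x 3)."""
    return image.reshape(-1, 3).mean(axis=0)

def dominant_color(image, k=3, iters=20, seed=0):
    """Dominant RGB color: center of the largest k-means cluster of pixels."""
    pixels = image.reshape(-1, 3).astype(float)
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest center.
        dists = np.linalg.norm(pixels[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned pixels.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    counts = np.bincount(labels, minlength=k)
    return centers[counts.argmax()]

# Toy image: mostly black pixels with a few bright ones, mimicking the
# near-black dominant color observed for unknown environment A in Table 1.
img = np.zeros((10, 10, 3))
img[:2, :2] = [200, 180, 160]
avg = average_color(img)
dom = dominant_color(img)
```

On this toy image the dominant color comes out near black even though the average color is slightly brighter, which matches the kind of ambiguity discussed for environment A.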
Consequently, the distribution of the elements in space varies based on the random values, and Tanh has been implemented as the transfer function because the data are distributed throughout several spatial regions (Figure 6a). According to Figure 6b, style transfer approximates the unknown sample to a known sample. This behavior shows that the GAN, through style transfer, positively affects the data between the two domains.

Due to the features of each domain depicted in Figure 6c, the unknown domains are distributed throughout the 3D space. Red, for instance, indicates the known environment, while green, blue, and cyan represent unknown environments. As shown in Figure 6d, the approximation from the unknown domains to the known domain is close, since the data concentration covers a similar spatial region after applying the GAN architecture. Similarly, the centroid of each domain and its distance from the known environment are determined. According to Table 2, the centroid of each domain is close to the known environment's centroid. Therefore, the GAN approximates the characteristics between unidentified spatial domains. This proposal adds a step to the GAN architecture once the features have been approximated: hierarchical clustering (HC) [37], where the closest sample is obtained based on the distance between the generated sample and the known samples. Since the accessible domain contains the depth information needed to determine a path, the generated sample must approximate a known sample. Furthermore, the rapidly exploring random tree (RRT) algorithm establishes the collision-free path [38], as shown in Figure 7. To generate a path, the calculus of variations describes the composition of a path [39] as the sum of the distances between two consecutive points, defined by Equations (1) and (2). Consequently, a path consists of a collection of lines in space. Likewise, one characteristic of a path is its capacity to avoid obstacles. Path planning seeks the shortest distance among m obstacles O and the best value with the highest level of collision freedom in a series of points p of length n. In this sense, the maximization problem is changed into a minimization problem by adding a negative sign to the objective value, as stated by Equation (3) [40].
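The RRT step and the path-length criterion can be sketched briefly. The snippet below is a minimal 2D RRT with circular obstacles, plus a path-length function computed as the sum of distances between consecutive waypoints; the workspace size, step length, and obstacle layout are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def collides(p, q, obstacles, steps=20):
    """Check the segment p->q against circular obstacles (center, radius)."""
    for t in np.linspace(0.0, 1.0, steps):
        pt = p + t * (q - p)
        for center, radius in obstacles:
            if np.linalg.norm(pt - center) < radius:
                return True
    return False

def rrt(start, goal, obstacles, step=0.3, iters=4000, seed=1):
    """Minimal RRT: grow a tree from start until the goal is reachable."""
    rng = np.random.default_rng(seed)
    nodes = [np.asarray(start, float)]
    parent = [0]
    goal = np.asarray(goal, float)
    for _ in range(iters):
        sample = rng.uniform(0, 4, size=2)  # random point in a 4 m x 4 m area
        i = min(range(len(nodes)), key=lambda j: np.linalg.norm(nodes[j] - sample))
        direction = sample - nodes[i]
        new = nodes[i] + step * direction / (np.linalg.norm(direction) + 1e-9)
        if collides(nodes[i], new, obstacles):
            continue  # only keep collision-free extensions
        nodes.append(new)
        parent.append(i)
        if np.linalg.norm(new - goal) < step and not collides(new, goal, obstacles):
            # Walk parents back to the start to recover the path.
            path, k = [goal, new], len(nodes) - 1
            while k != 0:
                k = parent[k]
                path.append(nodes[k])
            return path[::-1]
    return None

def path_length(path):
    """Sum of distances between consecutive points, as in the path definition."""
    return sum(np.linalg.norm(b - a) for a, b in zip(path, path[1:]))

obstacles = [(np.array([2.0, 2.0]), 0.5)]  # one circular obstacle mid-scene
path = rrt([0.5, 0.5], [3.5, 3.5], obstacles)
```

Because only collision-free extensions are added to the tree, every segment of the returned path avoids the obstacles by construction.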
The following step is to develop path generators based on ML using an end-to-end approach. Building on our previous study [41], we present two methods for developing a path-planning generator. The first method analyzes the input sample based on the distance between the centroid and each cluster level until a sample with the minimum distance is obtained, as shown in Figure 8a. Figure 8b describes the second approach, an autoencoder composed of two types of networks: a DCNN and a recurrent neural network (RNN). In this autoencoder, the DCNN encoder, whose features are shown in Figure 9, extracts a characteristic vector, and the RNN is responsible for producing a series of points in 3D space. Moreover, a vocabulary is necessary because the design was inspired by the image captioning algorithm [42]. The vocabulary consists of discrete samples of the three-dimensional space, where the 2 m space is divided into 20 cm steps. As a result, the vocabulary decreases the number of discrete samples on each side. Once discrete samples have been gathered, the frequency of each node is calculated over every trajectory. In this instance, at least 1000 potential values were reduced to 183. Consider that the total number of samples depends on the size of the discrete cell and that the generated samples vary based on the paths produced by the RRT algorithm. According to [43], a real-time system generates at least ten frames per second. In order to offer a real-time experience, the architecture must be simplified to be deployed on a mobile device. The transfer learning approach replaces a complex design with one with fewer features that solves the same problem [44]. As demonstrated in Figure 10, the GAN architecture with HC is simplified to a DCNN, since a smartphone can execute this type of model without difficulty.
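The vocabulary construction described above can be sketched as follows: 3D waypoints are snapped to 20 cm cells inside a 2 m workspace (10 bins per axis, so at most 1000 cells), and only the cells that actually occur in the RRT trajectories are kept, which is how the potential vocabulary shrinks (to 183 tokens in the paper). The cell-indexing scheme and toy paths below are assumptions for illustration:

```python
import numpy as np
from collections import Counter

CELL = 0.2   # 20 cm discretization step
SPAN = 2.0   # 2 m workspace per axis -> 10 bins per axis, 10^3 = 1000 cells

def to_token(point):
    """Map a 3D point (meters) to a discrete vocabulary token (cell index)."""
    idx = np.floor(np.asarray(point, float) / CELL).astype(int)
    idx = np.clip(idx, 0, int(SPAN / CELL) - 1)
    return tuple(idx)

def build_vocabulary(paths):
    """Collect the cells observed across all paths, with their frequencies."""
    return Counter(to_token(p) for path in paths for p in path)

# Toy trajectories: only occupied cells enter the vocabulary.
paths = [
    [(0.1, 0.1, 0.0), (0.3, 0.3, 0.0), (0.5, 0.5, 0.2)],
    [(0.1, 0.1, 0.0), (0.3, 0.5, 0.0)],
]
vocab = build_vocabulary(paths)
```

The RNN decoder then predicts a sequence over this reduced token set rather than over raw continuous coordinates, analogous to word prediction in image captioning.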
The methodology is summarized in Figure 11, which depicts the implementation of a lengthy process on an embedded device using unpaired samples from two environments. The paths provided by the RRT algorithm in a known environment are described in Figure 11a. Figure 11b depicts how the following step constructs the path using HC and the autoencoder. Likewise, Figure 11c uses the GAN style generator to convert the characteristics across the two domains, employing the HC approach to approximate a known path. In order to use the design on embedded devices, Figure 11d implements the transfer learning approach, which describes the 200 samples of the optimized architecture saved in an unknown environment. Figure 11e illustrates the relationship between an unfamiliar and a known path for unpaired samples. In addition, an augmented reality system for smartphones is proposed to display the behavior of path generation in real time. The performance of this proposal is evaluated based on the behavior in creating an expected vector by a machine learning model and on the collision-free coefficient that characterizes the viability of the generated path.

Experimental Phase and Analysis
The proposed architecture was implemented on a g4dn.xlarge instance in Amazon Web Services (AWS) with the following specifications: 4 vCPU Xeon 8259CL 2.5 GHz, 16 GB RAM, 125 GB SSD storage, and an NVIDIA Tesla T4 GPU with 320 Tensor Cores and 16 GB RAM. The architecture was implemented in TensorFlow 2.4. Training time was 1 h and 10 min. The smartphone implementation employs TensorFlow Lite. During the experimental phase, 300 samples were used for the GAN architecture, and 50 unknown samples were used for the experiment. There are three distinct categories of physical objects in the surrounding area: two are chairs of various colors and sizes, and the third is a table. The Microsoft Kinect V1 has a minimum perception range of 40 cm when configured for a 4 m range [45]. Therefore, objects must be larger than 40 cm to be perceived at a range of up to 4 m. Consequently, both flat and curved surfaces are included to measure data consistently.
The RRT algorithm returns a vector representing each sample's path. The following experiment compares the autoencoder's and HC's performance between known and created vectors. The behavior is based on the Euclidean distance (Equation (4)), the Manhattan distance (Equation (5)), and the cosine similarity (Equation (6)) used to examine the difference between the predicted vector x and the generated vector y. The free collision coefficient of Equation (7) quantifies whether or not at least one node generates an inadequate path.
cosine similarity = (x · y) / (‖x‖ ‖y‖) (6)

The experiment assesses the behavior in four contexts to develop a path. Furthermore, the suggested method consists of the original model and the transfer learning (TL) method, with 50 samples for each model. Comparing the created and predicted vectors, Table 3 illustrates the behavior of path generation. A proposed collision-free coefficient describes the behavior, because the coefficient decreases when a collision along the path becomes likely. In addition, distance-measuring technologies such as ARCore and the Kinect sensor have been used as references. It is crucial to mention that the ARCore and Kinect sensor measurements were conducted only once. According to the statistics, the model with the best performance is the HC with transfer learning. Given the HC approach's characteristics, it approximates a near-optimal solution because the training samples are sufficient for the scenario features. However, this algorithm lacks efficiency because more samples are required when the scenario expands. On the other hand, it has been observed that the use of TL reduces the characteristic-vector resolution for each sample because the vector is normalized, and the change in word size directly influences path generation. The autoencoder with TL exhibits this behavior because a fluctuation in the model's word size is another factor that causes the error to grow; an alternative option is to increase the number of samples when performing inference with the TL model. ARCore and the Kinect sensor offer the best performance, but integrating these technologies is challenging because a path must be established online and requires additional considerations, for example, previous environment exploration for ARCore and objects located at least 40 cm away for the Kinect sensor.
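The comparison metrics of Equations (4)-(6) are standard and can be written compactly. Equation (7) is not reproduced in this text, so the collision-free coefficient below is a hedged interpretation (the fraction of path nodes clear of every obstacle); its exact form in the paper may differ:

```python
import numpy as np

def euclidean(x, y):
    """Equation (4): straight-line distance between vectors."""
    return np.linalg.norm(x - y)

def manhattan(x, y):
    """Equation (5): sum of absolute coordinate differences."""
    return np.abs(x - y).sum()

def cosine_similarity(x, y):
    """Equation (6): (x . y) / (||x|| ||y||)."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def collision_free_coefficient(path, obstacles, radius):
    """Assumed reading of Eq. (7): fraction of path nodes farther than
    `radius` from every obstacle (1.0 means a fully collision-free path)."""
    free = sum(
        all(np.linalg.norm(p - o) > radius for o in obstacles) for p in path
    )
    return free / len(path)

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])

# Toy path with one node too close to an obstacle.
demo_path = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
demo_obstacles = [np.array([0.1, 0.1])]
coef = collision_free_coefficient(demo_path, demo_obstacles, radius=0.5)
```

With one of two nodes inside the obstacle radius, the coefficient drops to 0.5, illustrating how the metric penalizes likely collisions.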
Since HC has the best performance in this experiment, it has been implemented on a Moto X4 smartphone with a Qualcomm Snapdragon 630 processor. In addition, the proposal employs the PointCloud function of ARCore to determine when the input data are updated. Thus, the path is formed when the features of the current image are relevant to the new image, obtaining a maximum sampling rate of 30 frames per second on the test device. Figure 12 illustrates a generated 2 m path devoid of collisions; observe that the route avoids obstacles in the environment. The test evaluation was conducted in two unknown environments using an AR tool. The path avoids physical elements, implementing a domain change through style transfer. Once the system displays virtual elements, a user experiment describes the performance, employing a route of virtual coins to guide the users and evaluate their interaction with the physical world. Figure 13a describes an unsafe distance between barriers that could result in an accident. Figure 13b depicts the safe location of virtual objects to avoid traversing the obstruction. Real-time deployment of virtual elements in a physical environment is depicted in Figure 14; the position of each coin that composes a virtual path prevents potential collisions. According to [46], there are three important factors to consider when evaluating the performance of a 3D application: the participation of representative users, the environment of the evaluation, and the sorts of findings generated. The evaluation is based on the user's interaction with the proposed system. Eight users conducted four system evaluations in this investigation, and their behavior was monitored. Table 4 details the number of potential collisions between each user and the obstruction. Over time, each user adjusts to the system based on the outcomes of their user experience.
When using the system for the first time, the user typically mistrusts the location of the virtual elements and maintains a keen awareness of the real world. In other words, the user looks over the mobile device out of concern for a potential collision. The experiment demonstrates how individuals adapted to a system with a high number of potential collisions: each user initially had at least three collisions, although in some cases the count decreased to zero or one. Consequently, the user focuses on the device's screen, reducing the fear of colliding with an obstruction, and the system enhances the safety of traveling in a controlled environment.
Therefore, this study presents an alternative technique for providing a path generator for displaying virtual elements with minimal information, which can be implemented in uncertain environments. Likewise, a novel alternative has been established for deriving a general solution from restricted information, avoiding the need for environment-specific data.

Conclusions and Future Work
This research presents a way to resolve the problem of building unpaired datasets for indoor exploration with augmented reality applications on limited devices using style transfer. This proposal provides an alternative to expanding the limits of specific datasets tied to specific solutions. Style transfer enables connection to unknown domains based on a known environment, which contains the generated paths, and the GAN approximates a potential solution according to shared features. Furthermore, it is crucial to note that this method delivers a real-time understanding of the physical world, because the virtual elements are displayed in a safe location. Due to this behavior, this approach expands the potential implementations of GANs.
Although technologies such as ARCore and the Kinect sensor have been utilized to evaluate the possibility of collisions, these technologies require the online execution of a path-planner generator. Consequently, their development and implementation time is costly, although their precision is superior. However, this proposal helps reduce the number of external sensors and avoids knowing the environment in advance, two of the most significant limitations of the aforementioned technologies. In addition, the experience is real-time, and the most innovative aspect is the execution on constrained devices such as smartphones.
According to user experience, the user improves with time while learning to concentrate on the immersive experience without relying on prior knowledge of the environment. However, additional concerns have emerged, such as the minimum number of available samples required to ensure optimal behavior and how this type of solution compares to physical sensors. On the other hand, the concept of meta-learning could supplement this work by enabling the creation of a generic solution for scenarios that share similar features without the need for several samples of the same objects.
The objective of future work will be to address the challenges raised by the introduced work and to supplement this proposal with other methods, such as meta-learning. Meta-learning might be applied to fine-tune potential global solutions whose characteristics are adequate to mitigate the current disadvantages of the present effort.