Deep Reinforcement Learning-Based DQN Agent Algorithm for Visual Object Tracking in a Virtual Environmental Simulation

The complexity of object tracking models in hardware applications has made tracking a more demanding task, one that requires multifunctional algorithms able to cope with a variety of unpredictable environmental tracking conditions. Experimenting with a realistic virtual simulator brings new dependencies and requirements, which may cause problems with runtime processing. The goal of this paper is to present an object tracking framework that differs from the most advanced tracking models by experimenting in a virtual environment simulation (Aerial Informatics and Robotics Simulation, AirSim; City Environ) using one of the deep reinforcement learning models, namely the Deep Q-Learning algorithm. Our proposed network examines the environment using a deep reinforcement learning model to regulate activities in the virtual simulation environment and utilizes sequential pictures from the realistic VCE (Virtual City Environ) model as inputs. Subsequently, the deep reinforcement network model was pretrained using multiple sequential training image sets and fine-tuned for adaptability during runtime tracking. The experimental results were outstanding in terms of speed and accuracy. Moreover, we were unable to identify any results that could be compared to the state-of-the-art methods that use deep network-based trackers in runtime simulation platforms; the testing experiment was therefore conducted on the two public datasets VisDrone2019 and OTB-100, where the model achieved better performance than the compared conventional methods.


Introduction
Visual object tracking is a classic computer vision problem that entails detecting an object in a scene and distinguishing it from other objects in each frame. In sequential frames, it might be static or dynamic. To our knowledge, there are a slew of fundamental challenges to visual object tracking, including occlusion, motion blur, background clutter, ambient illumination fluctuations, etc. To address these issues, the most common tracking approaches [1][2][3][4] monitor specific object classes utilizing a variety of feature learning algorithms. However, several other tracking techniques offer great efficiency and competitive outcomes when compared to the state-of-the-art object tracking models. Nevertheless, they still have constraints that need to be addressed in order to achieve high accuracy and quicker tracking performance under difficult environmental circumstances.
There are certain tracking filters [5,6] and object-detection-based models [7,8] that may be competitive alternatives to classic methods, but they still have drawbacks when compared to deep network-based tracking strategies.
Despite the relative success of conventional tracking approaches [9][10][11], deep convolutional neural network (Deep CNN)-based visual object tracking models [12][13][14] have gained popularity in recent years. The popularity and advantages of CNN-based trackers may be explained in two ways: firstly, by their tracking robustness, and secondly, by the highly efficient feature representations located inside their detection units. In the majority of target tracking situations, pretrained CNN classifiers were employed [11,15,16] for finding objects and classifying them, or cropping and regressing methods [9,10,17]. However, the disparity between the CNN-based feature representation for classification and the tracking algorithm has an impact on the output of the final tracking results. Furthermore, the pretrained CNN classifier fails to function well in the challenging environment of the tracking process, where the captured spatial features are not explored thoroughly during training.
Nonetheless, tracking by detection method has been proven to be effective in numerous difficult tracking circumstances [11,18,19] where CNN was used in training and superior tracking results were finally obtained compared to standard target trackers. Unfortunately, in the event of a visually crowded scene with numerous occluded frames and a small distance of correlation, targeted items may be missed. The basic strategy and goal of CNN-based models relies on a target classification approach. In addition, an object classification model will be used to address a backdrop cluttering problem. Consequently, there is a necessity to investigate the ability of deep learning models to automatically learn effective features in a virtual environment that uses drone agents with spatial and temporal constraints. In particular, it should be considered to make a long-short term feature learning strategy and classify objects into identical target classes with the proposed end-to-end model.
The motivation for this study was to develop an object tracking algorithm capable of learning and tracking the target using deep reinforcement learning and an artificial intelligence network model in a wide variety of skills and abilities that can compete with other models in complicated tasks. In this study, we created a novel tracker that was based on a sequential recurrent neural network [20,21] prediction and tracking architecture. The deep reinforcement learning-based Q-network agent tracker was integrated with the high-dimensional cross-platform virtual simulator for drones called AirSim.
For testing reasons, we built our unique tracking model using the AirSim [22] (Aerial Informatics and Robotics Simulation) simulator platform. The platform enables the algorithm to be evaluated in a realistic virtual environment that includes elements such as pedestrians, automobiles, trees, street signs, buildings, and weather conditions. Figure 1 shows the virtual CityEnviron model scenario with a virtual AirSim drone from the Microsoft AirSim v1.2 version. The inspiration and idea for our proposed model came from deep reinforcement learning for human-level control [23], which uses a specific architecture in conjunction with a deep convolutional network [24] with hierarchical layers of tiled convolutional filters integrated alongside artificial neural networks in order to learn concepts such as object categories directly from raw sensory data.
Our proposed deep reinforcement learning and tracking technique is thought of as a sequential feature learning/prediction and decision-making technique for drone agents to monitor actions' rewards through environment sequence and to acquire a tracking output of the recommended methodology. Traditionally, we estimated the adaptive action-value function using a deep sequential neural network [25] such as

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\ a_t = a \right] \tag{1}$$

where the equation estimates the expected total returns, also known as the sum of rewards $G_t$, beginning from a fixed location state $s$ and performing an action $a$ in accordance with some policy $\pi$.
Reinforcement Learning (RL) algorithms are mostly based on the above-mentioned Equation (1), which estimates the best value function from sufficient experience to obtain the excellent value estimation. Our proposal utilizes a Q-learning algorithm that uses an action-value function with particular parameters to learn and obtain the best action-value by repeating the learning process.
Basically, reinforcement learning is known to be unstable when a nonlinear function approximator such as a neural network is used to represent the action-value (known as Q) function given above. The $Q^{\pi}(s, a)$ function above estimates the expected total returns obtained by following some policy $\pi$, where $\pi(a \mid s)$ is the policy that maps from state observation $s$ to action $a$. Typically, stochastic policies are used; however, deterministic policies can also be specified. Most of the useful RL algorithms rely heavily on the above value function: once it has been estimated from sufficient experience, the optimal policy $\pi^{*}$ can be found in every state by taking the greedy action that leads to the state with the highest value estimation. The process is applied continuously for every state of the environment as an iterative update.
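As a minimal sketch of the two ideas above (the return $G_t$ and the greedy action), the following Python fragment may help. The function names, the example action labels, and the use of a discount factor `gamma` in the return are illustrative assumptions, not the implementation used in this work.

```python
from typing import Dict, Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Sum of rewards G_t from step t onward.

    With gamma = 1.0 this is the plain sum of rewards; a gamma below 1
    (an assumption here) weights later rewards less.
    """
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def greedy_action(q_values: Dict[str, float]) -> str:
    """Return the action whose estimated Q-value is highest."""
    return max(q_values, key=q_values.get)


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    # Hypothetical action labels for a drone agent.
    print(greedy_action({"up": 0.1, "left": 0.7, "stop": 0.3}))
```

The greedy choice is only as good as the Q estimates behind it, which is why the value function has to be learned from sufficient experience first.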
The rest of this article is organized as follows. Section 2 will give an overview of deep network-based object tracking descriptions and related work. In Section 3, we describe the deep reinforcement learning-based DQN agent drone tracking algorithm integrated with the Virtual CityEnviron (VCE) simulation platform. Experimental simulation results for verifying our algorithm are presented in Section 4 and our conclusion follows in Section 5.

Visual Object Tracking
Visual object tracking has gained more attention in computer vision in the last decade than ever before. There have been numerous successful studies on various tracking benchmarks [18,19,26]. Classification-based trackers have also been proposed, which may be referred to as tracking-by-detection or tracking-by-classification [27][28][29][30][31]. These techniques primarily focus on separating targets from the scene by collecting target locations and detecting them using trained classifier models. To be more specific, a tracker captures foreground patches close to the target position and background patches from a distance, which are then trained as a foreground-background classifier to score the current or next frame's target's location in order to recognize it.
Robust MIL-based feature learning [12] and tracking-learning-detection adopted for unsupervised learning [14] techniques were presented to improve the resilience of tracking models in noisy environments. In general, the classification model is trained offline using manually labeled pictures before being utilized for online or real-time tracking operations. Numerous neural network-based trackers utilize these concepts [11,16,32,33] throughout the development of their approach, and they provide effective outcomes when compared to classic trackers [12,13,34] and achieve state-of-the-art results [11,26]. The concept of using correlation filters to resolve the inadequate representation of convolutional and handcrafted features is retained [5,15,35]. However, due to the fact of learning through a limited number of scale-wise filters and a relatively small number of feature extraction video frames, those methods are not fully functional, resulting in the loss of critical long-term feature representation and temporal information between two or several consecutive frames.

Regression-Based Trackers
In recent years, some researchers have developed a deep regression network-based tracking approach [10,36,37] that uses bounding box regression to identify objects instead of classification models. This technique improves the chances of solving the tracking issue by training the model on trainable datasets using a loss function such as mean-squared or mean-absolute error. The datasets usually consist of the input pictures and the bounding boxes of the frame's object classes. However, if the object moves very quickly across consecutive frames or if occlusion occurs, this approach may fail or produce poor results during the tracking process. Furthermore, to fully cover the characteristics and information of consecutive frames, the aforementioned approaches require a more efficient search algorithm for learning the dataset or environment, such as sliding-window or candidate-sampling strategies.

Recurrent Neural Network-Based Tracking
Furthermore, recurrent neural network-based research [36][37][38] has advocated employing recurrent layers in order to solve the visual object tracking issues. They employ an RNN-based structure combined with an attention mechanism. RNN-based tracking models have not been proven competitive yet on contemporary benchmarks; nevertheless, this technique can obtain higher results by using sequential layers to anticipate objects using the sliding method.
There is also a work by Ning et al. [39] that combines spatially supervised recurrent convolutional neural network integration with the YOLO network [40] architecture for directly detecting object classes. The recurrent neural network will directly regress the YOLO detection output on each frame to retrieve the targeted objects class.

Deep Reinforcement Learning
Reinforcement learning (RL) is a training approach in which a machine learning model makes judgments in a series of actions while managing the process. It investigates how an agent may learn to make decisions and attain goals in a complex and unpredictable environment. Essentially, this method learns the optimal policy for deciding which sequential acts to perform by maximizing future benefits [41]. Recent popular works [42][43][44][45] propose the combination of RL models with deep neural networks in order to improve decision making by representing RL techniques as a policy or value function where the model learns the process interactively from feedback rewards, to improve expected rewards in long-term sequential processes by learning best policy. Several techniques have proposed recovering deep features by learning particular policy tasks [46], which have been evaluated by applying them to Atari games [42] and other [44] approaches, which have been successfully addressed by using a semi-supervised learning methodology. In addition, many models for object localization [47], prediction, and target tracking [48] have been suggested. By integrating CNN, RNN, and RL algorithms, Zhang et al. [49] proposed a network design that was particularly well-suited for addressing the tracking issue. They used RNN as a top-level CNN feature extraction, focusing on both spatial and temporal restrictions. In addition, the framework was trained in offline mode utilizing an end-to-end reinforcement method.
Deep Q-Networks (DQN) and the policy gradient method are the most well-known deep RL algorithms [23,42]. The Deep Q-Network is a variant of the Q-learning algorithm that learns the value of each action in a given state. It is a model-free method that can handle problems with stochastic transitions and rewards without requiring any modifications. In addition, various DQN-based designs have been developed, such as Double DQN [50] and Dueling DQN [51], which improve the stability of the learning and tracking process.
Another line of research is the policy gradient methodology, which learns the policy directly by applying gradient-based optimization to the network policy with respect to the projected future reward. Williams et al. [46] proposed the REINFORCE method, which used a simple and quick reward to measure the policy's value.

Deep Reinforcement Learning-Based DQN Agent Drone Algorithm for Visual Object Tracking in a Virtual Environmental Simulation Platform
In this study, we propose a network that is connected to a virtual environment in order to obtain a runtime sequence of video frames and locate targeted objects in each image of the episode. The basic novelty of our algorithm is that it uses one of the most well-known RNNs for learning and predicting from long-term temporal sequences, integrated with action decision techniques inspired by the successful work [52]. RNNs make it possible to model the temporal dynamic behavior of the VCE simulation platform by connecting nodes in a directed or undirected graph along a temporal sequence. An integrated deep reinforcement learning network can control actions by using RNN-based training sequences for making action decisions as an output for an object tracking procedure. Figure 2 depicts the overall framework of the proposal, which demonstrates the integrated scheme of the AirSim simulation platform with a virtual environment. The AirSim simulation platform includes two vehicle types, Multirotor (Drone) and Car, with which to test the approach by connecting through the AirSim Python client, as shown in Figure 2. This DQN model discovers the VCE model relatively quickly while exploring and starting to learn the tactics of UAV agents. In this paper, we use the structure in Figure 2 to learn sequential video frames by controlling policies in a variety of simulated environmental conditions. This is achieved by receiving input images and using them as input values for learning, as well as for value estimation. The Environment Simulation code in our model connects the VCE and the DQN agent simulation network algorithm with tracking via the AirSim Python client to control the drone simulation during training and tracking. The AirSim API makes it possible to run the algorithm on the VCE simulation platform and test it easily without any physical hardware system.

DQN Network Architecture
The recommended network model architecture incorporates reinforcement learning and a recurrent neural network method. The recurrent network is effective for applying sequential circumstances and for forecasting environmental target attributes. Figure 3, given below, illustrates the network structure of the learning procedure. The DQN network model processing steps with targeted action and state values of the learning procedure are depicted in the diagram below. The taken action and state values will be stored for the next step of training operations as an initial input value. Additionally, preserving learned information allows drone agents to avoid unnecessarily surveying the same location twice. This action may appear to be a repeated procedure, but in a virtual environment, the action and state value will be identical to the preceding one.
The Q-network value is updated at each iteration as

$$Q^{new}(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{2}$$

where $Q^{new}(s_t, a_t)$ is the result of the updated Q-network iteration value; $Q(s_t, a_t)$ is the initial old value to which the update is added; $\alpha$ is the learning rate of the training network; $r_t$ is the reward value; $\gamma$ is the discount factor, which relates the rewards to the time domain; and $\max_{a} Q(s_{t+1}, a)$ is the optimum future value estimation. The updated iteration value is formed by adding to the previous $Q(s_t, a_t)$ value the learning rate $\alpha$ multiplied by the temporal difference, i.e., the difference between the intended new value and the old value. The max in this equation represents the maximization over the actions available to the agent in the next state of the VCE. Suppose our drone agent was in state $s_t$ and took some action $a_t$. Because of that action, the environment might land our drone agent in any of the states $s_{t+1}$, and from these states it would maximize over the actions: the drone agent will choose the action with the maximum Q value, $\max_{a} Q(s_{t+1}, a)$. The intended new value is calculated by multiplying the discount factor $\gamma$ by the maximum Q value and adding the reward $r_t$.
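The update rule described above can be sketched in tabular form as follows. This is an illustrative simplification: the proposed model approximates Q with a neural network rather than a table, and the function name, state labels, and default values of `alpha` and `gamma` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def q_update(q: QTable, state, action, reward: float, next_state,
             actions: Sequence, alpha: float = 0.1, gamma: float = 0.99) -> float:
    """Apply one Q-learning update and return the new Q(state, action)."""
    # Optimum future value estimate: max over actions in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Intended new value: reward plus discounted best future value.
    target = reward + gamma * best_next
    # Move the old value toward the target by a fraction alpha.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


if __name__ == "__main__":
    q: QTable = defaultdict(float)   # unseen entries default to 0.0
    actions = ["up", "down"]         # hypothetical action labels
    q[("s1", "up")] = 2.0
    print(q_update(q, "s0", "up", 1.0, "s1", actions, alpha=0.5, gamma=0.5))
```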
Appl. Sci. 2022, 12, 3220
In our DQN-network concept, the DQN agent contains a replay memory class unit that keeps track of the environment in dynamic mode. All state ($s_t$), action ($a_t$), new state ($s_{t+1}$), reward ($r_t$), and done transitions are memorized. This replay memory approach allows us to effectively sample minibatches from preserved values and generate an accurate state representation. In order to obtain the best results from memorized values and keep track dynamically, a buffer memory is integrated with the VCE simulation platform. Figure 4 illustrates the construction of the replay memory unit, where the storage monitoring process is shown. The buffer memory is directly connected to the virtual environment, which adds the required transition to the memory unit. During the sampling process, a random number of map indices of varying sizes are produced in memory, and the returned indices may be retrieved using the "get state" function of the AirSim Python client. Minibatches are generated by using the last saved values from the training process. Replay memory, together with the separate target Q-network, is one of the most critical core components of the DQN agent; removing either has a negative impact on performance.
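A replay memory of the kind described above might look like the following minimal sketch. The class and field names are hypothetical, and uniform random sampling from a fixed-capacity buffer is an assumption borrowed from the standard DQN formulation [23]; the exact sampling scheme used with the VCE platform may differ.

```python
import random
from collections import deque, namedtuple

# One stored transition: state, action, reward, new state, and done flag.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayMemory:
    """Fixed-capacity buffer of transitions (hypothetical helper class)."""

    def __init__(self, capacity: int):
        # A deque with maxlen silently discards the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done) -> None:
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement (an assumption, see lead-in).
        return random.sample(list(self.buffer), batch_size)

    def reset(self) -> None:
        self.buffer.clear()

    def __len__(self) -> int:
        return len(self.buffer)


if __name__ == "__main__":
    memory = ReplayMemory(capacity=3)
    for step in range(5):
        memory.add(step, 0, 1.0, step + 1, done=(step == 4))
    print(len(memory), memory.sample(2))
```

Sampling from stored transitions rather than training only on the most recent frame breaks the correlation between consecutive samples, which is one reason replay memory helps stability.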
As shown in Figure 4, the data flow diagram for the DQN-network model connects a Q-network and a recurrent network design (network for prediction) via the AirSim Python client to a virtual simulation platform. The accumulator keeps track of the N frames to be utilized for agent assessment. We can also compute the DQN loss function by combining forecasted and targeted Q-values, and we can obtain gradient loss output as well.
The DQN network uses the following loss function for updating Q-learning at iteration $i$ [23]:

$$L_i(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\left[ \left( r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta_i^{-}) - Q(s_t, a_t; \theta_i) \right)^2 \right] \tag{3}$$

where $\gamma$ denotes the discount factor value of the agent's horizon, $\theta_i$ are the parameters of the Q-network at iteration $i$, and $\theta_i^{-}$ are the network parameters utilized to compute the target at iteration $i$. The target network parameters $\theta_i^{-}$ are only updated with the Q-network parameters ($\theta_i$) at each defined step and are maintained constant between individual updates.
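The loss above can be illustrated with a short sketch that averages the squared temporal-difference error over a minibatch. Here `q_online` and `q_target` are hypothetical callables standing in for the Q-network and the target network, and dropping the bootstrap term on terminal (`done`) transitions is an assumption taken from common DQN practice rather than from the text.

```python
from typing import Callable, Sequence, Tuple

# (state, action index, reward, next state, done flag)
Transition = Tuple[object, int, float, object, bool]
QFunction = Callable[[object], Sequence[float]]  # state -> Q-value per action


def dqn_loss(batch: Sequence[Transition], q_online: QFunction,
             q_target: QFunction, gamma: float = 0.99) -> float:
    """Mean squared error between target and predicted Q-values."""
    total = 0.0
    for state, action, reward, next_state, done in batch:
        if done:
            # Assumption: no future value is added after a terminal step.
            target = reward
        else:
            # The target uses the frozen target-network parameters.
            target = reward + gamma * max(q_target(next_state))
        td_error = target - q_online(state)[action]
        total += td_error ** 2
    return total / len(batch)


if __name__ == "__main__":
    # Toy stand-ins for the two networks, with two actions each.
    online = lambda s: [0.0, 1.0]
    target = lambda s: [2.0, 4.0]
    batch = [(0, 1, 1.0, 1, False), (0, 0, 1.0, 1, True)]
    print(dqn_loss(batch, online, target, gamma=0.5))
```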
The underlying buffer with N previous states is stacked along the first axis in the replay memory unit and added to the state preservation. Furthermore, by using the reset function, the whole memory unit is reset to the underlying buffer, with all indexes set to zero (0).

Deep Q-Agent with Tracking Unit
We propose a DQN agent model that learns by utilizing a recurrent neural network model, whereas the authors in [23] adopted a convolutional neural network model. In our work, the recurrent layers are used to learn the features and information about the virtual simulation environment, while the method proposed in [23] used the convolutional neural network layers for that purpose. In our model, an additional featured approach is used to determine the eventual consequence of the final action value. In the sequential environment learning process, the recurrent layers provide accurate predictions of state and action values with generalized data created from policies. Figure 5 illustrates a schematic representation of the DQN agent learning architecture integrated with tracking based on recurrent neural networks. In the first step of implementation, we configure the parameters of the DQN agent model with an initial setup, where an action value model is used by the agent to interact with the virtual simulation environment. A target model is used to compute the target Q-values in training, and is updated less frequently to increase the stability of the learning procedure of the DQN agent. The network model is built in sequential mode by applying an LSTM layer followed by a densely connected layer that changes the dimensions of the vector from 64 to 32 with the ReLU activation function. The network outputs are state, Q, and action values, which provide the drone agent with crucial information. The set of action Q-values allows the agent to select the next action to perform with regard to the current state of the environment. Subsequently, the drone agent's next targeted action is chosen according to the predicted feature information of the trackable objects, which is produced by the LSTM network and stored in a long short-term memory unit. Correspondingly, the drone agent takes the action with the best targeted Q-value while tracking objects over time.
The network model's observation unit allows the agent to observe the Q-values obtained by applying the action functions to the old state. When an episode is finished, the process is reset in the network's short-term memory, which summarizes the outcome of the single exploration episode and of the learning procedure and appends the summaries to the network's long-term memory.
The output summary of network training includes the total rewards, average MaxQ, duration, average loss, and timesteps of the full training episodes. The action Q-values correspond to different types of movement for tracking objects along different trajectories. The signs given in the network model represent the direction of the drone agent's interaction with object locations and movement. For example, the signs on the flow chart represent (starting from the right side) up, down, left, right, two times up and down, two times left and right, resizing the object on the inner and outer side, and stopping action Q-values, respectively. The targeted action Q-values shown in Figure 5 are taken after predicting the relative object locations.

Training the DQN Network Model
The training process allows the agent to train itself to better understand the environment dynamics and compute the expected rewards for the next state (t + 1). Additionally, it allows the agent to update the expected reward at step t according to the first training episode outcomes of the network model. The target expectation is computed through the targeted Q-values of the actions, which is a more stable version of the action value for increasing training stability. In practice, the target network is a frozen copy of the action value network that is updated at regular intervals. After the training process, the network clips all positive rewards at 1 and all negative rewards at −1, leaves 0 rewards unchanged, trains the network again for the next episode, updates the target network, and eventually saves the network output files into a fixed path. There is an extension of the train function that calls the batch generation and graph computations for sampling random minibatches of transitions from the replay memory unit. We set the hyperparameters and their default values for the deep Q-agent before training the network. At the end, we summarize all the learned values of total reward, average MaxQ, duration, and average loss, and save them into the adjusted path location.
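Two of the steps above, reward clipping and the periodic refresh of the frozen target network, can be sketched as follows. The function names, the representation of network parameters as a plain dictionary, and the `sync_every` interval are illustrative assumptions; the text does not state the interval that was used.

```python
import copy
from typing import Dict


def clip_reward(reward: float) -> float:
    """Clip positive rewards to 1, negative rewards to -1, keep 0 unchanged."""
    if reward > 0:
        return 1.0
    if reward < 0:
        return -1.0
    return 0.0


def maybe_sync_target(step: int, online_params: Dict, target_params: Dict,
                      sync_every: int = 1000) -> bool:
    """Copy the online parameters into the target network every sync_every steps.

    Returns True when a copy was made. Between copies the target
    parameters stay frozen.
    """
    if step % sync_every != 0:
        return False
    target_params.clear()
    # Deep copy so later online updates do not leak into the frozen target.
    target_params.update(copy.deepcopy(online_params))
    return True


if __name__ == "__main__":
    print([clip_reward(r) for r in (5.2, -0.3, 0.0)])
    online = {"w": [1.0, 2.0]}   # hypothetical parameter container
    target = {"w": [0.0, 0.0]}
    print(maybe_sync_target(4, online, target, sync_every=2), target)
```

Clipping keeps the scale of the error signal comparable across episodes, at the cost of the agent no longer distinguishing between rewards of different magnitude.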

Training Dataset
To train the RNN-based object prediction and DQN network modules, we created our own dataset from the VCE simulation model. We captured 727 images for training and 195 images for testing. All images were manually labeled so that they could be used in the training and testing processes. The images contain two types of objects: pedestrians and cars. Overall, 922 images were used for training and testing, which we considered sufficient to produce accurate outputs for analyzing proposals. The training parameters were fine-tuned over several different values while the algorithm was examined under changing conditions. Additionally, the tracking algorithm was tested at runtime on the simulation platform to observe the tracking performance.

Tracking Baseline of the DQN Network Model
In the tracking implementation, we used a supervised approach to identify object classes, information, and properties, and combined it with the DQN agent simulation network architecture. When testing the proposed tracker, the tracking approach employed a pretrained object classifier model to detect targets in the virtual simulation platform. Our proposed tracker was implemented by integrating a DQN network, acting as a sequential decision-making procedure, with a drone agent in the VCE simulation platform. The observation part of the network model represents the virtual environment sequences with recurrent network-based layers and outputs the predicted bounding box locations in each frame, as shown in Figure 5. We first trained our network to predict target classes with the proper action in order to provide environment states. The network was then updated with a deep Q-learning approach so that the drone agent continued to learn effectively from high-dimensional virtual simulation environment inputs via end-to-end reinforcement learning.
The network outputs its prediction from the learning process, and we integrated the tracking units by calculating the intersection over union (IoU) of bounding boxes. The origin of the Cartesian coordinate system was placed at the center of the frame, with the rightward and upward directions taken as the positive axes. The coordinates of the intersection rectangle were then determined by taking the maximum and minimum values of the box coordinates. The intersection of two axis-aligned bounding boxes is itself always an axis-aligned box. Next, we computed the areas of both axis-aligned bounding boxes. The IoU was calculated by dividing the intersection area by the sum of the prediction and ground-truth areas minus the intersection area. The result is asserted to be a value between zero and one. The next step was to interpret the action sequence in the simulation environment and to apply the tracking calculation to the interpreted action sequence.
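The IoU calculation described above can be written compactly. The sketch below assumes boxes are given as (x_min, y_min, x_max, y_max) tuples, which is a representation chosen here for illustration; the function name `iou` is likewise an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle: maximum of the lower corners, minimum of the upper corners.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)

    # Width and height are clamped at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    if union <= 0:
        return 0.0

    value = inter / union
    assert 0.0 <= value <= 1.0  # the result must lie between zero and one
    return value
```

For example, two 2 x 2 boxes offset by one unit in each direction overlap in a 1 x 1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.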
The reward function was calculated as a scaled sum of three terms: the Euclidean distance between the center of the bounding box and the center of the frame, the intersection over union of the bounding boxes, and a term based on an imaginary centered box parameterized by a threshold height and width. The completion status is determined from the reward values taken at predetermined intervals. The final stage was to create a reinforcement agent, configure the specified parameters, and test the algorithm using inputs from the virtual environment simulation model.
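One possible form of such a scaled-sum reward is sketched below. The text does not specify the scaling factors, the sign of the distance term, or how the imaginary box enters the sum, so everything beyond the three named ingredients is an assumption made for illustration: the weights `w_dist`, `w_iou`, and `w_box`, the normalization of the distance by the half-diagonal of the frame, and the choice to grant a fixed bonus when the target center lies inside a frame-centered box whose size is a fraction (`thresh_w`, `thresh_h`) of the frame.

```python
import math


def tracking_reward(box, frame_size, iou_value,
                    thresh_w=0.2, thresh_h=0.2,
                    w_dist=1.0, w_iou=1.0, w_box=0.5):
    """Hypothetical scaled-sum reward; all weights and thresholds are illustrative."""
    frame_w, frame_h = frame_size
    x1, y1, x2, y2 = box

    # Centers of the bounding box and of the frame.
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    fx, fy = frame_w / 2.0, frame_h / 2.0

    # Euclidean center distance, normalized by the frame half-diagonal (assumed).
    dist = math.hypot(cx - fx, cy - fy) / math.hypot(fx, fy)

    # Imaginary frame-centered box with threshold width and height (assumed fractions).
    inside = (abs(cx - fx) <= thresh_w * frame_w / 2.0
              and abs(cy - fy) <= thresh_h * frame_h / 2.0)

    # Penalize distance, reward overlap, add a bonus when the target is centered.
    return -w_dist * dist + w_iou * iou_value + (w_box if inside else 0.0)
```

With these assumed defaults, a perfectly centered target with full overlap receives 1.5, while a target at the frame corner with no overlap receives -1.0.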

Datasets for Evaluation: VisDrone2019 and OTB-100
Various open-source datasets are available for measuring and evaluating the proposed method and for comparing it with other state-of-the-art models. Applying the same image or video sets to each algorithm reveals the advantages and weaknesses of the proposed method relative to other state-of-the-art models under different criteria. The VisDrone2019 [54] and OTB-100 [55] datasets are open-source benchmarks used for evaluation and for competing in different object detection and tracking challenges.
The VisDrone2019 [54] dataset was collected by the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China. The benchmark consists of 288 video clips comprising 261,908 frames and 10,209 static images, captured by various drone-mounted cameras and covering a wide range of aspects including location (14 different cities in China separated by thousands of kilometers), environment (urban and rural), objects (pedestrians, vehicles, bicycles, etc.), and density (sparse and crowded scenes). Notably, the dataset was collected using various drone platforms (i.e., different drone models), in different scenarios, and under various weather and lighting conditions. The frames are manually annotated with over 2.6 million bounding boxes of frequently occurring targets of interest, such as pedestrians, cars, bicycles, and tricycles. Some important attributes, including scene visibility, object class, and occlusion, are also provided for better data utilization.
The full OTB-100 [55] benchmark contains 100 sequences from the recent literature and supports an extensive evaluation of state-of-the-art online object tracking algorithms under various evaluation criteria, in order to understand how these methods perform within the same framework. The authors first constructed a large dataset with ground-truth object positions and extents for tracking and introduced sequence attributes for performance analysis. They also integrated most of the publicly available trackers into one code library with uniform input and output formats to facilitate large-scale performance evaluation. Moreover, the performance of 31 algorithms was extensively evaluated on the 100 sequences with different initialization settings. By analyzing the quantitative results, the authors identified effective approaches for robust tracking and provided potential future research directions in this field.

Evaluation and Discussion
The proposed reinforcement learning-based object tracking algorithm was explored and evaluated using AirSim [22], a well-known simulation platform that allows free exploration in a realistic simulated environment. The implementation of the proposed algorithm was connected to the VCE simulation platform through the AirSim Python client developed by Microsoft.
First, we trained the network to learn the object classes together with environmental feature information, so that it was ready for training with a reinforcement learning approach. After the object classification learning process was completed, its output was used in an integrated tracking framework to track objects with a drone agent. Figure 6 shows the evolution of the learning rate and loss function of the trained tracker. The learning-rate plot on the left shows only small variation in the learning outcome: because of the environmental conditions, the drone flew over an extensive area, which resulted in a small learning rate. The loss plot on the right shows that the loss increases over the training epochs as the agent learns a large-scale area of the virtual environment.
This training was conducted in offline mode to learn the characteristics and details of the object classes. Numerous sequential image sets taken from the realistic virtual world were used as training inputs. Only images from the realistic VCE virtual model were used in the trials to train the DQN tracker, which is based on a recurrent neural network. The following figures represent the average predicted action values MeanQ and MaxQ computed over the held-out set of states. Figure 7 shows the average action values MeanQ and MaxQ over the training epochs, which reflect the learning and prediction procedures of the recurrent neural network-based DQN tracker. Figure 7 (left) portrays the evolution of the training procedure when the virtual realistic images are used as training data. Figure 8 represents the evolution of both the epsilon score and the used replay memory during the tracker training process.
The epsilon-greedy exploration graph corresponds to choosing a random action with probability epsilon and otherwise exploiting the best-known action value; the epsilon value can reach nearly one.
Figure 8 (right) displays the replay memory, also known as the replay buffer or experience buffer, which contains a collection of experience tuples (s, a, r, s') with training information. The tuples were added gradually to the buffer as we interacted with the virtual realistic environment. Sampling a small batch of tuples from the replay buffer in order to learn the environment is known as experience replay. It allowed us to learn from individual tuples multiple times and so make better use of the experience gathered during training.
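Epsilon-greedy action selection and experience replay, as described above, can be sketched as follows. This is a generic illustration under stated assumptions, not the code used in our experiments: the class name `ReplayBuffer`, the function name `epsilon_greedy`, the fixed buffer capacity, and the extra done flag stored with each (s, a, r, s') tuple are choices made here.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of experience tuples; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        """Append one transition obtained from interacting with the environment."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Draw a random minibatch of stored transitions without replacement."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon; otherwise pick the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Because minibatches are drawn at random from the whole buffer, a single transition can contribute to several updates, and consecutive, highly correlated frames are less likely to appear in the same batch.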

[Figure 6: panels Learning Rate (left) and Loss (right).]
[Figure 7: panels MeanQ and MaxQ.]
Figure 8. The average achieved epsilon score (Epsilon, left) and used replay memory (Memory, right) during the training process (red: target and blue: DQN value loss in the epsilon graph; orange: used replay memory in GB). The epsilon graph shows the balance between exploration and exploitation: the y axis gives the probability of choosing to explore, while the x axis gives the training epochs. The memory graph shows the memory space used during training on the y axis against the training epochs on the x axis.
In summary, the basic idea of replay buffer memory, or experience replay memory, is to take advantage of past experience and use a random subset of it to update the Q-network, rather than using only the single most recent experience obtained during the tracking process. The buffer stores tuples of observation state, action, reward, done flag, and next state in order to keep the transitions obtained from the virtual environment.
After the offline training of the tracker, the network was updated and connected to the virtual realistic environment in order to be tested in real time. The drone agent produced several output parameters (shown in Table 1), which were configured before the testing process; they illustrate the virtual drone agent's behavior during testing, together with the summarized episode results (Table 2).
Table 1 lists the parameters of the drone agent while it was tested in the VCE model. It provides the drone agent's coordinates while learning the environment in random episodes, as well as the rewards of the DQN drone model. In every randomly chosen training episode, the location parameters show the drone agent's tracking path in the virtual environment. The Reward XY, Z, and +T parameters reflect the drone agent's adaptation to environmental conditions over sequential actions. While training in the realistic VCE, the drone agent obtains negative reward values when flying in different directions because of the environmental space and conditions; the behavior of the VCE model therefore affects the rewards. Additionally, Table 2 summarizes the action Q-value results during the testing process.
The results of the tracker in the realistic VCE model are presented in Table 2. The table summarizes randomly selected episodes of the testing procedure in terms of time steps, duration, total reward, average max Q-value, and average loss. During the testing phase, the drone flew in a random location and identified the object locations to track.

Comparison and Qualitative Results
The proposed tracking model was evaluated using VisDrone2019 [54] and OTB-100 [55] in order to compare it with recent state-of-the-art object tracking models based on deep reinforcement learning techniques. These datasets are available online as a variety of image and video sets, including training, testing, and challenge sets. Figure 9 compares the performance of the proposed tracking methodology with that of recent state-of-the-art trackers, namely ADNet [52] and ASRL Track [56], which are both based on the DRL (Deep Reinforcement Learning) strategy. The comparison was carried out by testing on video sets of the two public datasets, VisDrone2019 [54] and OTB-100 [55]. The figure depicts precision with respect to the location error threshold on the two public datasets. Table 3 shows the precise results of the numerical comparison.
In Table 3, the results of a comparison between recent RL-based object tracking approaches and the proposed model are highlighted. The table compares the accuracy, frames per second (FPS), and intersection over union (IoU) of the two existing approaches with those of our model. Among these approaches, our strategy outperformed the others when tested on the same open-source datasets. For the testing process, video inputs were used to examine the tracking capabilities of the approaches. Moreover, we provide some qualitative results in Figure 10.
Figure 10 shows the qualitative results of the tracking model with two types of targeted objects. The percentages of predicted and targeted object classes in Figure 10 indicate how well the object classes were predicted. The results in Figure 10 demonstrate good performance in terms of predicted and tracked object classes. We tested our proposed algorithm only in this virtual environment, and it performed much better than the other models, most of which have been tested with real video sequences. We have tested our algorithm only with the default simulation environment case, which is the normal weather condition.
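The precision measure plotted against the error threshold in Figure 9 can be illustrated with a short sketch. It assumes the usual OTB-style definition, in which precision at a threshold is the fraction of frames whose center location error (the Euclidean distance between predicted and ground-truth box centers, in pixels) does not exceed that threshold; the function names and the (x_min, y_min, x_max, y_max) box layout are assumptions made here, and the values in the usage note are toy numbers, not experimental results.

```python
import numpy as np


def center_errors(pred_boxes, gt_boxes):
    """Per-frame Euclidean distance between predicted and ground-truth box centers."""
    pred = np.asarray(pred_boxes, dtype=np.float64)
    gt = np.asarray(gt_boxes, dtype=np.float64)
    pred_centers = (pred[:, :2] + pred[:, 2:]) / 2.0
    gt_centers = (gt[:, :2] + gt[:, 2:]) / 2.0
    return np.linalg.norm(pred_centers - gt_centers, axis=1)


def precision_curve(errors, thresholds):
    """Fraction of frames whose center error is within each threshold."""
    errors = np.asarray(errors, dtype=np.float64)
    return np.array([(errors <= t).mean() for t in thresholds])
```

For two toy frames with center errors of 0 and 50 pixels, `precision_curve` evaluated at thresholds of 20 and 50 pixels gives 0.5 and 1.0.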

Conclusions
In this research, we have proposed a novel tracking technique in which a drone agent operates in the realistic virtual simulation platform VCE through the AirSim Python client. To predict and track objects and to learn the environment, we used an integrated recurrent neural network-based DQN tracker that was trained with several virtual image-based video sequences. The AirSim simulation platform allowed us to test our model in a virtual environment, gather crucial feature information, and identify the object classes autonomously. The AirSim API allowed us to test our DQN agent-based tracker easily by connecting it directly to the virtual drone simulation platform. Despite several challenges posed by a three-dimensional realistic virtual environment, our proposed model was successfully trained and achieved better performance with a recurrent prediction-based network integrated with an action decision technique. Our model can work autonomously in a virtual simulation model by applying deep RL agent solutions. Additionally, we tested our model on two different datasets, VisDrone2019 and OTB-100, and compared its performance with recent state-of-the-art techniques. The evaluation showed that our proposed technique performed better than the recent methods compared, with 92.07% and 82.35% precision on the VisDrone2019 and OTB-100 datasets, respectively.
In future work, we plan to improve the performance through fine-tuning and to perform more simulation experiments under different weather conditions. Moreover, we plan to test our model with other open-source video sequences in order to compare it with other conventional RL-based DQN trackers.

Data Availability Statement:
The original OTB-100 and VisDrone2019 datasets are available online at https://paperswithcode.com/dataset/otb (accessed on 1 January 2022) and http://aiskyeye.com/download/ (accessed on 1 January 2022), respectively. These datasets were used to compare the performance of the algorithm with recent state-of-the-art models.