Angle of Arrival Passive Location Algorithm Based on Proximal Policy Optimization

Location technology is playing an increasingly important role in urban life. Various active and passive wireless positioning technologies for mobile terminals have attracted research attention. However, positioning signals experience serious interference in high-density residential areas or in the interior of large buildings. The main type of interference is that caused by non-line-of-sight (NLOS) propagation. In this paper, we present a new method for optimizing the angle of arrival (AOA) measurement to obtain high-accuracy location results based on proximal policy optimization (PPO). PPO is a new family of policy gradient methods for reinforcement learning, which can be used to adjust the sampling data under different environments using stochastic gradient ascent. Therefore, PPO can correct the NLOS propagation errors to produce a clean AOA measurement data set without building an offline fingerprinting database. Then, we used the least squares method to calculate the location. The simulation result shows that the AOA passive location algorithm based on PPO produced more accurate location information.


Introduction
Location technology plays an important role in a variety of services, from mobile phone navigation in personal life, to search and rescue in natural disasters, and precision strikes in military applications. With the rapid development of location services, user demand is growing. Therefore, various active and passive wireless positioning technologies have attracted increasing attention and research [1][2][3].
In wireless positioning systems, positioning accuracy is the most fundamental index used to evaluate positioning performance. However, many factors affect the positioning accuracy, such as multipath propagation, multi-access interference, and non-line-of-sight (NLOS) propagation [4]. All these factors lead to errors in signal measurement information, resulting in decreased positioning accuracy. The main factor influencing positioning error is NLOS propagation [5]. NLOS propagation is caused by various obstacles between the mobile station (MS) and the base station (BS), interfering with the straight path of the signal wireless transmission, and the signal can only be transmitted through refraction or reflection, resulting in NLOS error. Compared with line-of-sight (LOS) propagation, signal propagation in a NLOS environment requires extra time and distance, resulting in signal arrival angle deviation and additional power loss. The NLOS error is generally only related to the environment in which the signal is propagated.
In wireless communication, many passive positioning technologies are used in research and applications, such as angle of arrival (AOA) [6], time of arrival (TOA) [7], and time difference of arrival (TDOA) [8]. Among them, AOA does not require the strict time synchronization that is a necessary condition for TOA and TDOA. Therefore, the passive AOA location algorithm has wide and flexible application prospects. According to the current research situation, the AOA has a high [...]. We propose a new method to optimize the AOA based on deep reinforcement learning in a NLOS environment. Reinforcement learning algorithms can be broadly divided into value-based and policy-based algorithms. Unlike value-based algorithms, policy-based algorithms are suitable for learning from continuous-time data. Among the policy-based algorithms, trust region policy optimization (TRPO) and proximal policy optimization (PPO) are typical policy optimization methods. PPO, which evolved from TRPO, is a simpler and more general approach than TRPO [26,27]. We therefore extend the state-of-the-art reinforcement learning algorithm, proximal policy optimization [28][29][30], to our error-correction training framework. The convergence problem of the policy gradient method is solved using the natural policy gradient method. In practice, however, the natural policy gradient involves a second-derivative matrix, which makes it non-scalable to large-scale problems; the computational complexity is too high for real tasks. PPO formalizes the constraint as a penalty in the objective function instead of imposing a hard constraint. Because the constraint is not enforced strictly, a first-order optimizer, such as gradient descent, can be used to optimize the objective. Even though the constraint is sometimes violated, the effect is small and the calculation is much easier. The policy uses all the data gathered from parallel measurements for training.
The parallel training framework not only considerably reduces the time cost of sample collection, but also enables the algorithm to be applied to various user equipment training in various scenarios.
The remainder of this paper is organized as follows: Section 2 describes the basic theory of multi-input multi-output (MIMO) system scenario, the AOA measurement in the MIMO, traditional AOA localization algorithm based on LS methods, the theory of proximal policy optimization, and the new AOA passive location algorithm for MIMO system based on PPO. Section 2 also gives the main flow chart of the proposed methods. Section 3 presents the simulated results. The simulation experiment includes the PPO training, the performance with different number of service base stations, cell radius, channel environment, and MS velocity. The experiment data is obtained from the AOA measurement error model and the actual NLOS communication scenario. The experimental results show that the algorithm proposed in this paper can produce high accuracy location results. Section 4 provides a summary, our conclusions, and recommendations for future research.

Scenarios
Assume that the base station has M transmit antennas, the mobile terminal has N receive antennas, and the wireless channel has L propagation paths. The received signal of the jth receive antenna at time t can then be defined as [31,32]: where $h^{l}_{i,j}(t)$ is the channel model between the ith transmit antenna and the jth receive antenna at time t, and $t_0$ is the multipath delay. Each multipath subchannel model of the MIMO system can be defined as: Equation (2) shows that in the non-ideal channel environment, the signal delay is always caused by refraction, scattering, reflection, and the influence of NLOS propagation. Therefore, the measurement of AOA between the mobile station (MS) and base station (BS$_i$) can be expressed as:
$$\alpha_{i,j} = \alpha_{0,i,j} + \alpha_{n,i,j} + \alpha_{e,i,j} \quad (3)$$
where $\alpha_{0,i,j}$ is the AOA value in the LOS environment, $\alpha_{n,i,j}$ is the environment error, $\alpha_{e,i,j}$ is the additional angle error caused by NLOS propagation, i is the index of the transmit antenna, and j is the index of the receive antenna.

AoA Passive Location Based on PPO
To solve this problem, we propose a new AOA passive location algorithm for a MIMO system based on PPO. PPO is used to modify the AOA measurement signal in the NLOS environment to reduce the NLOS error in the data. Then, the least squares (LS) method is used for location, which can effectively improve the positioning accuracy of the system. In this section, we introduce the new AOA passive location algorithm.

Traditional AOA Localization Algorithm
Assume that the number of base stations is K, the position of the MS is $(x, y)$, and the position of BS$_p$ is $(x_p, y_p)$. The AOA value measured at each base station between the ith transmit antenna and the jth receive antenna is $\theta_{p,i,j}$, which can be defined as: Then, we obtain the following formula: When errors exist in the measured AOA data, the measurement error $\Psi$ can be defined as: where $h$, $G_a$, and $x$ are defined as follows: Therefore, the MS position can be estimated by the LS algorithm as follows: The LS algorithm for AOA has a relatively small error and performs well in the LOS environment. However, its error is relatively large in the NLOS environment.
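The LS estimate described above can be illustrated with a minimal sketch. It assumes the common linearization in which each bearing $\theta_p$ constrains the MS to the line $\sin\theta_p\, x - \cos\theta_p\, y = \sin\theta_p\, x_p - \cos\theta_p\, y_p$, stacks one such row per base station into $G_a$ and $h$, and solves the system in the least-squares sense. The function name and the sin/cos form (chosen here instead of a tangent form to avoid singularities near ±90°) are assumptions for illustration and are not necessarily the exact matrices used in this paper.

```python
import numpy as np

def ls_aoa_position(bs_xy, theta):
    """Least-squares MS position from AOA bearings (illustrative sketch).

    bs_xy : (K, 2) array of base-station coordinates (x_p, y_p)
    theta : (K,) array of bearings in radians, measured at each BS
            from the x axis toward the MS
    """
    bs_xy = np.asarray(bs_xy, dtype=float)
    theta = np.asarray(theta, dtype=float)
    s, c = np.sin(theta), np.cos(theta)
    G = np.column_stack((s, -c))             # one row per base station
    h = s * bs_xy[:, 0] - c * bs_xy[:, 1]    # right-hand side of the line equations
    est, *_ = np.linalg.lstsq(G, h, rcond=None)
    return est

# Hypothetical layout: three base stations and one MS, error-free bearings.
bs = np.array([[0.0, 0.0], [1000.0, 0.0], [0.0, 1000.0]])
ms = np.array([300.0, 400.0])
bearings = np.arctan2(ms[1] - bs[:, 1], ms[0] - bs[:, 0])
print(ls_aoa_position(bs, bearings))
```

With error-free bearings the estimate coincides with the true position; adding an NLOS bias to `bearings` shows how quickly the plain LS solution degrades.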

Proximal Policy Optimization
The PPO algorithm, proposed by OpenAI, is a new policy gradient algorithm [27] with much lower implementation complexity than the trust region policy optimization (TRPO) algorithm [26]. The PPO algorithm has two main implementations. The first is realized by CPU simulation, and the second is realized by GPU simulation, whose simulation speed is more than three times that of the first. Compared with the traditional neural network algorithm, the PPO method achieves the best balance among algorithm complexity, precision, and ease of implementation.
The basic PPO algorithm theory can be defined as:
$$L(s_t, a_t, \theta_k, \theta) = \min\left( r_t(\theta)\, A^{\theta_k}(s_t, a_t),\ \mathrm{clip}\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) A^{\theta_k}(s_t, a_t) \right) \quad (11)$$
where the output of the policy action in PPO is a discrete probability distribution, $p_{\theta}(a_t \mid s_t)$ is the probability of the action under the new policy, and $p_{\theta_k}(a_t \mid s_t)$ is the probability of the action under the old policy. Therefore, the ratio of the probability under the new and old policies can be defined as
$$r_t(\theta) = \frac{p_{\theta}(a_t \mid s_t)}{p_{\theta_k}(a_t \mid s_t)} \quad (12)$$
where $\theta$ is the policy parameter. Clip(*) is an amplitude limiting function; its maximum value is $1 + \varepsilon$ and its minimum value is $1 - \varepsilon$. $\varepsilon$ is a hyperparameter, usually in the range from 0.1 to 0.2, which bounds how far the new policy may move from the old policy. The trust region update can be implemented using this objective function, which is compatible with stochastic gradient descent. The equation of Clip(*) is defined as follows:
$$\mathrm{clip}\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) = \begin{cases} 1+\varepsilon, & r_t(\theta) > 1+\varepsilon \\ 1-\varepsilon, & r_t(\theta) < 1-\varepsilon \\ r_t(\theta), & \text{other} \end{cases} \quad (13)$$
where $A^{\theta_k}(s_t, a_t)$ is the estimated advantage at time t.
When $A^{\theta_k}(s_t, a_t) > 0$, the state-action pair has a positive advantage, so the contribution of Equation (11) to the objective can be simplified as:
$$L(s_t, a_t, \theta_k, \theta) = \min\left( r_t(\theta),\, 1+\varepsilon \right) A^{\theta_k}(s_t, a_t) \quad (14)$$
Because the advantage is positive, the objective increases as the action becomes more probable. When $p_{\theta}(a_t \mid s_t) > (1 + \varepsilon)\, p_{\theta_k}(a_t \mid s_t)$, the minimum takes effect and the value of Equation (14) reaches its upper limit $(1 + \varepsilon) A^{\theta_k}(s_t, a_t)$, which ensures that the new policy is not too different from the old policy. For this state, the algorithm therefore encourages the new policy to increase the probability of the action only up to this limit, as shown in Figure 1.
When $A^{\theta_k}(s_t, a_t) \le 0$, the state-action pair of the PPO algorithm has a negative advantage, so the contribution of Equation (11) to the objective can be simplified as:
$$L(s_t, a_t, \theta_k, \theta) = \max\left( r_t(\theta),\, 1-\varepsilon \right) A^{\theta_k}(s_t, a_t) \quad (15)$$
Because the advantage is negative, the objective increases as the action becomes less probable. When $p_{\theta}(a_t \mid s_t) < (1 - \varepsilon)\, p_{\theta_k}(a_t \mid s_t)$, the maximum takes effect and the value of Equation (15) reaches the limit $(1 - \varepsilon) A^{\theta_k}(s_t, a_t)$, which ensures that the new policy is not far from the old policy (as shown in Figure 2).
Figure 2. Diagram of PPO algorithm when A > 0.
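The clipping behaviour described above can be checked numerically with a short sketch. It evaluates the per-sample clipped surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A), for given probability ratios and advantages; the function name and the default ε = 0.2 are choices made here for illustration.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

# Positive advantage with a ratio above 1+eps, and
# negative advantage with a ratio below 1-eps.
values = ppo_clip_objective([1.5, 0.5], [1.0, -1.0])
print(values)
```

In the first case the contribution is capped at (1 + ε)A, and in the second case it is capped at (1 − ε)A, so in both cases the objective gives no further incentive to move the policy beyond the clipping range.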

Amendment of AOA Measurement Based on PPO
Here, we use the PPO method to correct the NLOS error and the measurement error, which brings the AOA value close to the real value. The operation process is as follows:
Step 1: Sample a set of $N_1$ AOA measurement values $\{NAOA_i, i = 1, ..., N_1\}$ under the NLOS environment. Sample a set of $N_2$ AOA measurement values $\{AOA_i, i = 1, ..., N_2\}$ without any interference under the LOS environment. Take $NAOA_i$ as the PPO training data and take $AOA_i$ as the PPO training target data. The relationship between $NAOA_i$ and $AOA_i$ can be defined as:
$$NAOA_i = AOA_i + n_i + \alpha_{e,i}$$
where $n_i$ is the system measurement error and $\alpha_{e,i}$ is the additional angle error caused by NLOS.
Step 2: PPO Training. Assume that $NAOA_i$ and $AOA_i$ are sampled from ten related base stations. Therefore, the input of PPO can be represented as: The output of PPO can be expressed as: The variable P represents the training input data of PPO. The variable T represents the training target data of PPO.
The PPO training process is as follows. Figure 3 shows that the input unit produces a feedback action when the environmental state is perceived through the P data; the environment then sends a reinforcement signal r(t) to the PPO unit. Simultaneously, the PPO unit updates the knowledge database of the policy module. The actor module selects an action a(t) in the range of ε according to the current state P(t) and the reinforcement information r(t), and then acts on the current environment to change it. In this paper, the action a(t) is the value that makes the AOA value as close to the real value as possible. The state P(t) is the current AOA value, and the reinforcement information r(t) tells the PPO model whether the AOA value should be corrected or not. After multiple iterations of the actor and critic, we can obtain an optimal policy.
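The interaction between state, action, and reinforcement signal can be sketched as a toy environment. Everything in it is an assumption made for illustration: the class name `AoaCorrectionEnv`, the action bound, the additive form of the correction, and the reward (negative absolute angle error limited to [−1, 0]). The paper does not state its reward formula, so this is not the authors' implementation.

```python
import numpy as np

class AoaCorrectionEnv:
    """Toy environment: state = NLOS AOA sample, action = additive correction."""

    def __init__(self, naoa, aoa, max_action=0.2):
        self.naoa = np.asarray(naoa, dtype=float)   # P: NLOS measurements
        self.aoa = np.asarray(aoa, dtype=float)     # T: LOS targets
        self.max_action = max_action                # assumed action bound (rad)
        self.i = 0

    def reset(self):
        self.i = 0
        return self.naoa[self.i]

    def step(self, action):
        a = float(np.clip(action, -self.max_action, self.max_action))
        corrected = self.naoa[self.i] + a
        # Assumed reward: negative absolute angle error, limited to [-1, 0].
        reward = -min(abs(corrected - self.aoa[self.i]), 1.0)
        self.i += 1
        done = self.i >= len(self.naoa)
        next_state = None if done else self.naoa[self.i]
        return next_state, reward, done

env = AoaCorrectionEnv(naoa=[0.55, 1.10], aoa=[0.50, 1.00])
state = env.reset()
_, r_good, _ = env.step(-0.05)    # correction that cancels the error
_, r_none, done = env.step(0.0)   # no correction applied
```

A correction that cancels the NLOS bias earns a reward near 0, whereas leaving the measurement untouched earns a negative reward proportional to the remaining angle error.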
According to PPO theory, when the advantage of the state-action pair is positive, the PPO increases the amendment value of the AOA measurement. When the advantage of the state-action pair is negative, the PPO decreases the amendment value of the AOA measurement: where $\xi_i$ is the amendment value and $\gamma_i$ is the increase or decrease part of $\xi_i$. After several iterations, the action $a_t$ ensures that the output P of the environment module is close to the value of T.
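Read literally, the rule above suggests an update of the following form. This is only a sketch under the assumption that the amendment is adjusted additively by $\gamma_i$ and then added to the measured angle; the symbols $\hat{A}_t$ (estimated advantage) and $\widehat{AOA}_i$ (corrected angle) are introduced here for illustration and do not appear in the original text.

```latex
\xi_i \leftarrow
\begin{cases}
\xi_i + \gamma_i, & \hat{A}_t > 0 \\
\xi_i - \gamma_i, & \hat{A}_t \le 0
\end{cases}
\qquad
\widehat{AOA}_i = NAOA_i + \xi_i
```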
A trained PPO model is used to correct the measurement data of AOA. The revised AOA is used to estimate the position using Equation (10).
In the process, the PPO policy update can be defined as: Finally, the PPO training process is finished. According to the above introduction, the main flow of the AOA passive location algorithm for a MIMO system based on PPO is shown in Figure 4.
In summary, the trained PPO model can correct the AOA measurement data and thereby reduce the NLOS error of the AOA measurement values.

Experiment
In this study, the proposed AOA passive location algorithm for a MIMO system based on PPO was simulated to estimate the location accuracy.
In this experiment, 1000 MS were located with uniform distribution as the target data. The training measurement data were simulated according to the AOA measurement error model. The testing measurement data include two parts: the first part was obtained according to the AOA measurement error model, and the second part was obtained from real data in the actual NLOS communication scenario.
We used the COST259 model to create the scenario and determined the additional delay error caused by NLOS, which follows a Gaussian distribution [33]. The AOA error follows a Gaussian distribution with a mean value of 0 and a standard deviation of 0.04 rad/s. The delay error caused by NLOS obeys the general COST259 urban model.
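The simulated measurement data described above can be sketched as follows. The sketch draws 1000 MS positions uniformly and adds zero-mean Gaussian AOA error with a standard deviation of 0.04, as stated in the text. The rest is assumed for illustration: the function name, the circular cell centred at the origin, the base-station layout, and the cell radius. The NLOS term of the COST259 model is not reproduced here.

```python
import numpy as np

def simulate_aoa(bs_xy, n_ms=1000, cell_radius=1000.0, sigma=0.04, seed=0):
    """Return MS positions, true bearings, and noisy bearings (sketch)."""
    bs_xy = np.asarray(bs_xy, dtype=float)
    rng = np.random.default_rng(seed)
    # Uniform over a disc: sqrt on the radius keeps the area density flat.
    r = cell_radius * np.sqrt(rng.uniform(0.0, 1.0, n_ms))
    phi = rng.uniform(0.0, 2.0 * np.pi, n_ms)
    ms_xy = np.column_stack((r * np.cos(phi), r * np.sin(phi)))
    dx = ms_xy[:, None, 0] - bs_xy[None, :, 0]
    dy = ms_xy[:, None, 1] - bs_xy[None, :, 1]
    true_aoa = np.arctan2(dy, dx)                        # shape (n_ms, K)
    noisy_aoa = true_aoa + rng.normal(0.0, sigma, true_aoa.shape)
    return ms_xy, true_aoa, noisy_aoa

# Hypothetical base-station layout.
bs = np.array([[0.0, 0.0], [1500.0, 0.0], [0.0, 1500.0]])
ms_xy, true_aoa, noisy_aoa = simulate_aoa(bs)
```

An NLOS bias drawn from whatever channel model is in use would be added to `noisy_aoa` in the same way to produce the training input, with `true_aoa` serving as the training target.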

Simulation Experiment Parameters
The parameters of the simulation scenarios are listed in Table 1 and the parameters of the PPO method are shown in Table 2. The system performance of the algorithm proposed in this paper was simulated and analyzed. The number of PPO training iterations affects the training performance of PPO: when the number of iterations is small, the training performance is poor, and it improves as the number of iterations grows. Considering both the training performance and the training time of PPO, the number of iterations is set to 5000 in this paper. The actor and critic learning rates affect the learning efficiency of PPO: a low learning rate causes poor learning efficiency, while a large learning rate can prevent the training from converging. Considering the efficiency and convergence of PPO training, we set the actor and critic learning rates to 0.0001 and 0.0002, respectively.
We placed 7 base stations on the campus of Beijing University of Posts and Telecommunications for real NLOS data collection, as shown in Figure 4.
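For reference, the PPO settings stated in this section can be gathered into a small configuration object. This is a sketch for illustration: the class and field names are hypothetical, and the clipping parameter is an assumed value taken from the upper end of the usual 0.1 to 0.2 range, since the text does not give the value used in the experiments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    train_iterations: int = 5000   # stated in the text
    actor_lr: float = 1e-4         # stated in the text
    critic_lr: float = 2e-4        # stated in the text
    actor_updates: int = 10        # "Actor iteration number" listed in Table 2
    critic_updates: int = 10       # "Critic iteration number" listed in Table 2
    clip_eps: float = 0.2          # assumption: not specified for the experiments

cfg = PPOConfig()
print(cfg)
```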

Table 2. PPO method parameters.
No.  Parameter                Value
2    Actor learning rate      0.0001
3    Critic learning rate     0.0002
4    Actor iteration number   10
5    Critic iteration number  10
The proposed method has been validated using a dual-channel receiver based on the Universal Software Radio Peripheral (USRP) X310, as shown in Figure 5. The USRP X310 contains 2 Tx and 2 Rx ports with fully coherent 2 × 2 MIMO capabilities. The coherency between the Rx ports is an essential feature for extracting an accurate phase difference between the received signals. Moreover, the USRP X310 covers RF frequencies from DC to 6 GHz with a maximum ADC sample rate of 200 MS/s. Thus, the USRP X310 was an ideal candidate for this project. The USRP X310 was configured with the GNU Radio Companion (GRC) interface. GRC is a graphical user interface for building GNU Radio flow graphs, or software circuits. The GNU Radio flow graph enables the SDR card to receive signals from the antennas, spectrally isolate them, and then calculate the phase difference between the received signals. Aziz [34] has completed an AOA data acquisition model based on the USRP device. Following that model, we collected the original data and used it for the subsequent PPO algorithm verification.
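The step from a measured phase difference to an angle can be sketched with the standard two-element interferometer relation Δφ = 2π d sin θ / λ. The sketch is not taken from the acquisition model in [34]: the function name, the carrier frequency, the half-wavelength antenna spacing, and the convention that θ is measured from the array broadside are all assumptions made for illustration.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def aoa_from_phase(delta_phi, spacing_m, freq_hz):
    """Angle from broadside (rad) for a two-antenna array (sketch)."""
    wavelength = C / freq_hz
    s = delta_phi * wavelength / (2.0 * np.pi * spacing_m)
    # Clip guards against |s| slightly above 1 caused by noise.
    return np.arcsin(np.clip(s, -1.0, 1.0))

freq = 2.4e9                        # hypothetical carrier frequency
d = 0.5 * C / freq                  # half-wavelength spacing avoids ambiguity
theta_true = np.deg2rad(30.0)
dphi = 2.0 * np.pi * d * np.sin(theta_true) / (C / freq)
theta_est = aoa_from_phase(dphi, d, freq)
```

With half-wavelength spacing the phase difference stays within ±π, so every angle between −90° and 90° maps to a unique phase value.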

Simulation and Experimental Analysis
According to the basic flow chart in Figure 6 and the simulation parameters in Tables 1 and 2, we analyzed the factors that influence positioning performance, including the number of PPO training iterations, the cell radius, the number of service base stations, the number of MIMO antennas, the channel environment, the MS velocity, and the amount of training measurement data. The measurement error $\phi_i$ of the AOA measurement data obeys a Gaussian distribution. The joint probability distribution of the measurement errors can be defined as: where $\delta_i$ is the standard deviation of the measurement error. In the simulation process, we also compared our method with other algorithms, such as the constrained weighted least squares (CWLS) algorithm [35] and traditional AOA methods.
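For independent, zero-mean Gaussian errors, the joint distribution takes the usual product form shown below. This is a sketch under the assumptions of independence, zero mean (as stated for the AOA error in the experiment setup), and K measurements; the exact expression used in the paper may differ.

```latex
f(\phi_1, \ldots, \phi_K) = \prod_{i=1}^{K} \frac{1}{\sqrt{2\pi}\,\delta_i}
\exp\!\left(-\frac{\phi_i^{2}}{2\delta_i^{2}}\right)
```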

PPO Training
During the PPO training, a high-performance computer is necessary. In this study, the whole training process was conducted on a computer with a 6-core, 4 GHz Intel i7-8700K CPU (Intel, Santa Clara, CA, USA), 42 GB of memory, and a GeForce GTX 1080 graphics card (NVIDIA, Santa Clara, CA, USA). We defined the evaluation index of the final location effect as the location error, that is, the error of the location computed from the AOA measurement values corrected by PPO. If the location error is 0 after training, the training is complete and the goal is reached. When the error after training is large, the training target has not been reached. Because it is impossible to guarantee that every training error reaches 0%, in practice we set the training target to a positioning error within 5%.
To present the performance of PPO training, we use the reinforcement signal r(t) (Figure 3) as the reward value. The simulation result is as follows. In Figure 7, the abscissa is the number of iterations and the ordinate is the return value of the algorithm. We calculated the average reward of each simulation training period, then analyzed the average reward value obtained during the whole training process, and finally normalized the simulation result values to [−1, 0]. The simulation result shows that the actor can find the preset target position within 300 steps in the first training period, ensuring that the PPO algorithm maintains high performance throughout the training process. A stable reward value leads the actor to the target position to complete the PPO training goal.
Figure 8 compares the AOA data before PPO with the collected AOA data after PPO. The result shows that the NLOS and measurement errors are corrected by PPO, so that the corrected AOA value is as close to the real value as possible.
Figure 9 shows that the PPO performs better as the training data length increases. The testing data obtained from the actual NLOS scenario are more complex than the testing data from the COST259 model; therefore, the performance in the actual NLOS scenario is worse than the performance with the COST259 model. When the training data length is longer than 4000, the PPO training performance tends to be stable because the PPO can make an optimal adjustment decision with a large set of training AOA measurement data.
Figure 10 shows the performance with different numbers of service base stations; the performance in the actual NLOS scenario is worse than the performance with the COST259 model. As the number of service base stations increases, the root-mean-square error (RMSE) of the location gradually decreases. The AOA method is considerably influenced by the number of base stations because of the NLOS error. The CWLS method has a lower RMSE; however, both of them perform poorly when the number of service base stations is less than four. The method proposed in this paper produces the most accurate results because the PPO model can constantly correct the AOA measurements and reduce the influence of NLOS and measurement error.
Figure 11 depicts the performance of different methods when the cell radius ranges from 500 m to 3000 m; the performance in the actual NLOS scenario is worse than the performance with the COST259 model. Figure 11 shows that the location RMSE increases gradually with the cell radius, because a large cell always has a large NLOS error. When the cell radius is only 1000 m, all compared methods perform similarly, especially the CWLS method and the proposed one. When the cell radius is larger than 2000 m, the differences begin to grow. The AOA methods perform poorly because they compute the location directly from the AOA data. Compared with them, the CWLS method has a lower RMSE. The method proposed in this paper obtains the best performance because the PPO model can reduce the influence of NLOS and measurement error.
Figure 12 depicts the performance in different channel environments, such as the additive white Gaussian noise (AWGN) channel and the Rayleigh multipath channel.
Figure 12. The accuracy of location with different channel. On the x axis, 1 is the proposed method, 2 is the constrained weighted least squares (CWLS) method, and 3 is the AOA method.

Channel Environment
In the AWGN channel, both the proposed algorithm and the CWLS method produce good location results. In a Rayleigh multipath channel, the location RMSE increases because of the additional multipath influence. Even so, the proposed method still produces the best performance because PPO can eliminate the multipath influence to some extent. Figure 13 depicts the performance of the different methods at different MS speeds; performance in the actual NLOS scenario is worse than under the COST259 model.

MS Velocity
Figure 13 shows that as the MS speed increases, the location RMSE gradually increases. The AOA methods are considerably influenced by MS speed, with the error growing geometrically. The CWLS method can reduce the location error at different MS speeds to a certain extent. The method proposed in this paper produces the best performance because the PPO model can reduce the influence of multipath propagation.

Conclusions
To address the problem of AOA location, in this paper, we presented a new AOA passive location algorithm for a MIMO system based on PPO. This method uses PPO to correct the NLOS and measurement errors, and then determines the passive AOA location using the least squares method. The experimental results showed that the proposed algorithm can determine the AOA location with high accuracy. As the training data length and the number of base stations increase, the RMSE of the proposed method can fall below 3 m. In addition, the RMSE of the proposed method is mostly below 10 m at different MS velocities. PPO exploits the ability to approximate arbitrary nonlinear mappings and the fast learning property of deep reinforcement learning to effectively overcome the shortcomings of traditional methods, such as AOA. In the future, we will apply the localization algorithm based on PPO to more complex sensor networks and channel environments.
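The final least squares step summarized above can be illustrated with a minimal sketch. It assumes a 2-D geometry, AOA values that have already been corrected, and bearings given in radians measured from the positive x axis at each base station toward the MS. The function name `aoa_least_squares` and the unweighted normal-equation solution are illustrative assumptions, not the paper's implementation. Each bearing defines a line through its base station, sin(θ_i)·x − cos(θ_i)·y = sin(θ_i)·x_i − cos(θ_i)·y_i, and the estimate is the point that minimizes the sum of squared residuals of these equations.

```python
import math


def aoa_least_squares(stations, bearings):
    """Estimate the MS position (x, y) from bearing lines (unweighted LS).

    stations: sequence of (x_i, y_i) base station coordinates in metres.
    bearings: sequence of angles theta_i in radians, measured at each
              station from the +x axis toward the MS (assumed convention).
    """
    if len(stations) != len(bearings) or len(stations) < 2:
        raise ValueError("need at least two stations with one bearing each")

    # Accumulate the 2x2 normal equations (A^T A) p = A^T b, where each row
    # of A is (sin(theta_i), -cos(theta_i)) and b_i is that row dotted with
    # the station position.
    s11 = s12 = s22 = t1 = t2 = 0.0
    for (xi, yi), theta in zip(stations, bearings):
        a1, a2 = math.sin(theta), -math.cos(theta)
        bi = a1 * xi + a2 * yi
        s11 += a1 * a1
        s12 += a1 * a2
        s22 += a2 * a2
        t1 += a1 * bi
        t2 += a2 * bi

    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("bearing lines are (nearly) parallel")

    # Closed-form solution of the symmetric 2x2 system.
    x = (s22 * t1 - s12 * t2) / det
    y = (s11 * t2 - s12 * t1) / det
    return x, y
```

With noise-free bearings the bearing lines intersect at a single point and the estimate coincides with the true position. With residual NLOS or measurement error the lines no longer meet, and the estimate is the least squares compromise between them.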