# Stereoscopic Projection Policy Optimization Method Based on Deep Reinforcement Learning


## Abstract


## 1. Introduction

## 2. Selection of DRL Framework

#### 2.1. Selection of Model-Based RL and Model-Free RL

#### 2.2. Selection of On-Policy and Off-Policy

#### 2.3. Selection of Episodic Update and Temporal-Difference Update

#### 2.4. Selection of Value-Based RL and Policy-Based RL

#### 2.5. DRL

## 3. Construction of Stereoscopic Projection Policy Model Based on DRL

#### 3.1. Operational Concept

#### 3.2. State Space

#### 3.3. Action Space

#### 3.4. Reward Function

**Definition 1.** *The reward function consists of five components:*

1. ${R}_{1}$ is the loss of the red side, that is, the force lost by the red side within a given simulation time step. This value is negative and is obtained by accumulating the scores of the specific equipment lost.
2. ${R}_{2}$ is a one-time reward for a successful red landing. It is a positive number and is granted within the time step in which the landing succeeds. If the projection platform is destroyed during the projection, this reward cannot be obtained.
3. ${R}_{3}$ is a continuous reward for the survival of the red landing troops. It is a positive number. At every time step after a successful projection, the landing troops are checked; if they are still alive, the reward is granted for that time step.
4. ${R}_{4}$ is the reward for the red side completing the occupation task. After the end of the last simulation time step, the landing position is checked. If red units are present and no blue units remain at the position, the occupation is regarded as successful, and a positive reward is given.
5. ${R}_{5}$ is a penalty on the number of red delivery platforms and is a negative number. Its value is negatively correlated with the delivery forces that have acted in the current time step: the more delivery forces employed, the smaller (more negative) the value, i.e., the larger the penalty. Although more projection platforms would deliver more land troops, this conflicts with the goal of achieving the combat objective with the minimum transport capacity. The negative feedback on platform quantity in the reward function therefore enforces this requirement.
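The five components above combine into a single per-step scalar reward. A minimal Python sketch follows; the function name, the magnitudes of the individual rewards, and the simple unweighted summation are all illustrative assumptions, not the paper's exact implementation:

```python
# Hedged sketch: combine the five reward components into one per-step
# scalar. Signs follow the definitions above: R1 and R5 are penalties
# (negative), while R2-R4 are positive rewards. The concrete magnitudes
# (10.0, 1.0, 50.0, 0.1) are placeholder assumptions.

def step_reward(red_losses: float,
                landing_succeeded: bool,
                troops_alive: bool,
                occupation_done: bool,
                platforms_acting: int,
                platform_penalty: float = 0.1) -> float:
    r1 = -abs(red_losses)                      # R1: accumulated equipment losses (negative)
    r2 = 10.0 if landing_succeeded else 0.0    # R2: one-time landing reward
    r3 = 1.0 if troops_alive else 0.0          # R3: per-step survival reward
    r4 = 50.0 if occupation_done else 0.0      # R4: terminal occupation reward
    r5 = -platform_penalty * platforms_acting  # R5: penalty grows with platforms used
    return r1 + r2 + r3 + r4 + r5
```

A weighted sum (or per-component normalization) could be substituted without changing the structure; the key design point from the definition is that R5 opposes R2–R4, pushing the policy toward the minimum transport capacity that still achieves the objective.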

#### 3.5. Value Function and Policy Function

## 4. Model Training and Experimental Results Analysis

#### 4.1. Model Training Process of Interactive Learning

#### 4.2. Analysis of Training Results

#### 4.3. Optimization Policy Analysis

#### 4.4. Algorithm Comparison and Analysis

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References


**Figure 1.** Pre-experiment results. (**a**) simulation pre-experiment results (part); (**b**) heatmap of character choice.

| Serial Number | Combat Power | Status Parameters |
|---|---|---|
| 1 | Combat units (except the projection units), 69 | Remaining quantity |
| 2 | Projection units (except transport helicopters), 3 | Remaining quantity, current speed, departure status |
| 3 | Transport helicopter units, 3 | Remaining quantity, current speed, departure status, current altitude |
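The status parameters in the table above can be flattened into a fixed-length vector as input to a neural policy. A sketch under stated assumptions: the unit counts (69 combat units, 3 projection units, 3 transport helicopters) come from the table, while the feature ordering and the lack of normalization are illustrative choices:

```python
# Hedged sketch: flatten the tabulated state features into one vector.
# Layout: 69 remaining-quantity values, then 3 x (remaining, speed,
# departed) for projection units, then 3 x (remaining, speed, departed,
# altitude) for transport helicopters -> 69 + 9 + 12 = 90 dimensions.
import numpy as np

def build_state(combat_remaining, proj_units, heli_units):
    features = list(combat_remaining)                    # 69 features
    for remaining, speed, departed in proj_units:        # 3 x 3 = 9 features
        features += [remaining, speed, float(departed)]
    for remaining, speed, departed, alt in heli_units:   # 3 x 4 = 12 features
        features += [remaining, speed, float(departed), alt]
    return np.asarray(features, dtype=np.float32)

# Example: all units intact, nothing departed yet.
state = build_state([1.0] * 69,
                    [(1.0, 0.0, False)] * 3,
                    [(1.0, 0.0, False, 0.0)] * 3)
```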

| Serial Number | Combat Power | Action Space |
|---|---|---|
| 1 | Inactive projection force | [0, 1] (0 means to hold, 1 means to sail) |
| 2 | Projection force that has not yet acted, but has decided to act within this time step | Scale, quantity, speed, altitude (helicopter) |
| 3 | Acting projection force | Speed, altitude (helicopter) |
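The action space is thus conditional: which fields a unit may set depends on its phase. A minimal sketch of this branching; the field names and the dictionary-based unit representation are illustrative assumptions, and only the three cases themselves come from the table:

```python
# Hedged sketch of the phase-dependent action structure in the table above.

def legal_action_fields(unit: dict) -> list:
    """Return the action fields a projection unit may set this time step."""
    if not unit["departed"] and not unit["tasked"]:
        return ["sail"]                            # row 1: choose 0 (hold) or 1 (sail)
    if unit["tasked"] and not unit["departed"]:
        fields = ["scale", "quantity", "speed"]    # row 2: configure the sortie
        if unit["is_helicopter"]:
            fields.append("altitude")              # helicopters also set altitude
        return fields
    fields = ["speed"]                             # row 3: already under way
    if unit["is_helicopter"]:
        fields.append("altitude")
    return fields
```

This per-phase masking keeps the effective action space small at each step, which generally eases policy learning compared with exposing all fields at all times.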


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

An, J.; Si, G.-Y.; Zhang, L.; Liu, W.; Zhang, X.-C. Stereoscopic Projection Policy Optimization Method Based on Deep Reinforcement Learning. *Electronics* **2022**, *11*, 3951.
https://doi.org/10.3390/electronics11233951
