Revising the Observation Satellite Scheduling Problem Based on Deep Reinforcement Learning

Abstract: Earth observation satellite task scheduling research plays a key role in space-based remote sensing services. An effective task scheduling strategy can maximize the utilization of satellite resources and obtain larger objective observation profits. In this paper, inspired by the success of deep reinforcement learning in optimization domains, the deep deterministic policy gradient algorithm is adopted to solve a time-continuous satellite task scheduling problem. Moreover, an improved graph-based minimum clique partition algorithm is proposed for preprocessing in the task clustering phase by considering the maximum task priority and the minimum observation slewing angle under constraint conditions. Experimental simulation results demonstrate that the deep reinforcement learning-based task scheduling method is feasible and performs much better than traditional metaheuristic optimization algorithms, especially in large-scale problems.


Introduction
Earth observation satellites (EOSs) are platforms equipped with optical instruments that take photographs of specific areas at the request of users [1]. EOSs have been extensively employed in scientific research, mainly in environment and disaster surveillance [2], ocean monitoring [3], agricultural harvesting [4], etc. However, with the increase in multi-user and multi-satellite space application scenarios [5,6], it is becoming more difficult to meet various observation requirements under the limitation of satellite resources. Therefore, an effective EOS scheduling algorithm plays an important role in delivering high-quality space-based information services: it not only guides the corresponding EOSs on which actions to perform next, but also controls when the observations start [5]. The main purpose is to maximize the observation profit within the limited observation time window and under other resource limits (for example, the available energy and the remaining data storage) [7,8].
The EOS scheduling problem (EOSSP) is well known as a complex non-deterministic polynomial (NP)-hard, multi-objective combinatorial optimization problem [9,10]. Inspired by the increasing demand for scheduling EOSs effectively and efficiently, the study of the EOSSP has gained more and more attention. Wang et al. [7] summarized that the EOSSP could be divided into time-continuous and time-discrete problems. In time-continuous models [11][12][13], a continuous decision variable is introduced to represent the observation start time for each visible time window (VTW), and a further decision variable is defined to check whether tasks are scheduled or not. On the other hand, for time-discrete models [14][15][16], each time window generates multiple observation tasks for the same target. In this way, each candidate task has a determined observation time, and binary decision variables are introduced to represent whether a task is operated in a specific time slice.
General solving algorithms are usually classified into exact, heuristic, metaheuristic [17,18] and machine learning methods [7]. Exact methods, such as branch and bound (BB) [19] and mixed integer linear programming (MILP) [15], have deterministic solving steps and can solve small problems to optimality, but it is nearly impossible to build a deterministic model for a larger-scale problem. Heuristic methods can speed up the process by finding a satisfactory solution, but they depend on a specific heuristic policy, and the policy is not always feasible. Jang et al. [20] proposed a heuristic solution approach to solve the image collection planning problem of the KOMPSAT-2 satellite. Liu et al. [21] combined the neural network method and a heuristic search algorithm, and the result was superior to the existing heuristic search algorithm in terms of the overall profit. Alternatively, without relying on a specific heuristic policy, metaheuristic methods can provide a sufficiently high-quality and universal solution to an optimization problem. For example, Kim et al. [22] proposed an optimal algorithm based on a genetic algorithm for synthetic aperture radar (SAR) imaging satellite constellation scheduling. Niu et al. [23] presented a multi-objective genetic algorithm to solve the problem of satellite areal task scheduling during disaster emergency responses. Long et al. [24] proposed a two-phase genetic algorithm and simulated annealing (GA-SA) hybrid algorithm for the EOSSP, which is superior to the GA or SA algorithm alone. Although metaheuristic algorithms can achieve better results and have been widely adopted for the EOSSP, they easily fall into local optima [25] due to their dependence on one particular mathematical model. Consequently, we turn to deep reinforcement learning (DRL), which is known as a model-free solution and can autonomously build a general task scheduling strategy by training [26]. It has promising potential for application to combinatorial optimization problems [27][28][29].
As a significant research domain in machine learning, DRL has achieved success in game playing and robot control. Recently, DRL has also gained more and more attention in optimization domains. Bello et al. [27] presented a DRL framework using neural networks and a policy gradient algorithm for solving problems modeled as the traveling salesman problem (TSP). Additionally, for the TSP, Khalil et al. [28] embedded a graph in DRL networks to learn effective policies. Furthermore, Nazari et al. [29] gave an end-to-end framework with a parameterized stochastic policy for solving the vehicle routing problem (VRP), which is an extension of the TSP. Peng et al. [30] presented a dynamic attention model with a dynamic encoder-decoder architecture and obtained a good generalization performance in the VRP. Besides the TSP and VRP, another type of optimization problem, resource allocation [25], has also been solved by RL. Khadilkar et al. [31] adopted RL to schedule time resources for a railway system, and it was found that the Q-learning algorithm is superior to heuristic approaches in effectiveness. Ye et al. [32] proposed a decentralized resource allocation mechanism for vehicle-to-vehicle communications based on DRL and improved the communication resource allocation.
Inspired by the above applications of DRL, solving the EOSSP by using DRL has become a feasible solution. Hadj-Salah et al. [33] adopted A2C to handle the EOSSP in order to reduce the time to completion of large-area requests. Wang et al. [34] proposed a real-time online scheduling method for image satellites by importing A3C into satellite scheduling. Zhao et al. [35] developed a two-phase neural combinatorial optimization RL method to address the EOSSP with the consideration of the transition time constraint and image quality criteria. Lam et al. [36] proposed a training system based on RL which is fast enough to generate decisions in near real time.
In the present paper, the EOSSP, formulated as a time-continuous model with multiple constraints, is revised by adopting the deep deterministic policy gradient (DDPG) algorithm, and comparisons with traditional metaheuristic methods are conducted as the task scale increases. The major highlights are summarized as follows:
1. To further enhance the task scheduling efficiency, an improved graph-based minimum clique partition algorithm is introduced as a task clustering preprocess to decrease the task scale and improve the scheduling algorithm's effect.
2. In previous studies, the EOSSP was considered as a time-discrete model when solved by RL algorithms. In this paper, a time-continuous model is established for the EOSSP, which allows the DDPG algorithm to make accurate observation time decisions for each task.
3. Considering practical engineering constraints, comparison experiments were implemented between the RL method and some metaheuristic methods, such as the GA, SA and GA-SA hybrid algorithm, to validate the feasibility of the DDPG algorithm.

Problem Description
As shown in Figure 1, an EOS can maneuver about three axes (roll, pitch and yaw) for transitions between every two sequential observation tasks. Usually, the mobility of the roll angle represents the slewing maneuverability of the EOS. The maneuvering of the pitch angle enables the targets to be observed in advance or with a delay. Observation targets are accessible within a specific VTW, which is determined by the maximum off-nadir angle. The observation window (OW) defines the start and the end time for observing a target in the VTW. Therefore, the task scheduling algorithm enables an EOS to conduct certain operations for the transition between two sequential observation tasks, such as slew maneuvering and payload switching. Simultaneously, observation tasks are restricted to a specific time interval, and the observations must be carried out continuously and completely within the VTW [37]. It is noted that targets outside the observation range are invisible and are regarded as invalid, as shown in Figure 1. To clearly explain the EOSSP, a summary of the most important notations in this paper is given in Table 1.
An EOS could observe multiple targets simultaneously, and the observation task of the merged targets is defined as a clustered task in this study. Task clustering belongs to preprocessing for EOS task scheduling, which has gained more and more attention as it enables an EOS to finish more tasks at the cost of relatively few optical sensor opening times and satellite maneuver times. In contrast to task scheduling without clustering, this strategy could save a lot of energy, especially with frequent observations. In addition, task clustering enables some previously conflicting tasks to be executed at the same time. The condition for merging multiple tasks into a clustering task is that these tasks can be finished with the same slewing angle and OW [38], which constrains the task clustering process.
(1) Time window-related constraint
The longest observation duration $\Delta T$ allowed for a sequential observation is limited because of the characteristics of the sensor, so the VTW of a clustered task must satisfy a corresponding duration constraint. Suppose that clustering task $t^c_u$ is clustered from $\{t_1, t_2, \cdots, t_n\}$. The time window of the clustered task should allow the satellite to finish all the component tasks in a common temporal interval.
(2) Slewing angle-related constraint
The tasks merged into a clustered task must be completable with the same slewing angle. Let $\theta_u$ denote the slewing angle when observing $t_u$ and $\delta\theta_u$ denote the feasible slewing angle range, as given by Equation (4). For the clustered task $t^c_u$, the slewing angle can be calculated as the mean value of $\Delta\theta^c_u$.
According to the constraints mentioned above, tasks that can be merged need to be screened out, and graph theory is used to build the clustering model. Firstly, we define an undirected graph $G = \langle V, E \rangle$, where $V$ is the set of vertexes and $V(G_i)$ represents all valid observation tasks in the $i$th orbit, and $E$ is the set of edges and $E(G_i)$ denotes the links between two tasks. In the graph clustering model, any two original observation tasks connected by an edge satisfy the constraint conditions and can be regarded as a clustering task. Extending this to multiple vertexes, multiple original tasks can be merged into one clustering task if there is an edge between every two of them. Such vertexes form a clique, in which all vertexes are connected with each other. As shown in Figure 2, $\{t_3, t_4, t_5\}$, $\{t_6, t_7\}$ and $\{t_8, t_9, t_{10}, t_{11}\}$ can be seen as cliques, and each clique is regarded as a clustering task.
In this paper, an adjacency matrix is adopted to represent the graph clustering model. The original tasks $\{t_1, t_2, \cdots, t_n\}$ can be described by a set of vertexes $V = \{v_1, v_2, \cdots, v_n\}$. Consequently, the graph clustering model can be represented by the adjacency matrix $A_{n \times n}$. If $t_u$ and $t_v$ meet the clustering constraint conditions, the relationship between the two tasks can be described as $(v_u, v_v) \in E(G)$ and the matrix element $A_{uv} = 1$; otherwise $A_{uv} = 0$. Finally, the adjacency matrix $A_{n \times n}$, consisting of 0 and 1, forms the graph clustering model $E(G_i)$ in the $i$th orbit.
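The two clustering constraints can be written out explicitly. The block below is a plausible reconstruction from the surrounding definitions, not the paper's own equations: the window bounds $ws_k$ and $we_k$ of component task $t_k$ are assumed notation, and $\delta\theta_k$ is assumed to be the full width of the feasible angle range centred on $\theta_k$.

```latex
% Hedged reconstruction; ws_k, we_k and the interval form of the angle range are assumptions.
\begin{aligned}
ws^c_u &= \min_{1 \le k \le n} ws_k, \qquad we^c_u = \max_{1 \le k \le n} we_k, \\
we^c_u - ws^c_u &\le \Delta T, \\
\Delta\theta^c_u &= \bigcap_{k=1}^{n} \left[\theta_k - \tfrac{\delta\theta_k}{2},\ \theta_k + \tfrac{\delta\theta_k}{2}\right] \ne \emptyset, \\
\theta^c_u &= \tfrac{1}{2}\left(\min \Delta\theta^c_u + \max \Delta\theta^c_u\right).
\end{aligned}
```

Under this reading, two tasks share an edge in $G$ exactly when their merged window fits within $\Delta T$ and their feasible angle ranges overlap.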

Scheduling Model
In this paper, a time-continuous resource allocation model for the EOSSP is established. Continuous decision variables are introduced to represent the observation start time within each VTW and decision variables are defined to check whether tasks are scheduled or not. Figure 3 gives a task sequential execution description in one orbit, where TWS and TWE stand for the start and end time of the VTW of an observation task, respectively, d represents the observation duration time of a task, tranT and s represent slewing angle maneuver time and preparation time, respectively, and ObvS and ObvE are the observation start time and end time.

Constraint Conditions
In this paper, the VTW is seen as the allocated resource, and the OWs for tasks are continuous decision variables that decide when to start the observation. The solution of the EOSSP model aims to schedule an observation sequence and maximize the observation profit, subject to the corresponding constraints. In practical engineering scenarios, the following constraints are usually taken into account [24]:
(1) VTW constraint
The VTW constraint ensures that each observation task is executed within the VTW of the EOS in the observation process. Here, $N$ is the number of tasks, $d_u = ObvE^i_u - ObvS^i_u$ represents the observation duration of task $t_u$, $ObvS^i_u$ is the observation start time of $t_u$ in the $i$th orbit, and $ObvE^i_u$ is the observation end time. $TWS^i_u$ and $TWE^i_u$ represent the start and end time of the VTW for task $t^i_u$.
(2) Conflict constraint for task execution
The conflict constraint for task execution means that there is no overlap between any two tasks, as the optical sensor cannot perform two observation tasks at the same time. Here, $x_{uv}$ is a decision variable that denotes whether execution moves from task $t_u$ to task $t_v$; $x_{uv} = 1$ means that $t_v$ is executed after $t_u$.
(3) Task conversion time constraint
Between any two sequential tasks, enough preparation time is required, mainly including slewing maneuvering time and sensor shutdown-restart setup time [38]. For all $u, v \in N$ with $u < v$, $s_{uv}$ is the preparation time for restarting the sensor and $tranT_{uv}$ is the slewing maneuver time from task $t_u$ to $t_v$. The slewing maneuver time is calculated from $\theta_u$ and $\theta_v$, the observation slewing angles of $t_u$ and $t_v$, and $v_s$, the angular velocity of the satellite slewing maneuver.
(4) Optical sensor boot time constraint
According to the power constraint of the optical payload, the observation time of a task cannot exceed the maximum operating time of the optical sensor.
(5) Storage size constraint
The data stored in one orbit is limited by the total storage size of the satellite, where $M$ represents the total data storage capacity of the satellite in one orbit and $c_i$ is the storage consumption per unit observation time in one orbit.
(6) Power consumption constraint
In each orbit, the energy consumed is limited by the maximum capacity, and the energy consumed by sensor operation and slewing maneuvers is mainly considered in this paper. Here, $e_i$ represents the energy consumption per unit time of observation operation, $\varepsilon_{uv}$ represents the energy consumption per unit time of the slewing maneuver from $t_u$ to $t_v$, and $E$ is the total energy available for observation activities in one orbit.
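For reference, the six constraints can be sketched in symbols. This is a hedged reconstruction from the definitions above rather than the paper's exact formulas: the binary selection variable $x_u$ and the maximum sensor operating time $T_{\max}$ are assumed names, and the precise summation ranges are a guess.

```latex
% Hedged reconstruction; x_u and T_max are assumed symbols.
\begin{aligned}
&\text{(1)}\quad TWS^i_u \le ObvS^i_u, \qquad ObvE^i_u = ObvS^i_u + d_u \le TWE^i_u, \\
&\text{(2)}\quad [ObvS^i_u, ObvE^i_u] \cap [ObvS^i_v, ObvE^i_v] = \emptyset \quad (u \ne v,\ \text{both scheduled}), \\
&\text{(3)}\quad x_{uv}\left(ObvE^i_u + tranT_{uv} + s_{uv}\right) \le ObvS^i_v, \qquad
   tranT_{uv} = \frac{|\theta_v - \theta_u|}{v_s}, \\
&\text{(4)}\quad d_u \le T_{\max}, \\
&\text{(5)}\quad \sum_{u=1}^{N} x_u\, d_u\, c_i \le M, \\
&\text{(6)}\quad \sum_{u=1}^{N} x_u\, d_u\, e_i + \sum_{u<v} x_{uv}\, tranT_{uv}\, \varepsilon_{uv} \le E.
\end{aligned}
```

Constraint (3) is written so that it is trivially satisfied when $x_{uv} = 0$, assuming non-negative start times.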

Optimization Objectives
Models of observation satellite scheduling are always built as multiple objective optimization problems, and a scheduling algorithm aims to generate a compromise solution between objectives. Tangpattanakul et al. [39] implemented an indicator-based multiobjective local search method for the EOSSP, whose objectives were to maximize the total profit and simultaneously to ensure the fairness of resource allocation among multiple users. Sometimes, energy balance and fuel consumption are designed as optimization objectives [40,41].
In this paper, to maximize the total observation profit, more tasks and tasks with higher priority should be scheduled. Hence, the objective function $f$ is designed to maximize the total profit, defined as the sum of the priorities associated with the selected tasks.
This optimization objective function is subject to the constraint model mentioned above.
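As an illustration of how the objective interacts with the constraints, the following minimal sketch sums the priorities of a candidate execution sequence and checks only the VTW and task conversion time constraints. The `Task` fields, the `slew_rate` and `setup` defaults, and the choice to score an infeasible sequence as zero are assumptions made for this example, not values or rules from the paper.

```python
from dataclasses import dataclass


@dataclass
class Task:
    priority: float   # task priority (profit contribution)
    tws: float        # VTW start time, s
    twe: float        # VTW end time, s
    obv_s: float      # chosen observation start time, s
    duration: float   # observation duration d_u, s
    angle: float      # observation slewing angle, deg

    @property
    def obv_e(self) -> float:
        return self.obv_s + self.duration


def is_feasible(seq, slew_rate=1.0, setup=5.0):
    """Check VTW and conversion-time constraints for tasks in execution order."""
    for k, t in enumerate(seq):
        if t.obv_s < t.tws or t.obv_e > t.twe:
            return False                      # observation leaves its VTW
        if k > 0:
            prev = seq[k - 1]
            tran = abs(t.angle - prev.angle) / slew_rate
            if prev.obv_e + tran + setup > t.obv_s:
                return False                  # not enough time to slew and restart
    return True


def total_profit(seq):
    """Sum of priorities of the scheduled tasks; 0 if the sequence is infeasible (assumed rule)."""
    return sum(t.priority for t in seq) if is_feasible(seq) else 0.0
```

For example, two tasks with priorities 5 and 3 that leave 20 s between them (against 15 s of slewing plus setup) yield a profit of 8.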

Task Preprocess: Graph Clustering
In Section 2.1, we proposed an undirected graph clustering model $G = \langle V, E \rangle$. According to the previous analysis, clustering tasks can be selected by dividing the graph into independent cliques, aiming to minimize the number of clusters, which is known as the minimum clique partition algorithm [42]. Wu et al. [38,43] improved the clique partition algorithm by considering the priorities of vertices (original tasks) and adopted it in the task clustering phase. In this paper, an improved minimum clique partition algorithm is proposed. The maximum task priority and the minimum observation slewing angle of clustering tasks are taken into consideration simultaneously. This improvement saves energy by maintaining a smaller observation slewing angle, which is significant in real engineering applications.

Graph Model Establishment
The establishment of a graph-based clustering model involves two steps, establishing the adjacency matrix and updating the model, as described below:
(1) Establish the adjacency matrix
All tasks in $V(G_i)$ are traversed, and each pair of original tasks $t_u$ and $t_v$ is checked against the time window constraint. If the time window constraint is satisfied, $A_{uv} = 1$; otherwise $A_{uv} = 0$. After the iteration, the edges $(v_v, v_u)$ are generated, and the initial graph model $G_0$ is obtained.
(2) Update graph model by checking other constraints
Starting from the initial graph $G_0$ built by satisfying the time window constraint, the adjacency matrix elements $A_{uv}$ $(u, v = 1, 2, \cdots, n)$ are searched. If $A_{uv} = 1$, the constraints of the observation time window and observation slewing angle are checked sequentially. Once a constraint condition is not satisfied, $A_{uv}$ is set to 0. Finally, the clustering graph $G$ and the adjacency matrix $A_{n \times n}$ are obtained.
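The two-step construction can be sketched as follows. The concrete tests used here (merged window span no larger than `max_span`, angle difference no larger than `fov`) are simplifying assumptions standing in for the paper's constraint equations, and the tuple layout of `tasks` is hypothetical.

```python
def build_adjacency(tasks, max_span, fov):
    """Build the 0/1 adjacency matrix of the clustering graph.

    tasks: list of (ws, we, theta) tuples, i.e. window start, window end, slewing angle.
    max_span: longest allowed merged window (plays the role of Delta T).
    fov: largest angle difference two tasks may have and still share one slewing angle.
    """
    n = len(tasks)
    A = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            ws_u, we_u, th_u = tasks[u]
            ws_v, we_v, th_v = tasks[v]
            # Step 1 (assumed form): the merged window must fit within max_span.
            time_ok = max(we_u, we_v) - min(ws_u, ws_v) <= max_span
            # Step 2 (assumed form): both targets must be reachable with one angle.
            angle_ok = abs(th_u - th_v) <= fov
            if time_ok and angle_ok:
                A[u][v] = A[v][u] = 1    # undirected edge
    return A
```

Because the graph is undirected, only pairs with `u < v` are tested and the result is mirrored, which keeps the matrix symmetric by construction.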

Clique Partition Algorithm
Based on the graph model, each independent cluster represents a clustering task. The purpose of the clique partition algorithm is to minimize the number of clustered tasks while ensuring that more original tasks are contained in each divided clique. The algorithm is described as follows. Firstly, the edge $e_{uv}$ with the largest number of common neighbors in the edge set $E(G_i)$ is selected. Secondly, among such edges, the one whose merging requires the fewest edges to be deleted is screened out. Thirdly, the edge whose vertices have the larger evaluation parameter $p$ is selected. Finally, the two vertices are merged into a new virtual vertex, and the edges associated with the merged vertices are deleted. The procedure is applied repeatedly to the updated edge set and stops when $E(G_i)$ becomes empty.
In the algorithm, the evaluation parameter $p$ is calculated from $prio_c$ and $\theta_c$, the priority and the minimum slewing angle of the generated clustering task, respectively. The pseudocode of the improved minimum clique partition algorithm process is shown in Algorithm 1:
    Combine the two vertices into $v_{uv}$
    Delete the edges associated with the merged vertex to create a new edge
end
In this paper, an improved clique partition algorithm is adopted by taking the priority and the minimum slewing angle of clustering tasks into consideration. The generated clustering tasks are used as the input of the following DRL algorithm to calculate the scheduling result.
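The edge-merging procedure described in this section can be sketched as below. This is an illustrative reading of the three selection rules as a lexicographic ranking, not a transcription of Algorithm 1; the `score` callable is a hypothetical stand-in for the evaluation parameter $p$, whose formula is not reproduced here.

```python
def clique_partition(n, edges, score=lambda clique: 0.0):
    """Greedy clique partition; returns a list of sets of original vertex ids.

    n: number of vertices (0..n-1); edges: iterable of (u, v) pairs.
    score: hypothetical evaluation of a merged clique (larger is better).
    """
    groups = {i: {i} for i in range(n)}      # live vertex -> original tasks it contains
    nbrs = {i: set() for i in range(n)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    def rank(edge):
        u, v = edge
        common = nbrs[u] & nbrs[v]
        # Edges lost by merging: links to neighbours that are not shared.
        deleted = len(nbrs[u] ^ nbrs[v]) - 2
        return (len(common), -deleted, score(groups[u] | groups[v]))

    while True:
        live = [(u, v) for u in nbrs for v in nbrs[u] if u < v]
        if not live:
            break
        u, v = max(live, key=rank)
        common = nbrs[u] & nbrs[v]
        # Detach both endpoints, then re-attach the merged vertex (kept under id u)
        # to the common neighbours only, so every group stays a clique.
        for w in nbrs[u] | nbrs[v]:
            nbrs[w].discard(u)
            nbrs[w].discard(v)
        del nbrs[v]
        groups[u] |= groups.pop(v)
        nbrs[u] = set(common)
        for w in common:
            nbrs[w].add(u)
    return list(groups.values())
```

On a graph made of a triangle, a separate edge and an isolated vertex, the sketch returns three groups: the triangle, the pair and the singleton.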

Markov Decision Process Model
Deep reinforcement learning is the process by which an agent learns how to make decisions by interacting with a dynamic environment through trial and error. Agents take actions on the environment and receive positive or negative reward feedback.
The Markov decision process (MDP) is the fundamental framework of RL for modeling. One agent (a satellite in this research) chooses an action in the current state, then transfers to the next state and receives a reward. This process can be described by a tuple $M = \langle S, A, P, R, \gamma \rangle$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P$ is a state transition probability matrix, $R$ represents the reward function and $\gamma$ denotes a discount factor ($\gamma \in [0, 1]$).
In the EOSSP, the global state $S$ includes the task state $S_{task}$ and the satellite state $S_{sat}$, so that the global state is $s_t = S = [S_{task}, S_{sat}]$. The task state $S_{task}$ is the collection of start and end times of the VTW, start and end times of the OW, priority and the observation duration. The satellite state $S_{sat}$ is the collection of observation slewing angles. The action space $A$ is the collection of decision variables of each task, and every value range is normalized to $[-1, 1]$. It should be noted that $A$ is not the OW for each task; a mapping function converts each action into the corresponding OW.
The selected OW for task $t_u$ is $[ObvS_u, ObvE_u]$ and could be seen as the global solution for the EOSSP. It is noted that a model-free algorithm does not need any hypothesis or prior knowledge of $P$. The VTW-related state $s_t$ transforms to the next state $s_{t+1}$ after an OW-related action $a_t$ occurs. This process happens continuously within a finite time, and an immediate reward $r_t$ is obtained in each transition step, as shown in Figure 4.
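To make the relation between a normalized action and an OW concrete, the sketch below uses a linear mapping that slides the observation start across the slack left in the VTW. The linear form and the function name `action_to_window` are assumptions for illustration; the paper's own mapping function may differ.

```python
def action_to_window(a, tws, twe, d):
    """Map an action a in [-1, 1] to an OW [obv_s, obv_e] inside the VTW [tws, twe].

    a = -1 places the observation at the start of the VTW and a = +1 at the
    latest start that still fits the duration d (assumed linear mapping).
    """
    a = max(-1.0, min(1.0, a))            # clip to the action range
    slack = max(twe - tws - d, 0.0)       # room available to shift the start time
    obv_s = tws + 0.5 * (a + 1.0) * slack
    return obv_s, obv_s + d
```

With a VTW of [100, 160] s and a 10 s observation, an action of 0 gives the window [125, 135] s.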

Optimization with DDPG
In 2015, Mnih et al. [44] proposed the first successful deep Q-network (DQN) framework for Atari games, with the main idea of mapping the state to the action-value function by deep networks. However, the action taken is given by $a_t = \arg\max_a Q(s_t, a)$, which means the DQN can only handle discrete action space problems. In the optimization research domain, in terms of the TSP model or VRP model problems, the agent is designed to make sequential discrete decisions, and the DQN has been applied successfully [27][28][29].
Contrary to the DQN, the main idea of the DDPG is to map the state to the policy (the specific action taken) directly. Therefore, the DDPG can produce continuous decision variables and has advantages in high-dimensional problems [45]. In this paper, the EOSSP is modeled as a time-continuous resource allocation problem, where the VTW is the crucial resource. The policy decides a specific time for each observation task, and the action space is continuous, so the problem is suitable for the DDPG solution.
The DDPG algorithm conforms to the actor-critic framework, which includes the actor network and the critic network. The actor network is represented by parameters $\theta^{\mu}$ and outputs the strategy action according to the current state. The critic network evaluates the current strategy by calculating the value function, and its parameters are denoted by $\theta^{q}$. Figure 5 illustrates the DDPG algorithm. The actor network outputs an action from a continuous action space, converting the state into the action $a = \mu(s)$. In the critic network, the output $Q(s, \mu(s))$ is learned by using the Bellman equation and represents the approximation of the discounted total reward. In every step of the optimization training process, the actor network is improved by computing the gradient of the $Q(s, \mu(s))$ function, which can be calculated by applying the chain rule [45].
In order to ensure the convergence of the scheduling network, target networks are adopted to update parameters periodically. Correspondingly, $\theta^{\mu'}$ and $\theta^{q'}$ are defined to represent the parameters of the target actor network and the target critic network.
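For reference, the critic and actor updates described above take the following standard form in the DDPG formulation of [45], written here in this paper's parameter notation. The batch size $N$ and the sample index $i$ follow the usual minibatch convention and are not taken from the missing equations.

```latex
% Standard DDPG updates (per the general algorithm in [45]); notation adapted to theta^q, theta^mu.
\begin{aligned}
y_i &= r_i + \gamma\, Q'\!\left(s_{i+1},\ \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{q'}\right), \\
L(\theta^{q}) &= \frac{1}{N} \sum_{i} \left(y_i - Q(s_i, a_i \mid \theta^{q})\right)^2, \\
\nabla_{\theta^{\mu}} J &\approx \frac{1}{N} \sum_{i}
  \nabla_{a} Q(s, a \mid \theta^{q})\big|_{s = s_i,\ a = \mu(s_i)}\
  \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_i}.
\end{aligned}
```

The last line is the chain-rule step: the critic's sensitivity to the action is propagated back through the actor's parameters.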

Task Scheduling Method
As an off-policy DRL algorithm, the DDPG allows us to train the EOS task scheduling network without knowing the prior information. Moreover, it is noted that in the training process, valid observation tasks are sorted according to the start time of the VTW.
(1) Network architecture
As mentioned above, the model consists of two separate networks, the actor and the critic. Figure 6 shows the network architecture.
Figure 6. Architecture of DDPG actor and critic network.
The input of the network is a sequence of state information (denoted as $s$), including the start and end time of the VTW, start and end time of the OW, priority and the observation duration. L2-normalization is applied to the state input layer, and the output of the architecture has two parts: an estimated Q-value of the total expected profit (denoted as $Q(s, a \mid \theta^{q})$) as the critic and the policy (denoted as $\mu(s \mid \theta^{\mu})$) for the task as the actor. The output value of the actor network is mapped to $[-1, 1]$ by applying the non-linear activation function Tanh, which has the same value range as the action space in Equation (19).
(2) Reward function design
In the task scheduling problem, the EOS must guarantee that high-priority observation tasks in the execution sequence can be carried out. Simultaneously, to meet actual requirements, conflicting tasks are not permitted. Therefore, the reward function, denoted by $r_t$, is modeled to supervise the task scheduler to achieve an optimal result. The optimization objective function $f$ was given in Equation (13) of Section 2.2. To acquire the largest global profit, the reward function is built on this objective and includes an amplification factor $\lambda$, which increases the reward of a good action and accelerates the training of the policy network.
(3) Training method
In the training process, the scheduling network decides when to start observing targets, and then receives an instant reward based on the reward function. To account for the future reward given by the current policy, an accumulated reward with a discount factor $\gamma$ is used to estimate the Q-value in the critic network. Meanwhile, the actor learns to make an optimal scheduling policy based on the actor network. During each training step, the critic improves its prediction ability by gradient descent on the error between the actual profit and the estimated profit. The actor updates its parameters based on the prediction from the critic. To improve exploration efficiency, noise is introduced into the decision. The noise obeys the normal distribution, where $\alpha$ is the attenuation coefficient and $\sigma$ is the standard deviation. At the end of each training episode, a soft synchronization method is adopted to update the network parameters, where $\tau$ is the coefficient of synchronization, which is used to make the target networks slowly track the learned networks, significantly improving the stability of learning [45].
The pseudocode for the scheduling algorithm is shown in Algorithm 2:
Algorithm 2: Task scheduling method based on DDPG
Initialize the actor $\mu(s_t \mid \theta^{\mu})$ and the critic $Q(s, a \mid \theta^{q})$ with weights $\theta^{\mu}$, $\theta^{q}$
Initialize the actor target $\mu'$ and the critic target $Q'$ with weights $\theta^{\mu'} \leftarrow \theta^{\mu}$, $\theta^{q'} \leftarrow \theta^{q}$
Initialize experience replay buffer and batch
while episode < max_episode do
    while step < max_step do
        Select $a_t$ based on the actor network $a_t = \mu(s_t \mid \theta^{\mu})$
        Add noise
        Update new state $s_{t+1}$
        Calculate instant reward $r_t$
        Store $[s_t, a_t, r_t, s_{t+1}]$ in experience replay buffer
        Sample an N-size batch $[s_i, a_i, r_i, s_{i+1}]$ from buffer randomly
        Estimate target Q-value
        Calculate the critic network loss function
        Calculate the gradient of the critic network
        Calculate the gradient of the actor network
        Update $\theta^{q}$ and $\theta^{\mu}$ by ADAM optimizer [46]
        Update the target networks by a soft synchronization method with parameter $\tau$
    end
end
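Two steps of the training loop, the exploration noise and the soft synchronization, can be sketched without a deep learning framework. The soft update below is the usual form $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$; the noise schedule (a Gaussian scaled by `alpha ** episode`) and the default values of `sigma` and `alpha` are assumptions, since the paper's noise equation is not reproduced here. Parameters are plain lists of floats for simplicity.

```python
import random


def soft_update(target, source, tau):
    """In-place soft synchronization: target <- tau * source + (1 - tau) * target."""
    for k in range(len(target)):
        target[k] = tau * source[k] + (1.0 - tau) * target[k]


def noisy_action(a, episode, sigma=0.2, alpha=0.99, rng=random):
    """Add Gaussian exploration noise that decays with the episode index.

    sigma is the standard deviation and alpha the attenuation coefficient;
    the multiplicative decay alpha ** episode is an assumed schedule.
    """
    noise = rng.gauss(0.0, sigma) * (alpha ** episode)
    return max(-1.0, min(1.0, a + noise))   # keep the action in [-1, 1]
```

A small `tau` means the target parameters move only a fraction of the way toward the learned parameters at each synchronization, which is what makes the target networks track slowly.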

(4) Resolving conflicts
After the OW is determined by the DDPG, conflicts may still remain. Constraint checking and conflict resolving are therefore performed: breadth-first search (BFS) is adopted to check the task sequence and update the list $\{x_1, x_2, \ldots, x_N\}$ to binary values such as $\{0, 1, 1, 0, \cdots\}$. After resolving conflicts, the profit obtained in the current step is calculated according to Equation (13).
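The effect of conflict resolution on the selection list can be illustrated with a simplified stand-in. The sketch below replaces the BFS check with a single greedy pass over tasks sorted by observation start time, keeping a task only if it leaves enough slewing and setup time after the last kept task. The tuple layout and the `slew_rate` and `setup` defaults are hypothetical.

```python
def resolve_conflicts(tasks, slew_rate=1.0, setup=5.0):
    """Greedy conflict resolution (a simplification, not the paper's BFS).

    tasks: list of (obv_s, obv_e, theta, priority), sorted by obv_s.
    Returns (x, profit), where x[k] = 1 if task k is kept and profit is the
    sum of priorities of the kept tasks.
    """
    x = [0] * len(tasks)
    profit = 0.0
    last = None                               # (end time, angle) of last kept task
    for k, (s, e, theta, prio) in enumerate(tasks):
        if last is not None:
            last_e, last_theta = last
            ready = last_e + abs(theta - last_theta) / slew_rate + setup
            if s < ready:
                continue                      # conflicts with the last kept task
        x[k] = 1
        profit += prio
        last = (e, theta)
    return x, profit
```

A first-come pass like this can discard a high-priority task in favor of an earlier low-priority one, which is one reason a search over alternative sequences may be preferable.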

Simulation Scenario
In this paper, a typical engineering scenario is taken into consideration, where the observation targets were generated randomly, with different task numbers, in the region between 40° and 45° N latitude and 117° and 130° E longitude, as shown in Figure 7. The simulation scenario was implemented using the Systems Tool Kit (STK), and six typical orbital elements of an LEO observation satellite were selected, as listed in Table 2.

Parameters                      Value
Semi-major axis of orbit a      7000 km
Orbital eccentricity e
In addition, 50, 100, 150 and 200 original tasks were generated to simulate practical users' requests. The different task numbers represent the increasing complexity of the EOSSP. In the comparison experiment groups, the performance and scalability of the proposed DDPG algorithm were validated and discussed as the problem scale increased.
The resource-related constraints, including the VTW, slewing angle maneuverability, total energy and total memory constraints, were taken into account, and the associated constants are summarized in Table 3.
Table 3. Constraint conditions in the task scheduling.
It is emphasized that tasks waiting to be scheduled were arranged in chronological order according to the start time of the VTW. Meanwhile, tasks were set up with different observation durations, ranging from 5 s to 15 s, and different priorities, ranging from 1 to 10.

Results and Discussion
According to the analysis in Section 3, task clustering is an effective approach for promoting scheduling efficiency and saving EOS resources. In the task clustering preprocess phase, the proposed minimum clique partition algorithm consists of two steps. Firstly, it establishes the clustering graph by considering the constraint conditions, and then it partitions the original tasks into a minimum number of cliques in order to update the task execution sequence. Figure 8 shows the clique partition result of 50 original tasks, where $\{t_1, t_2, t_3, t_4\}$ are inaccessible. Interconnected vertexes such as $\{t_{29}, t_{34}, t_{37}\}$ can be seen as a merged task.
Furthermore, other groups of comparisons with 100, 150 and 200 original tasks were simulated, as shown in Figure 9. The results indicate that, for 50, 100, 150 and 200 original tasks, 4, 14, 19 and 25 invalid tasks are eliminated and 36, 70, 102 and 133 clustering tasks are generated, respectively. Moreover, the task clustering running time is less than 0.5 s. Thus, the improved graph-based minimum clique partition algorithm achieves the desired objective and reduces the task scale markedly and quickly.
On the basis of the clustering results, the performance of the RL-based algorithm is examined, where the PyTorch deep learning framework is used to implement the scheduling networks. Table 4 gives the hyperparameters adopted in the training process. The DRL algorithm runs on a device with an Nvidia GTX1660 GPU and an Intel i5-8400 CPU. In each training episode, there are 100 exploratory steps, and the maximum number of training episodes is 400, which means the total amount of experience is 40,000. The experience memory pool is kept updated and has a size of 3000. In each training step, experiences are randomly sampled from the memory pool with a batch size of 32. Parameters of the actor and the critic networks are initialized randomly before training, and the models are trained with the ADAM optimizer [46].
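The experience memory pool described above behaves like a fixed-size queue with uniform random sampling. A minimal sketch, using the stated pool size of 3000 and batch size of 32 as defaults, is given below; the class and method names are illustrative and not taken from the paper's implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size experience pool with uniform random sampling."""

    def __init__(self, capacity=3000, seed=None):
        self.data = deque(maxlen=capacity)   # the oldest experience is evicted first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        # Sample without replacement; return fewer items while the pool is filling.
        k = min(batch_size, len(self.data))
        return self.rng.sample(list(self.data), k)

    def __len__(self):
        return len(self.data)
```

With 400 episodes of 100 steps, 40,000 experiences pass through the pool, but only the most recent 3000 are available for sampling at any time.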
The task scheduling profit p proposed in Section 2.2 is selected as the evaluation index of the training performance. As shown in Figure 10, the episode maximum profit (gray curves) and average profit (blue curves) are used to demonstrate the algorithm's performance in each training episode. The trendline (red curves) indicates that the profit achieved by the task scheduling network rises in every simulation, which suggests that the proposed method works. The scheduling profit fluctuates upward with increasing episodes, which shows that the network can learn from experience and reach a better and more stable scheduling policy with higher profit. Additionally, the trained DRL scheduling network gives a solution with an observation profit of 3.25 for 50 original tasks, whereas the profit for 100, 150 and 200 original tasks decreases to 2.31, 1.83 and 1.56, respectively. Therefore, the achievable profit is strongly influenced by the number of original tasks, which represents the complexity of the EOSSP. The most likely reason is that the observation tasks are limited by the VTW, energy, storage and other constraint conditions, so task scheduling becomes increasingly difficult as the number of tasks grows. A compromise solution is to ensure that tasks with higher priority are executed. Figure 11 shows the observation time periods selected by the DDPG method in the task scheduling phase for 50 original tasks. The horizontal axis represents time and the vertical axis represents the different clustered observation tasks. The left panel shows the initial state, where all tasks are placed at the start of their VTWs. The scheduling result is shown in the right panel, where valid tasks (marked in green) are executed and invalid tasks (marked in red) are not performed because of constraint conflicts.
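To express the dependence of profit on problem size in relative terms, the following short sketch computes the percentage decrease of the reported profit with respect to the 50-task case. The profit values are those reported above; the variable names are illustrative, and the choice of the 50-task case as the baseline is an assumption made for this example.

```python
# Observation profits reported in the text, keyed by number of original tasks.
reported_profit = {50: 3.25, 100: 2.31, 150: 1.83, 200: 1.56}

# Assumed baseline for the comparison: the 50-task case.
baseline = reported_profit[50]

# Relative decrease with respect to the baseline, in percent (one decimal).
relative_drop = {
    n: round(100.0 * (baseline - p) / baseline, 1)
    for n, p in reported_profit.items()
}

for n, drop in relative_drop.items():
    print(f"{n:>3} tasks: profit {reported_profit[n]:.2f}, drop {drop:.1f}%")
```

With these figures, the profit at 200 original tasks is roughly half of the 50-task value.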
Moreover, a series of comparisons with the genetic algorithm (GA), the simulated annealing algorithm (SA) and the GA-SA hybrid algorithm is performed to assess the advantage of the DDPG method. Note that the GA-SA hybrid algorithm has been validated in our previous work [24]. In addition, the DDPG without task clustering is simulated to isolate the effect of the preprocessing step; it is denoted NTC-DDPG. Correspondingly, the DDPG with task clustering is denoted TC-DDPG.
The comparison results are given in Figure 12. The TC-DDPG method consistently gives a better optimization result than the other methods. Interestingly, for 50 tasks NTC-DDPG has a relatively low profit, but as the number of tasks increases, NTC-DDPG exceeds the non-DRL methods, even though it does not take task clustering into account. This indicates that the DDPG can contribute to a high EOS efficiency. It should be pointed out that the non-DRL methods also include the task clustering preprocessing step. Additionally, the traditional optimization algorithms achieve a worse profit as the number of original tasks increases, and the SA algorithm even fails to work in the 150- and 200-task cases. Hence, the DRL method has practical advantages when addressing a large-scale EOSSP. The results shown in Table 5 illustrate the feasibility of the proposed RL method, which is rather competitive in the EOSSP, with good profit performance and adaptability to practical applications. In addition, the task clustering algorithm greatly improves the DDPG algorithm, yielding a higher observation profit and reducing the running time, with a preprocessing time of less than 0.5 s.

Conclusions
The observation satellite task scheduling policy plays a crucial role in providing high-quality space-based information services. In previous studies, many algorithms based on traditional optimization methods such as GA and SA have been successfully applied to the EOSSP. However, these methods depend upon a mathematical model, and as the task scale increases, they may fall into local optima. In this paper, the EOSSP is considered as a time-continuous model with multiple constraints and, inspired by the progress of DRL and its model-free characteristics, a DRL-based algorithm is proposed to solve the EOSSP. In addition, to decrease the complexity of the solution, an improved graph-based minimum clique partition algorithm is proposed for the task clustering preprocessing step; this is a relatively new attempt in handling EOSSP optimization. The simulation results show that the DDPG algorithm combined with the task clustering process is practicable and achieves the expected performance. In addition, this solution has a higher optimization performance than traditional metaheuristic algorithms (GA, SA and the GA-SA hybrid algorithm). In terms of scheduling profit, the experimental results indicate that the DDPG is feasible and efficient for the EOSSP even in a relatively large-scale situation.
Note that satellite constellation task scheduling problems were not addressed in the present work. In a future study, we will attempt to adopt multi-agent DRL methods to study the multi-satellite EOSSP.