Deep Reinforcement Learning for the Agile Earth Observation Satellite Scheduling Problem

Abstract: The agile earth observation satellite scheduling problem (AEOSSP) is a combinatorial optimization problem with time-dependent constraints. Many construction heuristics and meta-heuristics have recently been proposed; however, existing methods cannot balance the requirements of efficiency and timeliness. In this paper, we propose a graph attention network-based decision neural network (GDNN) to solve the AEOSSP. Specifically, we first represent the tasks and the time-dependent attitude transition constraints as a graph. We then describe the problem as a Markov decision process and perform feature engineering. On this basis, we design a GDNN to guide the construction of the solution sequence and train it with proximal policy optimization (PPO). Experimental results show that the proposed method outperforms construction heuristics in scheduling profit by at least 45%. The proposed method also approximates the profits of the state-of-the-art method with an error of less than 7% while markedly reducing scheduling time. Finally, we demonstrate the scalability of the proposed method.


Introduction
Agile earth observation satellites (AEOSs) are a new generation of earth observation satellites (EOSs) with three degrees of freedom: roll, pitch, and yaw. With an extensive observation range, long observation time, and no terrain limitations, AEOSs play an important role in weather forecasting, disaster warning, environmental protection, ground mapping, and maritime search and rescue. Compared with a traditional EOS, which has only roll capability, an AEOS has a longer visible time window (VTW) for ground target observation. The observation window (OW) represents the actual observation time of a task, whose length is the observation duration requested by the user. The OW can be any period within the VTW that guarantees the integrity of the observation process, which makes the solution space of the AEOSSP large. When observing two targets in succession, the AEOS must perform an attitude transition, and because the AEOS attitude depends on the start and end times of the OW, the attitude transition time between two tasks is variable and time-dependent. The agile earth observation satellite scheduling problem (AEOSSP) requires determining the task observation sequence and the OW of each task while satisfying observation integrity, attitude transition constraints, and other hard constraints of the satellite, such as memory and power consumption. Therefore, the AEOSSP is a typical combinatorial optimization problem with complex constraints and has been shown to be NP-hard [1].
With the expansion of AEOS application fields, observation requests have become frequent and observation requirements diverse. However, even with better observation capabilities, AEOSs remain a scarce resource that cannot satisfy the high demand for observations. In addition, some emergencies, such as earthquakes and floods, require satellites to complete observations as soon as possible. Therefore, a fast and efficient scheduling algorithm is essential to improve the utilization rate of satellites.
In recent decades, many scholars have studied the AEOSSP. Lemaître et al. [2] were the first to research the AEOSSP and described it as a combinatorial optimization problem considering the selection and scheduling of observation tasks. Due to the complexity of the problem, few exact methods have been proposed. Wang et al. [3] proposed a mixed-integer programming model for the AEOSSP and reduced the problem complexity by discretizing the continuous observation angle into three angles. They obtained an approximate upper bound of the problem with CPLEX. Chu et al. [4] designed an implicit enumeration algorithm that constructs a solution under a depth-first search framework and includes three pruning strategies. All exact methods for the AEOSSP simplify the time-dependent constraint. In addition, ref. [5] showed that solutions cannot be obtained in an acceptable time with the CPLEX solver when the number of VTWs exceeds 27. Current research on combinatorial optimization problems primarily focuses on heuristics and meta-heuristics. Heuristic and meta-heuristic algorithms can solve large-scale problems and are widely used in practice. A well-designed algorithm can significantly improve efficiency [6-8]. For the AEOSSP, Lemaître et al.
[2] proposed four heuristics: a greedy algorithm, dynamic programming, a constraint programming algorithm, and a local search algorithm. There are also several profit-based construction heuristics [3,9] and an iterated local search [10]. Meta-heuristics include the tabu search algorithm [11,12], the hybrid differential evolution algorithm [13], improved genetic algorithms [14-17], and the adaptive large neighbourhood search algorithm [5,18]. However, the search difficulty and solution time of these algorithms increase dramatically as the problem scale grows. Traditional heuristic and meta-heuristic algorithms therefore cannot meet the requirements of high efficiency and fast response in practical applications. In addition, these rule-based algorithms rely heavily on the designer's experience, which limits solution quality. Traditional methods are thus constrained by their solution characteristics and cannot produce high-quality results in a timely manner.
In recent years, deep reinforcement learning (DRL) has been applied to many classical combinatorial optimization problems, such as the travelling salesman problem (TSP) and the vehicle routing problem (VRP). Vinyals et al. [19] first proposed pointer networks (PNs) to solve the TSP; their model follows the traditional sequence-to-sequence (Seq2Seq) structure. Bello et al. [20] then used policy gradient and actor-critic algorithms to train the PN model, which can obtain approximately optimal solutions for TSPs with 100 tasks. Nazari et al. [21] proposed an end-to-end framework based on DRL that divides the problem input into static and dynamic inputs to solve the VRP with dynamic characteristics. Joshi et al. [22] proposed a graph pointer network (GPN) to solve the TSP. Additionally, DRL methods have been used in several classical real-world problems [23,24].
The literature highlights the strong potential of DRL for combinatorial optimization problems, and some scholars have performed related studies. Chen et al. [25] proposed an end-to-end DRL framework for the AEOSSP as the first attempt to apply DRL to this problem. Zhao et al. [26] proposed a two-phase neural combinatorial optimization method with reinforcement learning, using neural combinatorial optimization to determine the observation sequence and a deep deterministic policy gradient-based algorithm to determine the start times of the tasks. Wei et al. [27] proposed a deep reinforcement learning and parameter transfer-based approach (RLPT) to solve a multiobjective AEOSSP. All these methods simplify the time-dependent attitude transition time constraint, yet this is one of the most important constraints of the AEOSSP. To better represent this constraint, we propose to model the AEOSSP as a graph, with edges representing the attitude transition time, and to solve the problem with a graph neural network (GNN).
In this paper, we present a graph-based DRL method for the AEOSSP that differs from existing methods. The primary contributions of this study are as follows: (1) We model the AEOSSP with time-dependent attitude transition times as a graph, which represents tasks and their relationships more accurately through nodes and edges.
(2) Based on the graph model of the AEOSSP, we design its Markov decision process (MDP) solution process and propose a graph attention network (GAT)-based decision neural network (GDNN) to represent the policy, which is trained by an RL method.
(3) We design extensive experiments to demonstrate the effectiveness and timeliness of the proposed method by comparing it with specific competitors. In addition, we perform a model study to verify the structure and generalization of the GDNN.
The remainder of this paper is organized as follows. Section 2 presents the graph formulation of the AEOSSP, the attitude transition time constraint, and the reformulation of the AEOSSP. Section 3 describes the proposed method, including the GDNN and the training method. Section 4 provides the computational experiments and analysis. Finally, Section 5 concludes the article and presents suggestions for future research.

Parameter
The parameters used in this paper are summarized in Table 1.

Parameter: Description

n_tsk: the number of tasks.
i, j: the indices of tasks, i, j = 0, 1, ..., n_tsk; 0 denotes a virtual task.
pri_i: the profit of task i; pri_0 = 0.
θ_{i,t}: the roll angle of the satellite toward task i at time t.
φ_{i,t}: the pitch angle of the satellite toward task i at time t.
ψ_{i,t}: the yaw angle of the satellite toward task i at time t.
wb_i: the VTW start time for task i; wb_0 = 0.
we_i: the VTW end time for task i; we_0 = 0.
tb_i: the OW start time of task i; tb_0 = 0 is the initial state time.
te_i: the OW end time of task i; te_0 = 0 is the initial state time.
ct_i: the observation duration of task i; ct_0 = 0.
ba_i: the satellite attitude at tb_i, determined by the roll θ_{i,tb_i}, pitch φ_{i,tb_i}, and yaw ψ_{i,tb_i}; ba_0 is the initial attitude of the satellite.
ea_i: the satellite attitude at te_i, determined by the roll θ_{i,te_i}, pitch φ_{i,te_i}, and yaw ψ_{i,te_i}; ea_0 is the initial attitude of the satellite.
ρ_ij: the attitude transition angle between task i and task j.
trans(ea_i, ba_j): the attitude transition time between task i and task j.
x_ij: binary decision variable indicating whether task i is the former task of task j.

2.2. Mathematical Formulation

tb_i + ct_i = te_i, ∀i ∈ {0, 1, 2, ..., n_tsk}   (3)
x_ii = 0, ∀i ∈ {0, 1, 2, ..., n_tsk}   (8)
x_ij ∈ {0, 1}, ∀i, j ∈ {0, 1, 2, ..., n_tsk}   (9)

Equation (1) is the optimization objective function of the AEOSSP, which maximizes the sum of the profits of completed tasks; Equation (2) represents the time window constraint of the satellite, meaning each task must be observed within its VTW; Equation (3) represents the relationship between the start time, end time, and duration of a task; Equation (4) represents the time-dependent transition time constraint between tasks; Equation (5) represents the power consumption constraint of the satellite; Equation (6) indicates that each task has at most one former task; Equation (7) indicates that each task has at most one latter task; Equation (8) indicates that a task can be neither its own former task nor its own latter task; Equation (9) defines the domain of the decision variable.

Time-Attitude Adjacency Graph of the AEOSSP
We model the AEOSSP as an adjacency graph, which is a typical class of directed acyclic graph. When the satellite passes directly over the target (i.e., at the overhead moment), the pitch angle of the satellite is 0. We introduce time-attitude coordinates to represent the VTW of each task. As shown in Figure 1, the timeline of satellite operation is the x-axis, and the roll angle of the satellite is the y-axis. A task is a node in the graph whose coordinates (t_i^side, θ_i^side) consist of the satellite overhead moment and the corresponding roll angle. Each node has seven attributes: pri_i, wb_i, we_i, tb_i, te_i, ba_i, and ea_i. The edge weight w_ij = trans(ea_i, ba_j) between node i and node j indicates the transition time between task i and task j. The optimization objective is to find a path that begins at virtual node 0 and satisfies all constraints while maximizing the sum of the profits of all nodes on the path.
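As a minimal sketch of this time-attitude adjacency graph, the code below builds nodes and transition-time edges. The `TaskNode` fields mirror the node attributes above, but the constant-rate `transition_time` model is a simplifying assumption, not the paper's Equations (12) and (13):

```python
import math
from dataclasses import dataclass

@dataclass
class TaskNode:
    """A task node in the time-attitude adjacency graph."""
    tid: int           # task index
    profit: float      # pri_i
    t_side: float      # overhead moment (x-coordinate)
    theta_side: float  # roll angle at the overhead moment (y-coordinate)
    wb: float          # VTW start time
    we: float          # VTW end time

def transition_time(angle_deg, rate_deg_s=1.0, settle_s=5.0):
    """Hypothetical attitude-transition model: slew at a constant rate
    plus a fixed settling time; the real trans(ea_i, ba_j) is piecewise."""
    return settle_s + angle_deg / rate_deg_s

def build_graph(tasks):
    """Directed edge i -> j exists only if j's VTW has not ended before
    i's VTW begins; the edge weight is the transition time."""
    edges = {}
    for i in tasks:
        for j in tasks:
            if i.tid != j.tid and j.we > i.wb:
                # distance in time-attitude coordinates stands in for the
                # attitude transition angle between the two tasks
                angle = math.hypot(j.t_side - i.t_side,
                                   j.theta_side - i.theta_side)
                edges[(i.tid, j.tid)] = transition_time(angle)
    return edges
```

Because edges only point toward tasks whose windows lie later, the resulting graph is acyclic, matching the directed acyclic structure described above.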

Reformulation of AEOSSP
The solution construction can be seen as a sequential decision process in which each task decision is a stage. As Figure 2 shows, in each stage we determine the next task node based on the current graph state according to the policy, which the proposed method represents with a decision neural network. Once the node of a stage is determined, the graph state is updated, and the next task node is selected according to the new state. This process is repeated until a complete scheduling solution is constructed.
We model this construction process as a Markov decision process (MDP) defined by the 5-tuple ⟨S, A, T, R, C⟩, where: • S is the state set of the time-attitude adjacency graph model; • A is the set of actions that the satellite can perform (i.e., the candidate task set); • T : S × A → S is the state transition function; • R : S × A → R+ is the reward function, which represents the profit of the selected task; • C : S × A → {0, 1} is the set of constraints, including constraints (2), (4), and (5). When C(s, a) = 0, T(s, a) = ⊥, which means the constraints are not satisfied and the state transition is infeasible. According to the Bellman equation, under the optimal policy π*, the optimal value function satisfies V*(s) = max_{a ∈ A, C(s,a)=1} [R(s, a) + V*(T(s, a))]. The corresponding optimal policy π* is π*(s) = argmax_{a ∈ A, C(s,a)=1} [R(s, a) + V*(T(s, a))].
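The constraint-gated transition T(s, a) can be sketched as a single MDP step; the dictionary-based state, the constraint callables, and the task table are illustrative simplifications of the graph state, not the paper's implementation:

```python
def mdp_step(state, action, tasks, constraints):
    """One transition of the construction MDP. Returns (next_state, reward);
    (None, 0.0) stands for the infeasible transition T(s, a) = bottom."""
    for c in constraints:            # C(s, a): every hard constraint must hold
        if not c(state, action):
            return None, 0.0
    next_state = dict(state)         # shallow copy of the partial solution
    next_state["sequence"] = state["sequence"] + [action]
    reward = tasks[action]["profit"]  # R(s, a): profit of the selected task
    return next_state, reward
```

A toy constraint such as "do not revisit a task" already reproduces the C(s, a) = 0 branch: selecting an already scheduled task yields no state transition and zero reward.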

Attitude Transition Time Constraint
For each target, the satellite observation attitude is determined. When observing two consecutive targets, a specific attitude transition time is required, which can be calculated using Equations (12) and (13). To analyze the characteristics of this constraint, we propose a method to determine the earliest observation start time of the next task based on the current task. First, we define a time delay function using the same method as in [28]. Definition 1. Time delay function: for consecutive tasks i and j, the time delay function tidy(te_i, tb_j) under the time-dependent attitude transition time constraint trans(ea_i, ba_j) is defined in Equation (14).
tidy(te_i, tb_j) = te_i + trans(ea_i, ba_j) − tb_j   (14)

The satellite completes the observation of the former task at te_i and then begins the attitude transition. After trans(ea_i, ba_j), the satellite ends the transition and waits until tb_j to observe the latter task. When tidy(te_i, tb_j) < 0, the satellite finishes the attitude transition before the observation starts, which satisfies the shortest attitude transition time. When tidy(te_i, tb_j) > 0, the transition time is insufficient and the constraint is violated. When tidy(te_i, tb_j) = 0, the transition time is exactly sufficient.
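Equation (14) is direct to implement once a transition-time model is available; in the sketch below, `trans_fn` is a stand-in for trans(ea_i, ba_j), which in reality depends on the attitudes at those times:

```python
def tidy(te_i, tb_j, trans_fn):
    """Time delay function of Equation (14): positive means the attitude
    transition from task i cannot finish before tb_j, so the pair is
    infeasible at that start time."""
    return te_i + trans_fn(te_i, tb_j) - tb_j

def transition_feasible(te_i, tb_j, trans_fn):
    """The constraint is satisfied exactly when tidy(te_i, tb_j) <= 0."""
    return tidy(te_i, tb_j, trans_fn) <= 0.0
```

With a constant 15 s transition, for example, finishing task i at t = 10 and starting task j at t = 30 gives tidy = −5, i.e., a 5 s slack before the observation.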
Pralet et al. [28] proved that for agile satellites, the delay function tidy(te_i, tb_j) monotonically increases with te_i and decreases with tb_j. This property shows that determining the earliest observation start time of the latter task can compress the attitude transition time between two tasks and increase the availability of OWs. We design the EarliestImageCal algorithm to obtain the earliest observation start time of the latter task. Its core idea is to compute this time with a linear-approximation iterative method. The pseudo-code of the algorithm is shown in Algorithm 1.
The EarliestImageCal algorithm distinguishes three situations: (1) when the start time of the latter task's VTW meets the attitude transition constraint, the task OW satisfies the constraint, as shown in Figure 3a; (2) when the end time of the latter task's VTW does not meet the attitude transition constraint, no feasible task OW exists, as shown in Figure 3b; (3) when the start time of the latter task's VTW does not meet the constraint but the end time does, we use the linear approximation method to replace the delay function, as shown in Figure 3c.

Algorithm 1 EarliestImageCal
Require: the end time te_i of the current task; the VTW [wb_j, we_j] of the latter task; the maximum number of iterations NumIter; the calculation time accuracy prc
Ensure: the earliest start time t_m of the latter task
1: h_1 = tidy(te_i, wb_j)
2: if h_1 ≤ 0 then
3:   return wb_j // observe at the earliest visible time of the latter task
4: end if
5: h_2 = tidy(te_i, we_j)
6: if h_2 > 0 then
7:   return +∞ // the attitude transition cannot be completed in the entire window
8: end if
9: for k = 1 to NumIter do
10:   t_m = wb_j + h_1 · (we_j − wb_j)/(h_1 − h_2) // linear approximation of the root
11:   h_m = tidy(te_i, t_m)
12:   if |h_m| ≤ prc then
13:     return t_m
14:   end if
15:   if h_m > 0 then
16:     wb_j = t_m; h_1 = h_m
17:   else
18:     we_j = t_m; h_2 = h_m
19:   end if
20: end for
21: return we_j
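A possible Python rendering of Algorithm 1 is given below, assuming the delay function is available as a callable `tidy_fn`; the bracketing update mirrors the regula-falsi-style linear approximation described above:

```python
def earliest_image_cal(te_i, wb_j, we_j, tidy_fn, num_iter=50, prc=1e-3):
    """Sketch of EarliestImageCal: earliest feasible start time of the
    latter task, given that tidy_fn(te_i, tb_j) decreases in tb_j."""
    h1 = tidy_fn(te_i, wb_j)
    if h1 <= 0:
        return wb_j                   # situation (1): the whole VTW start works
    h2 = tidy_fn(te_i, we_j)
    if h2 > 0:
        return float("inf")           # situation (2): transition never fits
    lo, hi = wb_j, we_j               # situation (3): root lies inside the VTW
    for _ in range(num_iter):
        # linear approximation of the zero of tidy between (lo, h1) and (hi, h2)
        t_m = lo + h1 * (hi - lo) / (h1 - h2)
        h_m = tidy_fn(te_i, t_m)
        if abs(h_m) <= prc:
            return t_m
        if h_m > 0:
            lo, h1 = t_m, h_m         # still infeasible: move the lower bound up
        else:
            hi, h2 = t_m, h_m         # feasible: tighten the upper bound
    return hi
```

For a linear delay function the first interpolation already lands on the root, so the loop terminates in one iteration; for the piecewise transition model of the paper, several iterations may be needed within the `prc` tolerance.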

GDNN Decision-Making Process
As the solution construction process in Figure 4 shows, we first update the features of the current state as input to the GDNN. The network processes the input features and uses a mask mechanism [29] to exclude infeasible tasks. The network then outputs the selection probabilities of the tasks, and a task is selected. This process is repeated until the candidate task set is empty, at which point we obtain the output solution.

Feature Engineering
Appropriate feature extraction is the foundation of network decisions. We describe the AEOSSP as the time-attitude adjacency graph, in which node attributes and edge weights are equally important. Therefore, the features of the AEOSSP comprise ten node features and five edge features, as shown in Figure 5. The following parts describe the meaning of each feature. All features must be normalized to improve the network's generalization ability and to avoid the weakening or failure of the network's decision-making caused by differences in data distribution.
Node features can be divided into task, VTW, and status features. The profit pri_i and the observation duration ct_i are task features provided by the users: the former indicates the importance of task i, and the latter indicates the shortest time required to complete the task observation. The matrix E collects the edge features e_ij ∈ R^5 between pairs of nodes. d_ij indicates the distance between two nodes. In time-attitude coordinates, the distance between two nodes corresponds to the satellite attitude transition angle between the two tasks, which can be calculated as in Equation (13). In practice, the satellite attitude transition angle is primarily determined by the roll and pitch angles. The pitch angle of the satellite is time-dependent and related to the length of the VTW. Therefore, we use the overhead time t_i^side to represent the pitch angle, thus linking time to the attitude transition angle, as shown in Equation (15).
where tw_max is the length of the longest VTW and φ_max is the maximum pitch angle of the satellite. In addition, l_ij^n1, l_ij^n5, l_ij^n10, and l_ij^n20 are features that represent the relationship between two nodes. For node i, we sort all d_ij in ascending order. If d_ij is ranked within the first K, then l_ij^nK = 1; otherwise, l_ij^nK = 0.
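The K-nearest-neighbour indicator features can be computed per node as sketched below; the function name and the tuple of K values are illustrative, with K = 1, 5, 10, 20 matching the four features above:

```python
def knn_edge_flags(dists, ks=(1, 5, 10, 20)):
    """For one node i, given the distances d_ij to every other node j,
    return the l^{nK}_ij flags: 1 if j is among the K nearest nodes."""
    order = sorted(range(len(dists)), key=lambda j: dists[j])
    flags = {k: [0] * len(dists) for k in ks}
    for rank, j in enumerate(order):
        for k in ks:
            if rank < k:          # j is within the first K in ascending order
                flags[k][j] = 1
    return flags
```

The flags are nested by construction: every node marked in l^{n1} is also marked in l^{n5}, l^{n10}, and l^{n20}, so the four features encode coarse-to-fine neighbourhood membership.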

GDNN Structure
The graph attention network (GAT) is a graph neural network structure proposed by Veličković et al. [30]. The network introduces an attention mechanism into the graph neural network and can weigh the relationships between graph nodes. By extracting problem features, the GAT can calculate the probabilities of the next actions based on the features of the current state.
In the proposed method, we design the GAT-based decision neural network (GDNN) for sequential decision-making on the problem. As shown in Figure 6, the GDNN consists of nine layers. The first four layers are embedding layers; each embedding layer is a single-layer GAT that uses the attention mechanism to weigh the node and edge features. The following five layers are fully connected layers, which are responsible only for updating the feature attributes. The fifth layer is the middle layer, which converts the network dimensions. The sixth to eighth layers are hidden layers whose dimensions remain the same. The last layer is the output layer, which outputs a one-dimensional action probability. The feature update of the entire network is independent of the graph structure.
In the proposed method, the node feature is v_i ∈ R^10 and the edge feature is e_ij ∈ R^5. To avoid complicating the network, the intermediate network structure has a uniform dimension F_3. The transfer process of the extracted features through the network layers is as follows, where l is the network layer index, l ∈ [1, 9] ∧ l ∈ N+.
(1) Embedding layer and the transfer network (l ∈ [1, 4]). The node feature vector v is transferred through the embedding layers via Equations (16)-(18), where the LeakyReLU function is proposed in [31]. The condition shown in Equation (19) is satisfied, and the ReLU function [32] is used for activation between layers. The edge feature matrix E is transferred through the embedding layers via Equation (20), and the conditions shown in Equations (21) and (22) are satisfied.
(2) Middle layer and hidden layer network (l ∈ [5, 8]).
The middle layer and hidden layers are all fully connected. The input and output dimensions are both F_3, and the feature transfer adopts the method shown in Equation (23).
(3) Output layer network transfer (l = 9). The output layer is also fully connected, and its output dimension is 1. The feature transfer uses the method shown in Equation (24).
(4) Mask mechanism. The mask mechanism is introduced to avoid infeasible action choices: for nodes that violate the constraints, the output probabilities are forced to zero. The mask label of node i is m_i. When choosing the next node, if node i violates a constraint, m_i = 0; otherwise, m_i = 1. If the probability of the final output for node i is v_i*, Equations (25)-(28) realize the mask mechanism.
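A mask mechanism of this kind is typically realized as a masked softmax. The sketch below (plain Python rather than the paper's PyTorch implementation) forces the probability of masked nodes to exactly zero and renormalizes the rest, in the spirit of Equations (25)-(28):

```python
import math

def masked_softmax(scores, mask):
    """Output-layer probabilities with the mask applied: nodes with
    m_i = 0 receive probability 0; feasible nodes renormalize to sum 1."""
    exps = [math.exp(s) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    if total == 0.0:
        raise ValueError("no feasible action left")  # all nodes masked
    return [e / total for e in exps]
```

In a tensor implementation, the same effect is usually achieved by adding a large negative constant to the scores of masked nodes before the softmax, which is numerically safer for gradient-based training.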

Training Method
The parameters of the GDNN must be obtained by learning from large batches of training data. In the proposed method, we apply proximal policy optimization (PPO) [33] to train the GDNN.
The training framework of PPO follows the actor-critic scheme [34], which includes an actor network with parameters Θ_Q and a critic network with parameters Θ_V. The pseudocode is shown in Algorithm 2.

Algorithm 2 GDNN-PPO algorithm
Require: Initialize the clipping factor ε, the mean square error factor c_1, the entropy factor c_2, the batch size K, the parameter update step size T_p, the number of parameter update epochs k, and the number of training episodes N
Ensure: The optimal GDNN network parameters Θ
1: repeat
2: Generate the instance Emp = ⟨E, v, S_sat⟩.

16: until all T_p · K batches are trained
17: Update Θ with k epochs
18: Clear the sampling pool
...
23: until all N instances end
24: return Θ

At each step t_p, we sample the action a_t based on the actor output probability p_{Θ_Q}(a_t|s_t) and save the sample (s_t, a_t, r_t, p_{Θ_Q}(a_t|s_t)) in the sampling pool. The parameters are updated when the number of samples reaches the update step size T_p. We first update the reward r_t according to Equation (29) to represent an estimate of the expected return of action a_t. The critic evaluates the value V_Θ(a_t|s_t) of the actor and the probability p_{Θ_Q}(a_t|s_t). The loss is then calculated in Lines 12-14, and the actor parameters Θ_Q are updated in Line 15 by stochastic gradient descent (SGD) [35]. The parameters Θ_V of the critic are updated by copying the updated parameters of the actor. Finally, the optimal model parameters are obtained through iteration. Instance generation for the AEOSSP follows the characteristics of the satellite resources and orbits. The parameters of the instances are randomly generated according to a normal distribution, which increases the conflict between tasks. The parameter distributions of the instances are shown in Table 2, and the satellite capability parameters are shown in Table 3.
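The core of the PPO update is the clipped surrogate objective. The single-sample sketch below shows only that term, omitting the value-error and entropy terms weighted by c_1 and c_2 in the full loss; `ratio` is the probability ratio between the new and old policies:

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate loss for one sample:
    L = -min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r = p_new(a|s) / p_old(a|s) and A is the advantage estimate."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return -min(ratio * advantage, clipped * advantage)
```

Clipping removes the incentive to push the ratio beyond 1 ± eps: for a positive advantage the gain is capped at (1 + eps) · A, and for a negative advantage the penalty is floored at (1 − eps) · A, which keeps each update close to the sampling policy.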

Parameter Distribution Details
tw_max = 300 s. (Table 2 footnotes: (1) the intermediate time of the schedule period; (2) the length of the VTW.) The above parameters are generated as integers. Based on these rules, the experiment generates instances with task scales of 40, 60, 80, and 100. The instance with a task scale of 40 is shown in Figure 7, where each label gives the task ID and profit. The figure shows that the task VTW distribution is relatively dense and the conflicts between tasks are sufficiently large to effectively reflect the performance of the algorithms. We propose three indicators to measure the performance of the algorithms: the average scheduling profit (ASP), the average scheduling time (AST), and the percentage of excess scheduling profit (PSP). The ASP measures the solution quality of an algorithm at different scales, the AST reflects the timeliness of an algorithm, and the PSP indicates the difference in ASP between the proposed method and the other algorithms. All network training and experiments use an NVIDIA TITAN RTX GPU, an i9-11900K CPU, and 64.0 GB of memory. The algorithms are coded in Python, and the deep learning framework is PyTorch 1.9.0.

Competitors
To verify the validity of the proposed method, we use several construction heuristics as baselines for the AEOSSP. Specifically, we design the following four heuristics: start time of observation time window ascending (STWA), profit of task descending (PTD), ratio of profit and image time descending (RPID), and conflict degree of task descending (CDTD). The construction-heuristic solving framework is shown in Algorithm 3. Each heuristic sorts the candidate tasks by its construction rule and inserts the tasks that satisfy the constraints into the solution sequence in turn until no task can be inserted. In CDTD, the conflict degree Cd_i is the number of VTW overlaps between task i and the other tasks, defined in Equations (30) and (31). We construct GDNN-DQN by training the GDNN with DQN. In addition, to validate the efficiency of GDNN-PPO, some high-quality algorithms are required for comparison. As mentioned in Section 1, the existing exact methods simplify the time-dependent transition time, and the solution time of CPLEX is unacceptable when the number of VTWs exceeds 27; thus, we do not consider exact algorithms. We compare the GDNN-PPO algorithm with three high-quality competitors: (1) GRILS [10], a state-of-the-art heuristic method for the AEOSSP; (2) self-adaptation differential evolution (SDE) [36], an algorithm that has been shown to be effective at solving the AEOSSP; and (3) the self-adaptation genetic algorithm (SGA) [14], an improved genetic algorithm designed for the AEOSSP. The decoding method of the SGA is the same as Algorithm 3.
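The shared framework of Algorithm 3 can be sketched as a sort-then-greedy-insert loop; `sort_key` encodes the heuristic rule (the PTD rule is shown) and `feasible` stands in for the constraint checks, both illustrative names:

```python
def construct(tasks, sort_key, feasible):
    """Construction-heuristic framework: sort the candidates by the rule,
    then insert each task that keeps the partial solution feasible."""
    solution = []
    for task in sorted(tasks, key=sort_key):
        if feasible(solution, task):
            solution.append(task)
    return solution

# PTD rule: profit of task descending (negate the profit for ascending sort).
ptd = lambda t: -t["profit"]
```

STWA, RPID, and CDTD differ only in the `sort_key` passed in (VTW start ascending, profit-to-duration ratio descending, and conflict degree descending, respectively), which is why a single framework covers all four heuristics.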
The crossover and mutation parameters are updated according to the proportion entering the next generation. The parameters of the high-quality competitors are shown in Table 4, and the GDNN parameter settings are shown in Table 5. A network structure that is too complex may affect the calculation speed and convergence, while one that is too simple may be unable to characterize the problem well. To improve the efficiency of the GDNN, we analyze its structure.
Due to the limited availability of servers for large-scale training, we consider only three parameters: the embedding layer dimension F_3, the number of hidden layers n_hid, and the number of embedding layers n_em.
For the embedding layer dimension F_3 and the number of hidden layers n_hid, experiments test networks of 128 × 4, 64 × 3, and 32 × 2 (F_3 × n_hid), corresponding to high, medium, and low dimensions. We train the networks of different dimensions on instances with 100 tasks for 10,000 training episodes and test them on 50 instances with task scales of 40, 60, 80, and 100. The results are shown in Table 6. In addition, the number of embedding layers is tested with 3, 4, 5, and 6 layers. These networks are trained on instances with a task scale of 40 for 10,000 training episodes and tested on 50 instances with a task scale of 40. The test results are shown in Table 7. From Table 6, at the current training cost, the low-dimensional network is the fastest but least profitable. The solution time of the high-dimensional network is longer, but its improvement in solution quality over the medium-dimensional network is small. For the number of embedding layers, the results in Table 7 show that the network with four embedding layers obtains the best ASP; the networks with five and six embedding layers have longer scheduling times but no improvement in scheduling profit. Therefore, the network parameters are set to F_3 = 64, n_hid = 3, and n_em = 4 in this study, and the GDNN-64×3 network with four embedding layers is considered the best network for the AEOSSP.

GDNN Training
In the proposed method, we train the GDNN with task scales of 40, 60, 80, and 100 using PPO. The training processes are shown in Figure 8. The results show that the average profit increases rapidly in the first 5000 episodes, indicating that the network parameters are being updated continuously. After 10,000 episodes, the profit improvement begins to level off and remains stable. In general, after 50,000 training episodes, the network converges stably.

Conclusions
This paper proposes a graph-based DRL method called GDNN-PPO to solve the AEOSSP with time-dependent attitude transition times. We model the AEOSSP with the time-attitude adjacency graph and reformulate the problem as an MDP. We then extract the features of the AEOSSP, including node and edge features, and design a GDNN to guide task selection. Finally, we train the GDNN with PPO and design experiments to verify the validity of the proposed method. The experimental results show that GDNN-PPO outperforms all construction heuristics in ASP by at least 45% and surpasses the high-quality competitors, except for the state-of-the-art algorithm GRILS, with regard to ASP and AST. Compared with GRILS, the difference in GDNN-PPO performance across all instances is less than 7%; however, the AST of GRILS is on average 345 times longer than that of GDNN-PPO. Thus, GDNN-PPO performs well at large scales and with rapid response when solving the AEOSSP and has strong potential for future applications in large constellations and new management models.
Although GDNN-PPO demonstrates significant advantages in solution time, there is still room to improve its scheduling profit. In future work, we plan to further improve the solving efficiency of the GDNN by optimizing the feature selection, the structure design, and so on. Additionally, we plan to combine the proposed method with other algorithms to solve more complex satellite scheduling problems, such as multi-satellite scheduling problems.

Figure 1 .
Figure 1. Time-attitude adjacency graph of the AEOSSP. The number indicates the task number, the black arrows indicate the tasks that can be selected in the current state, and the red arrow indicates the task that is actually selected.

Figure 2 .
Figure 2. Solution construction process. The numbers in the green squares represent the solution.

Figure 3 .
Figure 3. Three situations of EarliestImageCal. (a) shows that the task OW satisfies the constraint, (b) demonstrates that the task OW cannot satisfy the constraint, and (c) represents the situation in which the linear approximation method can be used.

Figure 4 .
Figure 4. Solution construction process for the AEOSSP.

Figure 7 .
Figure 7. Instance with a task scale of 40.


The VTW features represent the VTW of the task, including the overhead time t_i^side, the overhead roll angle θ_i^side, the start time wb_i and end time we_i of the VTW, and the earliest start time t_i^m of task i with its corresponding roll angle θ_i^m. The status features l_i^wait and l_i^last are updated after each decision. l_i^wait indicates whether task i is a candidate task in the current state: when task i is among the candidate tasks, l_i^wait = 1; otherwise, l_i^wait = 0. l_i^last indicates whether task i is the last task in the current solution sequence: if task i is the last task, l_i^last = 1; otherwise, l_i^last = 0. After normalization, we obtain the ten-dimensional node feature vector v_i.
Choose a_t by sampling according to p_{Θ_Q}(a_t|s_t). Execute a_t, gather r_t and p_{Θ_Q}(a_t|s_t), and update s_{t+1} = (E_{t+1}, v_{t+1}). Store (s_t, a_t, r_t, p_{Θ_Q}(a_t|s_t)) in the sampling pool.

Table 4 .
Parameter settings of the high-quality competitors.

Table 5 .
Main parameters of the GDNN training.

Table 6 .
Experimental results of different dimensional networks in instances with different scales.

Table 7 .
Experimental results with different numbers of embedding layers on instances with a task scale of 40. Bold indicates the best ASP and AST among all algorithms.