Article

Deep Reinforcement Learning for the Agile Earth Observation Satellite Scheduling Problem

Jie Chun, Wenyuan Yang, Xiaolu Liu, Guohua Wu, Lei He and Lining Xing
1 College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
2 College of Advanced Interdisciplinary Studies, National University of Defense Technology, Changsha 410073, China
3 Beijing Institute for Advanced Study, National University of Defense Technology, Beijing 100101, China
4 School of Traffic and Transportation Engineering, Central South University, Changsha 410075, China
5 College of Electronic Engineering, Xidian University, Xi’an 710126, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2023, 11(19), 4059; https://doi.org/10.3390/math11194059
Submission received: 13 August 2023 / Revised: 12 September 2023 / Accepted: 23 September 2023 / Published: 25 September 2023
(This article belongs to the Special Issue Evolutionary Computation 2022)

Abstract

The agile earth observation satellite scheduling problem (AEOSSP) is a combinatorial optimization problem with time-dependent constraints. Many construction heuristics and meta-heuristics have been proposed in recent years; however, existing methods cannot balance the requirements of efficiency and timeliness. In this paper, we propose a graph attention network-based decision neural network (GDNN) to solve the AEOSSP. Specifically, we first represent the tasks and the time-dependent attitude transition constraints by a graph. We then describe the problem as a Markov decision process and perform feature engineering. On this basis, we design a GDNN to guide the construction of the solution sequence and train it with proximal policy optimization (PPO). Experimental results show that the proposed method outperforms construction heuristics in scheduling profit by at least 45%. The proposed method also approximates the profits of the state-of-the-art method with an error of less than 7% while reducing the scheduling time markedly. Finally, we demonstrate the scalability of the proposed method.

1. Introduction

Agile earth observation satellites (AEOSs) are a new generation of earth observation satellites (EOSs) with three degrees of freedom: roll, pitch, and yaw. With an extensive observation range, long observation time, and no terrain limitations, AEOSs play an important role in weather forecasting, disaster warning, environmental protection, ground mapping, and maritime search and rescue. Compared with a traditional EOS, which has only roll capability, an AEOS has a longer visible time window (VTW) for observing a ground target. The observation window (OW) is the actual observation period of a task, whose length is the observation duration requested by the user. The OW is variable and can be any period within the VTW that guarantees the integrity of the observation, which makes the solution space of the AEOSSP large. When observing two targets in succession, the AEOS must perform an attitude transition, and because the satellite attitude depends on the start and end times of the OWs, the attitude transition time between two tasks is variable and time-dependent. The agile earth observation satellite scheduling problem (AEOSSP) requires determining the task observation sequence and the OW of each task so as to satisfy observation integrity, the attitude transition constraint, and other hard constraints of the satellite, such as memory and power consumption. Therefore, the AEOSSP is a typical combinatorial optimization problem with complex constraints and has been shown to be NP-hard [1].
With the expansion of AEOS application fields, observation requests have become more frequent and observation requirements more diverse. Despite their improved observation capabilities, AEOSs remain a scarce resource that cannot satisfy the high demand for observations. In addition, some emergencies, such as earthquakes and floods, require satellites to complete observations as soon as possible. Therefore, a fast and efficient scheduling algorithm is essential to improve the utilization of satellites.
In recent decades, many scholars have studied the AEOSSP. Lemaître et al. [2] were the first to study the AEOSSP and described it as a combinatorial optimization problem involving the selection and scheduling of observation tasks. Due to the complexity of the problem, few exact methods have been proposed. Wang et al. [3] proposed a mixed-integer programming model for the AEOSSP and reduced the problem complexity by discretizing the continuous observation angle into three angles; they obtained an approximate upper bound of the problem with CPLEX. Chu et al. [4] designed an implicit enumeration algorithm that constructs a solution under a depth-first search framework with three pruning strategies. For the AEOSSP, the exact methods all simplify the time-dependent constraint. In addition, Ref. [5] showed that solutions cannot be obtained in an acceptable time with the CPLEX solver when the number of VTWs exceeds 27. Currently, research on combinatorial optimization problems focuses primarily on heuristics and meta-heuristics. Heuristic and meta-heuristic algorithms can solve large-scale problems and are widely used in practice, and a well-designed algorithm can significantly improve efficiency [6,7,8]. For the AEOSSP, Lemaître et al. [2] proposed four heuristics: a greedy algorithm, dynamic programming, a constraint programming algorithm, and a local search algorithm. There are also several profit-based construction heuristics [3,9] and an iterated local search [10]. Meta-heuristics include the tabu search algorithm [11,12], the hybrid differential evolution algorithm [13], improved genetic algorithms [14,15,16,17], and adaptive large neighbourhood search algorithms [5,18]. However, the search difficulty and solution time of these algorithms increase dramatically as the problem scale grows. Traditional heuristic and meta-heuristic algorithms cannot meet the requirements of high efficiency and fast response in practical applications. Moreover, these rule-based algorithms rely heavily on the designer’s experience, which limits solution quality. Therefore, traditional methods are constrained by their solution characteristics and cannot produce high-quality results in a timely manner.
In recent years, deep reinforcement learning (DRL) has been applied to many classical combinatorial optimization problems, such as the travelling salesman problem (TSP) and the vehicle routing problem (VRP). Vinyals et al. [19] first proposed pointer networks (PNs) to solve the TSP, with a model that follows the traditional sequence-to-sequence (Seq2Seq) structure. Bello et al. [20] then used policy gradient and actor–critic algorithms to train the PN model, which can obtain near-optimal solutions for TSP instances with 100 tasks. Nazari et al. [21] proposed an end-to-end DRL framework that divides the problem input into static and dynamic parts to solve the VRP with dynamic characteristics. Joshi et al. [22] proposed an efficient graph convolutional network technique to solve the TSP. Additionally, DRL methods have been used in several real-world problems [23,24].
The literature highlights the strong potential of DRL for solving combinatorial optimization problems, and some scholars have performed related studies. Chen et al. [25] proposed an end-to-end DRL framework for the AEOSSP as the first attempt to apply DRL to this problem. Zhao et al. [26] proposed a two-phase neural combinatorial optimization method with reinforcement learning, which determines the observation sequence by neural combinatorial optimization and the start time of the tasks by a reinforcement learning algorithm based on the deep deterministic policy gradient. Wei et al. [27] proposed a deep reinforcement learning and parameter transfer-based approach (RLPT) to solve a multi-objective AEOSSP. All these methods simplify the time-dependent attitude transition time constraint, yet this is one of the most important constraints of the AEOSSP. To better represent this constraint, we propose to model the AEOSSP as a graph, with edges representing the attitude transition time. On this basis, we propose to use a graph neural network (GNN) to solve the problem.
In this paper, to solve the AEOSSP, we present a graph-based DRL method that is different from existing methods. The primary contributions of this study are as follows:
(1) We model the AEOSSP with the time-dependent attitude transition time as a graph, which can more accurately represent tasks and their relationships through nodes and edges.
(2) Based on the graph model of the AEOSSP, we design its MDP solution process and propose a graph attention network (GAT)-based decision neural network (GDNN) to represent the policy, which is trained by an RL method.
(3) We design extensive experiments to demonstrate the effectiveness and timeliness of the proposed method by comparing it with specific competitors. In addition, we perform a model study to verify the structure and generalization of GDNN.
The remainder of this paper is organized as follows. Section 2 models the graph formulation of the AEOSSP, the attitude transition time constraint, and the reformulation of the AEOSSP. Section 3 describes the proposed method, including the GDNN and the training method. Section 4 provides the computational experiments and analysis. Finally, Section 5 concludes the article and presents suggestions for future research.

2. Problem Description

2.1. Parameters

The parameters used in this paper are summarized in Table 1.

2.2. Mathematical Formulation

$$\max \sum_{i=0}^{n_{tsk}} \sum_{j=0}^{n_{tsk}} x_{ij}\, pri_{j} \quad (1)$$
$$wb_i \le tb_i \le te_i \le we_i, \quad \forall i \in \{0, 1, 2, \ldots, n_{tsk}\} \quad (2)$$
$$tb_i + ct_i = te_i, \quad \forall i \in \{0, 1, 2, \ldots, n_{tsk}\} \quad (3)$$
$$te_i + trans(ea_i, ba_j) - tb_j \le 0, \quad \text{if } x_{ij} = 1 \quad (4)$$
$$\sum_{i=0}^{n_{tsk}} \sum_{j=0}^{n_{tsk}} x_{ij}\left(ct_j \cdot u_{ie} + u_{te} \cdot trans(ea_i, ba_j)\right) \le (1 - \zeta) \cdot Egy_{max} \quad (5)$$
$$\sum_{i=0}^{n_{tsk}} x_{ij} \le 1, \quad \forall j \in \{0, 1, 2, \ldots, n_{tsk}\} \quad (6)$$
$$\sum_{j=0}^{n_{tsk}} x_{ij} \le 1, \quad \forall i \in \{0, 1, 2, \ldots, n_{tsk}\} \quad (7)$$
$$x_{ii} = 0, \quad \forall i \in \{0, 1, 2, \ldots, n_{tsk}\} \quad (8)$$
$$x_{ij} \in \{0, 1\}, \quad \forall i, j \in \{0, 1, 2, \ldots, n_{tsk}\} \quad (9)$$
Equation (1) is the optimization objective of the AEOSSP, which is to maximize the sum of the priorities of the completed tasks. Equation (2) is the time window constraint, i.e., each task must be observed within its VTW. Equation (3) relates the start time, end time, and duration of a task. Equation (4) is the time-dependent attitude transition time constraint between tasks. Equation (5) is the power consumption constraint of the satellite. Equation (6) indicates that each task has at most one former task, and Equation (7) indicates that each task has at most one latter task. Equation (8) indicates that a task can be neither its own former task nor its own latter task. Equation (9) defines the domain of the decision variable.

2.3. Time-Attitude Adjacency Graph of the AEOSSP

We model the AEOSSP as an adjacency graph, which is a typical class of directed acyclic graphs. When the satellite passes directly over the target (i.e., at the overhead moment), the pitch angle of the satellite is 0. We introduce time-attitude coordinates to represent the VTW of each task. As shown in Figure 1, the timeline of satellite operation is the x-axis, and the roll angle of the satellite is the y-axis. A task is a node in the graph whose coordinate (t_i^side, θ_i^side) consists of the satellite overhead moment and the corresponding roll angle. Each node has seven attributes: pri_i, wb_i, we_i, tb_i, te_i, ba_i, and ea_i. The edge weight w_ij = trans(ea_i, ba_j) between node_i and node_j indicates the attitude transition time between task_i and task_j. The optimization objective is to find a path that begins at virtual node 0 and satisfies all constraints while maximizing the sum of the profits of the nodes on the path.

2.4. Reformulation of AEOSSP

The solution construction can be seen as a sequential decision process. Each task decision can be considered to be a stage. As Figure 2 shows, in each stage, we can determine the next task node based on the current graph state according to the policy, which is represented using the decision neural network in the proposed method. Once the node of one stage is determined, the graph of the current state is updated. The next task node is selected according to the current state. Then, the process is repeated until a scheduling solution is constructed.
We model this construction process as a Markov decision process (MDP) defined by the 5-tuple (S, A, T, R, C), where:
  • S is the state set of the time-attitude adjacency graph model;
  • A is the set of actions that the satellite can perform (i.e., the candidate task set);
  • T: S × A → S is the state transition function;
  • R: S × A → ℝ⁺ is the reward function, which returns the profit of the selected task;
  • C: S × A → {0, 1} is the set of constraints, including constraints (2), (4) and (5). When C(s, a) = 0, T(s, a) = ∅, which means that the constraints are not satisfied and the state transition is infeasible.
According to the Bellman equation, under the optimal policy π , the optimal value function satisfies:
$$V(s) = \max_{a \in A}\left[R(s, a) + V\big(T(s, a)\big)\right], \quad \forall s \in S, \ \text{s.t. } T(s, a) \neq \emptyset \quad (10)$$
The corresponding optimal strategy π is:
$$\pi(s) = \arg\max_{a \in A}\left[R(s, a) + V\big(T(s, a)\big)\right], \quad \forall s \in S, \ \text{s.t. } T(s, a) \neq \emptyset \quad (11)$$
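To make the rollout view of Equations (10) and (11) concrete, the sketch below constructs a solution greedily under a given policy. It is a minimal illustration only: the helpers feasible_actions (which would enforce constraints (2), (4) and (5)), transition (which would update the graph state), and the task attribute profit are hypothetical names, and the trained GDNN of Section 3 would play the role of policy.

```python
def construct_solution(initial_state, policy, feasible_actions, transition):
    """Greedy rollout of the AEOSSP construction MDP (Eqs. (10)-(11)).

    policy(state, actions) returns a probability for each candidate action;
    feasible_actions(state) returns tasks satisfying constraints (2), (4), (5);
    transition(state, action) returns the next graph state.
    All three helpers are assumed to be supplied by the environment.
    """
    state, solution, total_profit = initial_state, [], 0.0
    while True:
        actions = feasible_actions(state)            # candidate task set A
        if not actions:                              # no feasible transition left
            break
        probs = policy(state, actions)               # pi(a | s)
        action = max(zip(actions, probs), key=lambda ap: ap[1])[0]
        total_profit += action.profit                # reward R(s, a) = task profit
        solution.append(action)
        state = transition(state, action)            # T(s, a)
    return solution, total_profit
```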

2.5. Attitude Transition Time Constraint

For each target, the satellite observation attitude is determined. When observing two consecutive targets, a specific attitude transition time is required, which can be calculated using Equations (12) and (13), where b_0 = 11.66, b_1 = 5, b_2 = 10, b_3 = 16, b_4 = 22, a_1 = 1.5, a_2 = 2, a_3 = 2.5, a_4 = 3, z_0 = 10, z_1 = 30, z_2 = 60, and z_3 = 90. The length of the transition time depends on the ending attitude of the former task and the starting attitude of the latter task, both of which are time-dependent.
$$trans(ea_i, ba_j) = \begin{cases} b_0, & \rho_{ij} \le z_0 \\ b_1 + \rho_{ij}/a_1, & z_0 < \rho_{ij} \le z_1 \\ b_2 + \rho_{ij}/a_2, & z_1 < \rho_{ij} \le z_2 \\ b_3 + \rho_{ij}/a_3, & z_2 < \rho_{ij} \le z_3 \\ b_4 + \rho_{ij}/a_4, & \rho_{ij} > z_3 \end{cases}, \quad \text{if } x_{ij} = 1 \quad (12)$$
$$\rho_{ij} = \left|\theta_{i, te_i} - \theta_{j, tb_j}\right| + \left|\varphi_{i, te_i} - \varphi_{j, tb_j}\right| + \left|\psi_{i, te_i} - \psi_{j, tb_j}\right|, \quad \text{if } x_{ij} = 1 \quad (13)$$
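As a concrete illustration of Equation (12), the following Python sketch evaluates the piecewise attitude transition time from a given transition angle ρ_ij, using the constants listed above; the angle itself would be obtained from Equation (13).

```python
def transition_time(rho_ij: float) -> float:
    """Attitude transition time (s) for a transition angle rho_ij (deg), Eq. (12)."""
    b = [11.66, 5.0, 10.0, 16.0, 22.0]    # b0..b4
    a = [1.5, 2.0, 2.5, 3.0]              # a1..a4
    z = [10.0, 30.0, 60.0, 90.0]          # z0..z3
    if rho_ij <= z[0]:
        return b[0]
    for k in range(1, 4):                 # segments (z0, z1], (z1, z2], (z2, z3]
        if rho_ij <= z[k]:
            return b[k] + rho_ij / a[k - 1]
    return b[4] + rho_ij / a[3]           # rho_ij > z3
```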
To analyze the characteristics of this constraint, we propose a method to determine the earliest observation start time for the next task based on the current task. First, we define a time delay function using the same method as in [28].
Definition 1.
Time delay function: For consecutive task_i and task_j, the time delay function under the time-dependent attitude transition time constraint (te_i, tb_j, trans(ea_i, ba_j)) can be defined as Equation (14).
$$tidy(te_i, tb_j) = te_i + trans(ea_i, ba_j) - tb_j \quad (14)$$
The satellite completes the observation of the former task at te_i and then begins the attitude transition. After trans(ea_i, ba_j), the satellite ends the transition and waits until tb_j to observe the latter task. When tidy(te_i, tb_j) < 0, the satellite finishes the attitude transition before the observation, so the shortest attitude transition time requirement is satisfied. When tidy(te_i, tb_j) > 0, the transition time is insufficient and the constraint is violated. When tidy(te_i, tb_j) = 0, the transition time is exactly sufficient.
Pralet et al. [28] proved that, for agile satellites, the delay function tidy(te_i, tb_j) monotonically increases with te_i and decreases with tb_j. This property shows that determining the earliest observation start time of the latter task can compress the attitude transition time between two tasks and increase the availability of OWs. We designed the EarliestImageCal algorithm to obtain the earliest observation start time of the latter task; its core idea is to approximate the zero of the delay function iteratively by linear interpolation. The pseudo-code is shown in Algorithm 1.
The EarliestImageCal algorithm distinguishes three situations: (1) when the start time of the latter task’s VTW already meets the attitude transition constraint, the OW can begin at the start of the VTW, as shown in Figure 3a; (2) when even the end time of the latter task’s VTW does not meet the attitude transition constraint, no feasible OW exists, as shown in Figure 3b; (3) when the start time does not meet the constraint but the end time does, we use a linear approximation of the delay function to locate the earliest feasible start time, as shown in Figure 3c.
Algorithm 1 EarliestImageCal
Require: the end time of the current task te_i; the VTW of the latter task [wb_j, we_j]; the maximum number of iterations NumIter; the calculation time accuracy prc
Ensure: the earliest start time of the latter task t_m
 1: h_1 = tidy(te_i, wb_j)
 2: if h_1 ≤ 0 then
 3:   return wb_j  // observe at the earliest visible time of the latter task
 4: end if
 5: h_2 = tidy(te_i, we_j)
 6: if h_2 > 0 then
 7:   return +∞  // the attitude transition cannot be completed within the entire window
 8: end if
 9: for j = 1 to NumIter do
10:   t_m = (h_2 · wb_j − h_1 · we_j) / (h_2 − h_1)
11:   h_m = tidy(te_i, t_m)
12:   if |h_m| < prc then
13:     return t_m
14:   end if
15:   if h_m > 0 then
16:     wb_j = t_m
17:     h_1 = h_m
18:   else
19:     we_j = t_m
20:     h_2 = h_m
21:   end if
22: end for
23: return we_j
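For readers who prefer an executable form, a minimal Python rendering of Algorithm 1 is given below. The delay function tidy of Equation (14) is passed in because it depends on the satellite attitude model, and the default values of num_iter and prc are illustrative only.

```python
import math

def earliest_image_cal(tidy, te_i, wb_j, we_j, num_iter=50, prc=0.1):
    """Earliest feasible observation start time of the latter task (Algorithm 1).

    tidy(te_i, tb_j) is the time delay function of Eq. (14); the zero of
    tidy(te_i, .) inside [wb_j, we_j] is approximated by linear interpolation.
    """
    h1 = tidy(te_i, wb_j)
    if h1 <= 0:                      # earliest VTW time already satisfies the constraint
        return wb_j
    h2 = tidy(te_i, we_j)
    if h2 > 0:                       # transition cannot finish within the whole VTW
        return math.inf
    for _ in range(num_iter):
        t_m = (h2 * wb_j - h1 * we_j) / (h2 - h1)   # linear approximation of the root
        h_m = tidy(te_i, t_m)
        if abs(h_m) < prc:
            return t_m
        if h_m > 0:                  # root lies to the right of t_m
            wb_j, h1 = t_m, h_m
        else:                        # root lies to the left of t_m
            we_j, h2 = t_m, h_m
    return we_j
```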

3. Methodology

3.1. GDNN Decision-Making Process

As the solution construction process in Figure 4 shows, we first update the features of the current state as input to the GDNN. The network fuses the input features and uses a mask mechanism [29] to exclude infeasible tasks. It then outputs a selection probability for each task and selects the next task accordingly. This process is repeated until the candidate task set is empty, and the constructed solution is returned.

3.2. Feature Engineering

Appropriate feature extraction is the foundation of network decisions. We describe the AEOSSP as the time-attitude adjacency graph in which node attributes and edge weights are equally important. Therefore, the features of the AEOSSP comprise ten node features and five edge features, as shown in Figure 5. The following parts describe the meaning of each feature. All features must be normalized to improve network generalization ability and avoid weakening or failure of the network decision-making effect caused by the difference in data distribution.
Node features can be divided into task, VTW, and status features. The profit pri_i and the observation duration ct_i are task features specified by the users: the former indicates the importance of task_i, and the latter indicates the shortest time required to complete the task observation. The VTW features describe the VTW of the task, including the overhead time t_i^side, the overhead roll angle θ_i^side, the start time wb_i and end time we_i of the VTW, the earliest start time t_i^m, and its corresponding roll angle θ_i^m of task_i. The status features l_i^wait and l_i^last are updated after each decision. l_i^wait indicates whether task_i is a candidate task in the current state: when task_i is among the candidate tasks, l_i^wait = 1; otherwise, l_i^wait = 0. l_i^last indicates whether task_i is the last task in the current solution sequence: if task_i is the last task, l_i^last = 1; otherwise, l_i^last = 0. After normalization, we obtain the node features v_i = (p̂ri_i, t̂_i^side, θ̂_i^side, ĉt_i, ŵb_i, ŵe_i, t̂_i^m, θ̂_i^m, l_i^wait, l_i^last) ∈ ℝ^10 and the node feature vector v.
The matrix of edge features E contains the edge features between each pair of nodes, where e_ij = (d̂_ij, l_ij^n1, l_ij^n5, l_ij^n10, l_ij^n20) ∈ ℝ^5. d_ij indicates the distance between two nodes. In time-attitude coordinates, the distance between two nodes represents the satellite attitude transition angle between two tasks, which can be calculated as in Equation (13). In practice, the satellite attitude transition angle is determined primarily by the roll and pitch angles. The pitch angle of the satellite is time-dependent and related to the length of the VTW; therefore, we use the overhead time t_i^side to represent the pitch angle, thus linking time to the attitude transition angle, as shown in Equation (15).
$$d_{ij} = \left|t_i^{side} - t_j^{side}\right| \cdot \frac{2\varphi_{max}}{tw_{max}} + \left|\theta_i^{side} - \theta_j^{side}\right| \quad (15)$$
where tw_max is the length of the longest VTW and φ_max is the maximum pitch angle of the satellite. In addition, l_ij^n1, l_ij^n5, l_ij^n10, and l_ij^n20 are indicator features describing the relationship between two nodes. For node_i, we sort all d_ij in ascending order; if d_ij ranks within the first K, then l_ij^nK = 1; otherwise, l_ij^nK = 0.
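The following sketch shows how the distance of Equation (15) and the nearest-neighbour indicators could be computed with NumPy; the array arguments and the final normalization of d̂_ij are assumptions for illustration, since the normalization scheme is not specified here.

```python
import numpy as np

def edge_features(t_side, theta_side, tw_max, phi_max, ks=(1, 5, 10, 20)):
    """Distance matrix d_ij of Eq. (15) plus nearest-neighbour indicators l_ij^nK.

    t_side, theta_side: 1-D arrays of overhead times and roll angles per task;
    tw_max: length of the longest VTW; phi_max: maximum pitch angle (scalars).
    """
    dt = np.abs(t_side[:, None] - t_side[None, :]) * (2.0 * phi_max / tw_max)
    dtheta = np.abs(theta_side[:, None] - theta_side[None, :])
    d = dt + dtheta                                   # attitude transition angle proxy
    rank = d.argsort(axis=1).argsort(axis=1)          # rank of d_ij among node i's edges
    indicators = [(rank < k).astype(float) for k in ks]
    d_hat = d / d.max() if d.max() > 0 else d         # simple normalization (illustrative)
    return np.stack([d_hat, *indicators], axis=-1)    # shape (n, n, 5)
```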

3.3. GDNN Structure

The graph attention network (GAT) is a graph neural network structure proposed by Veličković et al. [30]. It introduces an attention mechanism into the graph neural network and can weigh the relationships between graph nodes. By extracting problem features, the GAT can compute the probabilities of the next actions based on the features of the current state.
In the proposed method, we design the GAT-based decision neural network (GDNN) for problem sequence decision-making. As shown in Figure 6, the GDNN consists of nine layers.
The first four layers are embedding layers, and each embedding layer is a single-layer GAT network using the attention mechanism to weigh the node and edge features. The following five layers are all fully connected layers, which are only responsible for updating the attributes of features. The fifth layer is the middle layer, which converts the network dimensions. The sixth to eighth layers are hidden layers whose dimensions remain the same. The last layer is the output layer and outputs a one-dimensional action probability. The feature update of the entire network is independent of the graph structure.
In the proposed method, the node feature is v_i ∈ ℝ^10 and the edge feature is e_ij ∈ ℝ^5. To avoid complicating the network, the intermediate layers share the same dimension, which is unified as F_3. The transfer process of the extracted features through the network layers is as follows, where l is the network layer index, l ∈ [1, 9], l ∈ ℕ⁺.
(1) Embedding layer network (l ∈ [1, 4]).
The node feature vector v is transferred through the embedding layers by Equations (16)–(18), where the LeakyReLU function is proposed in [31]; the weight dimensions satisfy Equation (19), and the ReLU function [32] is used for activation between layers. The edge feature matrix E is transferred through the embedding layers by Equation (20), with the weight dimensions satisfying Equations (21) and (22).
$$z_{ij}^{(l)} = W_v^{(l)}\left[v_i^{(l)} \,\|\, e_{ij}^{(l)} \,\|\, v_j^{(l)}\right] \quad (16)$$
$$\alpha_{ij}^{(l)} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{(l)T} z_{ij}^{(l)}\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(a^{(l)T} z_{ik}^{(l)}\right)\right)}, \quad a^{(l)} \in \mathbb{R}^{F_3} \quad (17)$$
$$v_i^{(l+1)} = \mathrm{ReLU}\left(\sum_{j \in N_i} \alpha_{ij}^{(l)} z_{ij}^{(l)}\right) \quad (18)$$
$$W_v^{(l)} \in \begin{cases}\mathbb{R}^{F_3 \times (2F_1 + F_2)}, & l = 1 \\ \mathbb{R}^{F_3 \times 2F_3}, & l \in [2, 4]\end{cases} \quad (19)$$
$$e_{ij}^{(l+1)} = \mathrm{ReLU}\left(W_E^{(l)} v_i^{(l)} + W_E^{(l)} v_j^{(l)} + W_{EE}^{(l)} e_{ij}^{(l)}\right) \quad (20)$$
$$W_E^{(l)} \in \begin{cases}\mathbb{R}^{F_3 \times F_1}, & l = 1 \\ \mathbb{R}^{F_3 \times F_3}, & l \in [2, 4]\end{cases} \quad (21)$$
$$W_{EE}^{(l)} \in \begin{cases}\mathbb{R}^{F_3 \times F_2}, & l = 1 \\ \mathbb{R}^{F_3 \times F_3}, & l \in [2, 4]\end{cases} \quad (22)$$
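The sketch below is one possible PyTorch realization of a single embedding layer following Equations (16)–(18) and (20), applied to a fully connected task graph. The class and parameter names are illustrative, and the full GDNN would stack four such layers before the fully connected layers described next.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GDNNEmbeddingLayer(nn.Module):
    """One GAT-style embedding layer (Eqs. (16)-(18) and (20)), fully connected graph."""

    def __init__(self, node_dim, edge_dim, out_dim):
        super().__init__()
        self.w_v = nn.Linear(2 * node_dim + edge_dim, out_dim, bias=False)  # Eq. (16)
        self.attn = nn.Parameter(torch.randn(out_dim))                      # a^(l), Eq. (17)
        self.w_e = nn.Linear(node_dim, out_dim, bias=False)                 # W_E, Eq. (20)
        self.w_ee = nn.Linear(edge_dim, out_dim, bias=False)                # W_EE, Eq. (20)

    def forward(self, v, e):
        # v: (n, node_dim) node features; e: (n, n, edge_dim) edge features
        n = v.size(0)
        vi = v.unsqueeze(1).expand(n, n, -1)          # v_i broadcast over j
        vj = v.unsqueeze(0).expand(n, n, -1)          # v_j broadcast over i
        z = self.w_v(torch.cat([vi, e, vj], dim=-1))  # z_ij, shape (n, n, out_dim)
        scores = F.leaky_relu((z * self.attn).sum(-1))           # a^T z_ij
        alpha = torch.softmax(scores, dim=1)                     # softmax over neighbours j
        v_next = F.relu((alpha.unsqueeze(-1) * z).sum(dim=1))    # Eq. (18)
        e_next = F.relu(self.w_e(vi) + self.w_e(vj) + self.w_ee(e))  # Eq. (20)
        return v_next, e_next
```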
(2) Middle layer and hidden layer network (l ∈ [5, 8]).
The middle layer and hidden layers are all fully connected layers. The input and output dimensions are both F_3, and the feature transfer follows Equation (23).
$$v_i^{(l+1)} = \mathrm{ReLU}\left(W_v^{(l)} v_i^{(l)}\right), \quad W_v^{(l)} \in \mathbb{R}^{F_3 \times F_3} \quad (23)$$
(3) Output layer network transfer (l = 9).
The output layer is also fully connected, and its output dimension is 1. The feature transmission follows Equation (24).
$$v_i^{(l+1)} = W_v^{(l)} v_i^{(l)}, \quad W_v^{(l)} \in \mathbb{R}^{1 \times F_3} \quad (24)$$
(4) Mask mechanism
The mask mechanism is introduced to avoid infeasible action choices: for nodes that violate the constraints, the output probabilities are forced to zero. The mask label of node_i is m_i; when choosing the next node, m_i = 0 if node_i violates a constraint, and m_i = 1 otherwise. If the final output score of node_i is v_i, Equations (25)–(28) realize the mask mechanism.
$$v_i = v_i + \left|\min_i v_i\right| \quad (25)$$
$$v_i = \frac{v_i}{\max_i v_i} \cdot m_i \quad (26)$$
$$v_i = \frac{\exp(v_i)\, m_i}{\sum_k \exp(v_k)\, m_k} \quad (27)$$
$$v_i = \frac{v_i\, m_i}{\sum_k v_k\, m_k} \quad (28)$$
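A compact PyTorch sketch of the masked normalization in Equations (25)–(28) is given below; it follows the reconstructed equations and should be read as one plausible realization rather than the exact implementation.

```python
import torch

def masked_probabilities(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Turn raw output scores v_i into action probabilities with mask m_i (Eqs. (25)-(28)).

    scores: (n,) raw outputs of the last GDNN layer; mask: (n,) with 1 = feasible, 0 = infeasible.
    """
    v = scores + scores.min().abs()          # Eq. (25): shift scores to be non-negative
    v = v / v.max().clamp(min=1e-8) * mask   # Eq. (26): scale and zero-out infeasible nodes
    v = torch.exp(v) * mask                  # Eq. (27): masked exponential ...
    v = v / v.sum().clamp(min=1e-8)          # ... normalized over feasible nodes
    v = v * mask                             # Eq. (28): renormalize strictly on the mask
    return v / v.sum().clamp(min=1e-8)
```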

3.4. Training Method

The parameters of the GDNN must be learned from large batches of training data. In the proposed method, we apply proximal policy optimization (PPO) [33] to train the GDNN.
The training framework of PPO follows the actor–critic scheme [34] and includes an actor network with parameters Θ_Q and a critic network with parameters Θ_V. The pseudo-code is shown in Algorithm 2.
Algorithm 2 GDNN-PPO algorithm
Require: the clipping factor ϵ; the mean square error factor c_1; the entropy factor c_2; the batch size K; the parameter update step size T_p; the number of parameter update epochs k; the number of training episodes N
Ensure: the optimal GDNN network parameters Θ
 1: repeat
 2:   Generate the instance Emp = (E, v, S_sat).
 3:   while s_t is not done do
 4:     Choose a_t by sampling according to p_{Θ_Q}(a_t | s_t).
 5:     Execute a_t, gather r_t and p_{Θ_Q}(a_t | s_t), and update s_{t+1} = (E_{t+1}, v_{t+1}).
 6:     Store (s_t, a_t, r_t, p_{Θ_Q}(a_t | s_t)) in the sampling pool.
 7:     t_p = t_p + 1
 8:     if t_p = T_p then
 9:       Update r̂
10:       repeat
11:         repeat
12:           u_t(Θ) = p_{Θ_V}(a_t | s_t) / p_{Θ_Q}(a_t | s_t)
13:           Â_t = r̂_t − V_Θ(a_t | s_t)
14:           L_t(Θ) = Ê_t[ L_t^CLIP(Θ) + c_1 L_t^VF(Θ) + c_2 S[p_Θ](s_t) ]
15:           Update Θ using SGD.
16:         until all ⌈T_p / K⌉ batches are trained
17:       until Θ has been updated for k epochs
18:       Θ_V = Θ_Q
19:       Clear the sampling pool.
20:       t_p = 1
21:     end if
22:   end while
23: until all N instances end
24: return Θ
At each step t_p, we sample the action a_t based on the actor output probability p_{Θ_Q}(a_t | s_t) and save the sample (s_t, a_t, r_t, p_{Θ_Q}(a_t | s_t)) in the sampling pool. The parameters are updated when the number of samples t_p reaches the parameter update step size T_p. We first update the reward r̂_t according to Equation (29) to obtain an estimate of the expected return of action a_t. The critic evaluates the value V_Θ(a_t | s_t) of the actor and the probability p_{Θ_Q}(a_t | s_t). Then, the loss is calculated in Lines 12–14, and the actor parameters Θ_Q are updated in Line 15 by stochastic gradient descent (SGD) [35]. The critic parameters Θ_V are updated by copying the updated actor parameters. Finally, the optimal model parameters are obtained through iteration.
$$\hat{r}_i = \sum_{k=i}^{n} r_k = \sum_{k=i}^{n} \gamma\, pri_{j_k} \quad (29)$$
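For clarity, the loss computation of Lines 12–14 in Algorithm 2 can be sketched with a generic PPO clipped objective. The function below is an illustrative PyTorch version, not the exact implementation used here; the sign convention folds the surrogate and entropy terms into a single quantity to be minimized, and the default factors follow Table 5.

```python
import torch

def ppo_loss(new_logp, old_logp, values, returns, eps=0.1, c1=0.5, c2=0.001):
    """Clipped PPO loss following Lines 12-14 of Algorithm 2 (generic form).

    new_logp / old_logp: log-probabilities of the sampled actions under the updated
    and behaviour policies; values: critic estimates V(a_t|s_t); returns: r_hat_t (Eq. (29)).
    """
    ratio = torch.exp(new_logp - old_logp)                  # u_t(theta), Line 12
    advantage = returns - values.detach()                   # A_hat_t, Line 13
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    l_clip = torch.min(ratio * advantage, clipped).mean()   # clipped surrogate objective
    l_vf = (returns - values).pow(2).mean()                 # mean square error term
    entropy = -(torch.exp(new_logp) * new_logp).mean()      # entropy bonus (approximate)
    # Combined loss (Line 14); minimizing it maximizes the surrogate and the entropy.
    return -l_clip + c1 * l_vf - c2 * entropy
```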

4. Experiments

4.1. Experimental Setting for AEOSSP

4.1.1. Datasets

The instance generation of the AEOSSP follows the characteristics of the satellite resources and orbits. The parameters of the instances are randomly generated according to a normal distribution, which can better increase the conflict between tasks. The parameter distributions of the instances are shown in Table 2, and the satellite capability parameters are shown in Table 3.
Based on the above rules, the experiment generates instances with task scales of 40, 60, 80, and 100. The instance with a task scale of 40 is shown in Figure 7, where each label gives the task ID and profit. The figure shows that the task VTW distribution is relatively dense, and the conflicts between tasks are sufficiently large to effectively reflect the performance of the algorithms.
We propose three indicators to measure the performance of the algorithms: the average scheduling profit (ASP), the average scheduling time (AST), and the percentage of excess scheduling profit (PSP). The ASP measures the solution quality of an algorithm at different scales, the AST reflects the timeliness of the algorithm, and the PSP indicates the relative difference in ASP between the proposed method and the other algorithms. All network training and experiments are run on an NVIDIA TITAN RTX GPU, an i9-11900K CPU, and 64.0 GB of memory. The algorithms are coded in Python, and the deep learning framework is PyTorch 1.9.0.

4.1.2. Competitors

To verify the validity of the proposed method, we use several construction heuristics as baselines for the AEOSSP. Specifically, we design the following four heuristics: the start time of observation time window ascending (STWA), the profit of task descending (PTD), the ratio of profit and image time descending (RPID), and the conflict degree of task descending (CDTD). The construction heuristic solving framework is shown in Algorithm 3. Each heuristic sorts the candidate tasks by its construction rule and inserts the tasks that satisfy the constraints into the solution sequence in turn until no task can be inserted. In CDTD, the conflict degree Cd_i is the number of VTW overlaps between task_i and the other tasks, defined by Equations (30) and (31). We also construct a GDNN-DQN baseline by training the GDNN with the deep Q-network (DQN) algorithm.
$$Cd_i = \sum_{k=1}^{n_{tsk}} o_{ik} \quad (30)$$
$$o_{ij} = \begin{cases}1, & \text{if } wb_i < we_j \wedge wb_j < we_i \\ 0, & \text{otherwise}\end{cases} \quad (31)$$
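A short sketch of the conflict degree computation in Equations (30) and (31) follows; it simply counts, for each task, the other tasks whose VTWs overlap it.

```python
def conflict_degrees(wb, we):
    """Conflict degree Cd_i of Eqs. (30)-(31): number of tasks whose VTWs overlap task i.

    wb, we: lists of VTW start and end times, indexed by task.
    """
    n = len(wb)
    cd = [0] * n
    for i in range(n):
        for k in range(n):
            if k != i and wb[i] < we[k] and wb[k] < we[i]:   # o_ik = 1, VTWs overlap
                cd[i] += 1
    return cd
```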
Algorithm 3 General framework of construction heuristics
Require: task sequence Tsk; sequence rule pt; ascending sign l_rank
Ensure: scheduling solution Sln
 1: Sln = ∅
 2: Sln_temp = ∅
 3: if l_rank == 1 then
 4:   Tsk = RankAscendingBy(Tsk, pt)
 5: else
 6:   Tsk = RankDescendingBy(Tsk, pt)
 7: end if
 8: for i = 0 to |Tsk| do
 9:   Sln_temp = Sln
10:   Sln_temp.InsertTask(Tsk_i)
11:   isFeasible = ConstraintCheck(Sln_temp)
12:   if isFeasible == TRUE then
13:     Sln = Sln_temp
14:   end if
15: end for
16: return Sln
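The framework of Algorithm 3 can also be expressed compactly in Python. In the sketch below, key (the sorting rule, e.g., VTW start time for STWA or profit for PTD) and constraint_check are assumed to be supplied by the caller; the sketch only mirrors the insertion logic.

```python
def construction_heuristic(tasks, key, ascending, constraint_check):
    """General construction heuristic framework (Algorithm 3).

    tasks: list of candidate tasks; key: sorting rule applied to each task;
    constraint_check(solution) returns True if the partial solution is feasible.
    """
    ordered = sorted(tasks, key=key, reverse=not ascending)
    solution = []
    for task in ordered:
        candidate = solution + [task]        # tentatively insert the task
        if constraint_check(candidate):      # keep it only if all constraints hold
            solution = candidate
    return solution
```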
In addition, to validate the efficiency of GDNN-PPO, some high-quality solution algorithms are required for comparison. As mentioned in Section 1, the existing exact methods simplify the time-dependent transition time, and the solution time of CPLEX is unacceptable when the number of VTWs exceeds 27; thus, we do not consider the exact algorithms. We compare the GDNN-PPO algorithm with three high-quality competitors: (1) GRILS [10], which is a state-of-the-art heuristic method for the AEOSSP; (2) self-adaptation differential evolution (SDE) [36], which is an algorithm that has been shown to be effective at solving the AEOSSP; and (3) the self-adaptation genetic algorithm (SGA) [14], which is an improved genetic algorithm designed to solve the AEOSSP. The decoding method of the SGA is the same as Algorithm 3. The crossover and mutation parameters are updated according to the proportion of entering the next generation. The parameters of the high-quality competitors are shown in Table 4.

4.2. Training Process Analysis

4.2.1. Training Parameters

The GDNN parameter settings are shown in Table 5.

4.2.2. GDNN Structure Optimization

A network with a structure that is too complex may affect the calculation speed and convergence, and a network with a structure that is too simple may be unable to characterize the problem well. To improve the efficiency of the GDNN, we analyze the GDNN structure.
Due to the limited availability of servers for large-scale training, we only considered three parameters: the embedding layer dimension F 3 , the number of hidden layers n h i d , and the number of embedding layers n e m .
For the embedding layer dimension F_3 and the number of hidden layers n_hid, experiments test networks of 128 × 4, 64 × 3, and 32 × 2 (F_3 × n_hid), corresponding to high, medium, and low dimensions. We train networks of different dimensions on instances with 100 tasks for 10,000 training episodes and test them on 50 instances with task scales of 40, 60, 80, and 100. Results are shown in Table 6. In addition, the number of embedding layers is tested with 3, 4, 5, and 6; these networks are trained on instances with a task scale of 40 for 10,000 training episodes and tested on 50 instances with a task scale of 40. The test results are shown in Table 7.
Table 6 shows that, at the current training cost, the low-dimensional network is the fastest but least profitable. The solution time of the high-dimensional network is longer, but its improvement in solution quality over the medium-dimensional network is small. For the number of embedding layers, the results in Table 7 show that the network with four embedding layers obtains the best ASP; the networks with five and six embedding layers have longer scheduling times but no improvement in scheduling profit. Therefore, the network parameters are set to F_3 = 64, n_hid = 3, and n_em = 4 in this study, and the GDNN-64×3 network with four embedding layers is considered the best network for the AEOSSP.

4.2.3. GDNN Training

In the proposed method, we train the GDNN with task scales of 40, 60, 80, and 100 using PPO. The training processes are shown in Figure 8. The average profit increases rapidly in the first 5000 episodes, indicating that the network parameters are being updated continuously. Around 10,000 episodes, the profit improvement begins to level off and remains stable. In general, after 50,000 training episodes, the network has converged stably.

4.3. Performance Analysis

We generate instances with scales of 40, 60, 80, and 100 to test the efficiency of the proposed method. Instances with different scales are solved by the GDNN trained on the corresponding scales. For comparison, we divide the competitors into construction heuristics and high-quality competitors.

4.3.1. Comparison with Construction Heuristics

For test data, we generate 1000 instances with 40, 60, 80, and 100 tasks; the experimental results are shown in Table 8. GDNN-PPO outperforms all competitors at all task scales, with the largest Wilcoxon test p-value being 0. Specifically, the PSP of GDNN-DQN reaches 11.6%, and the PSPs of the other four algorithms all exceed 45%. Therefore, GDNN-PPO can obtain high-quality solutions for the AEOSSP. Regarding AST, GDNN-PPO has the longest solution time in all instances; however, even for instances with 100 tasks, the AST of GDNN-PPO is within 2 s, which is acceptable given its high solution quality.

4.3.2. Comparison with High-Quality Competitors

Due to the randomness and high computational cost of the meta-heuristics, we choose one instance each at task scales of 40, 60, 80, and 100 and run each algorithm 30 times to obtain the ASP, AST, and PSP. The experimental results are shown in Table 9.
GRILS obtains the best ASP at all scales, demonstrating its state-of-the-art capability for the AEOSSP. GDNN-PPO outperforms SDE and SGA in ASP by more than 23%, and its advantage becomes more apparent as the scale increases. Although GDNN-PPO does not perform as well as GRILS in ASP (4.2% lower on average), it reduces the AST by orders of magnitude. Moreover, unlike the GDNN, which constructs the solution in a single pass, the meta-heuristics must repeatedly evaluate candidate solutions, which is time-consuming. As shown in Figure 9, the ASTs of GRILS and SDE increase nonlinearly and dramatically with the task scale. The AST of SGA increases more moderately but is still much higher than that of GDNN-PPO. Therefore, GDNN-PPO has a timeliness advantage over the high-quality competitors in solving large-scale AEOSSP instances.

4.3.3. Validation of Scalability

To validate the upwards and downwards scale-solving ability of the GDNN-PPO trained at a specific instance scale, we use GDNN-PPO-X, where X is the number of instance tasks used for training, to solve 50 test instances with different task scales and calculate the ASP. Results are shown in Table 10, and the gap between the ASP of GDNN-PPO-X and the optimal ASP is shown in Figure 10.
Because the conflict degree of VTWs increases with the task scale, the decision neural networks trained on instances of specific scales learn different strategies. All GDNN-PPO models obtain solutions with only a small gap from the best ones, highlighting the scalability of the method. Among the four models, GDNN-PPO-40 has the best scalability and achieves the best performance on the AEOSSP with task scales of 40, 60, and 80.

5. Conclusions

This paper proposes a graph-based DRL method called GDNN-PPO to solve the AEOSSP with time-dependent attitude transition time. We model the AEOSSP with the time-attitude adjacency graph and reformulate the problem as an MDP. Then, we extract the features of the AEOSSP, including node and edge features, and design a GDNN to guide task selection. Finally, we train the GDNN by PPO and design experiments to verify the validity of the proposed method. Experimental results show that GDNN-PPO outperforms all construction heuristics in ASP by at least 45% and surpasses the high-quality competitors, except for the state-of-the-art algorithm GRILS, with regard to ASP and AST. Compared with GRILS, the difference in ASP across all instances is less than 7%, whereas the AST of GRILS is 345 times longer than that of GDNN-PPO on average. Thus, GDNN-PPO scales well and responds rapidly when solving the AEOSSP, and it has strong potential for future applications in large constellations and new management models.
Although GDNN-PPO demonstrates significant advantages in solution time, there is still room to improve its scheduling profit. In future work, we plan to further improve the solving efficiency of the GDNN by optimizing the feature selection, the structure design, and related components. Additionally, we plan to combine the proposed method with other algorithms to solve more complex satellite scheduling problems, such as multi-satellite scheduling problems.

Author Contributions

Conceptualization, W.Y.; methodology, W.Y. and J.C.; software, W.Y. and J.C.; validation, J.C., G.W. and L.H.; resources, L.H. and X.L.; writing—original draft preparation, J.C. and L.H.; writing—review and editing, X.L. and L.X.; supervision, G.W.; funding acquisition, L.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (72001212), Young Elite Scientists Sponsorship Program by CAST (2022QNRC001) and Hunan Postgraduate Research Innovation Project (CX20210034).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wolfe, W.J.; Sorensen, S.E. Three scheduling algorithms applied to the earth observing systems domain. Manag. Sci. 2000, 46, 148–166. [Google Scholar] [CrossRef]
  2. Lemaître, M.; Verfaillie, G.; Jouhaud, F.; Lachiver, J.M.; Bataille, N. Selecting and scheduling observations of agile satellites. Aerosp. Sci. Technol. 2002, 6, 367–381. [Google Scholar] [CrossRef]
  3. Wang, P.; Tan, Y. A model, a heuristic and a decision support system to solve the scheduling problem of an earth observing satellite constellation. Comput. Ind. Eng. 2011, 61, 322–335. [Google Scholar] [CrossRef]
  4. Chu, X.; Chen, Y.; Tan, Y. A Branch and Bound Algorithm for Agile Earth Observation Satellite Scheduling. Adv. Space Res. 2017, 2017, 1–15. [Google Scholar] [CrossRef]
  5. Liu, X.; Laporte, G.; Chen, Y.; He, R. An adaptive large neighborhood search metaheuristic for agile satellite scheduling with time-dependent transition time. Comput. Oper. Res. 2017, 86, 41–53. [Google Scholar] [CrossRef]
  6. Jiang, X.; Tian, Z.; Liu, W.; Suo, Y.; Chen, K.; Xu, X.; Li, Z. Energy-efficient scheduling of flexible job shops with complex processes: A case study for the aerospace industry complex components in China. J. Ind. Inf. Integr. 2022, 27, 100293. [Google Scholar] [CrossRef]
  7. Tian, Z.; Jiang, X.; Liu, W.; Li, Z. Dynamic energy-efficient scheduling of multi-variety and small batch flexible job-shop: A case study for the aerospace industry. Comput. Ind. Eng. 2023, 178, 109111. [Google Scholar] [CrossRef]
  8. Li, R.; Gong, W.; Lu, C.; Wang, L. A Learning-Based Memetic Algorithm for Energy-Efficient Flexible Job-Shop Scheduling with Type-2 Fuzzy Processing Time. IEEE Trans. Evol. Comput. 2023, 27, 610–620. [Google Scholar] [CrossRef]
  9. Xu, R.; Chen, H.; Liang, X.; Wang, H. Priority-based constructive algorithms for scheduling agile earth observation satellites with total priority maximization. Expert Syst. Appl. 2016, 51, 195–206. [Google Scholar] [CrossRef]
  10. Peng, G.; Song, G.; He, Y.; Yu, J.; Xiang, S.; Xing, L.; Vansteenwegen, P. Solving the agile earth observation satellite scheduling problem with time-dependent transition times. IEEE Trans. Syst. Man Cybern. Syst. 2020, 52, 1614–1625. [Google Scholar] [CrossRef]
  11. Lin, W.C.; Liao, D.Y.; Liu, C.Y.; Lee, Y.Y. Daily imaging scheduling of an earth observation satellite. IEEE Trans. Syst. Man Cybern.-Part A Syst. Humans 2005, 35, 213–223. [Google Scholar] [CrossRef]
  12. Bianchessi, N.; Cordeau, J.F.; Desrosiers, J.; Laporte, G.; Raymond, V. A heuristic for the multi-satellite, multi-orbit and multi-user management of Earth observation satellites. Eur. J. Oper. Res. 2007, 177, 750–762. [Google Scholar] [CrossRef]
  13. Li, G.; Chen, C.; Yao, F.; He, R.; Chen, Y. Hybrid differential evolution optimisation for earth observation satellite scheduling with time-dependent earliness-tardiness penalties. Math. Probl. Eng. 2017, 2017, 2490620. [Google Scholar] [CrossRef]
  14. Li, Y.; Xu, M.; Wang, R. Scheduling Observations of Agile Satellites with Combined Genetic Algorithm. In Proceedings of the Third International Conference on Natural Computation (ICNC 2007), Haikou, China, 24–27 August 2007; Volume 3, pp. 29–33. [Google Scholar]
  15. Xiang, R. Agile Satellite Mission Scheduling Technology Research; National University of Defense Technology: Changsha, China, 2010. [Google Scholar]
  16. Tangpattanakul, P.; Jozefowiez, N.; Lopez, P. Biased random key genetic algorithm with hybrid decoding for multi-objective optimization. In Proceedings of the 2013 Federated Conference on Computer Science and Information Systems, Krakow, Poland, 8–11 September 2013; pp. 393–400. [Google Scholar]
  17. Sun, K.; Xing, L.N.; Chen, Y.W. Agile earth observing satellites mission scheduling based on decomposition optimization algorithm. Comput. Integr. Manuf. Syst. 2013, 19, 127–136. [Google Scholar]
  18. He, L.; Liu, X.; Laporte, G.; Chen, Y.; Chen, Y. An improved adaptive large neighborhood search algorithm for multiple agile satellites scheduling. Comput. Oper. Res. 2018, 100, 12–25. [Google Scholar] [CrossRef]
  19. Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer Networks. In Advances in Neural Information Processing Systems 28; Curran Associates, Inc.: Red Hook, NY, USA, 2015. [Google Scholar]
  20. Bello, I.; Pham, H.; Le, Q.V.; Norouzi, M.; Bengio, S. Neural Combinatorial Optimization with Reinforcement Learning. In Proceedings of the 5th International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017. [Google Scholar]
  21. Nazari, M.; Oroojlooy, A.; Snyder, L.V.; Takáč, M. Reinforcement Learning for Solving the Vehicle Routing Problem. In Advances in Neural Information Processing Systems 31; Curran Associates, Inc.: Red Hook, NY, USA, 2018. [Google Scholar]
  22. Joshi, C.K.; Laurent, T.; Bresson, X. An Efficient Graph Convolutional Network Technique for the Travelling Salesman Problem. In Proceedings of the INFORMS Annual Meeting, Washington, DC, USA, 20–23 October 2019. [Google Scholar]
  23. Zhou, X.; Wu, L.; Zhang, Y.; Chen, Z.S.; Jiang, S. A robust deep reinforcement learning approach to driverless taxi dispatching under uncertain demand. Inf. Sci. 2023, 646, 119401. [Google Scholar] [CrossRef]
  24. Wang, D.; Hu, M.; Weir, J.D. Simultaneous task and energy planning using deep reinforcement learning. Inf. Sci. 2022, 607, 931–946. [Google Scholar] [CrossRef]
  25. Chen, M.; Chen, Y.; Chen, Y.; Qi, W. Deep Reinforcement Learning for Agile Satellite Scheduling Problem. In Proceedings of the 2019 IEEE Symposium Series on Computational Intelligence (SSCI), Xiamen, China, 6–9 December 2019. [Google Scholar]
  26. Zhao, X.; Wang, Z.; Zheng, G. Two Phase Neural Combinatorial Optimization with Reinforcement Learning for Agile Satellite Scheduling. J. Aerosp. Inf. Syst. 2020, 17, 346–357. [Google Scholar] [CrossRef]
  27. Wei, L.; Chen, Y.; Chen, M.; Chen, Y. Deep reinforcement learning and parameter transfer based approach for the multi-objective agile earth observation satellite scheduling problem. Appl. Soft Comput. 2021, 110, 107607. [Google Scholar] [CrossRef]
  28. Pralet, C.; Verfaillie, G. Time-dependent simple temporal networks. In Principles and Practice of Constraint Programming, Proceedings of the 18th International Conference, Quebec City, QC, Canada, 8–12 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 608–623. [Google Scholar]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  30. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  31. Xu, B.; Wang, N.; Chen, T.; Li, M. Empirical Evaluation of Rectified Activations in Convolutional Network. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015. [Google Scholar]
  32. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines Vinod Nair. In Proceedings of the International Conference on International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010. [Google Scholar]
  33. Wu, W.; Sun, D.; Jin, K.; Sun, Y.; Si, P. Proximal policy optimization-based committee selection algorithm in blockchain-enabled mobile edge computing systems. China Commun. 2022, 19, 50–65. [Google Scholar] [CrossRef]
  34. Sutton, R.S.; Mcallester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation; MIT Press: Cambridge, MA, USA, 1999. [Google Scholar]
  35. Jentzen, A.; Kuckuck, B.; Neufeld, A.; von Wurstemberger, P. Strong error analysis for stochastic gradient descent optimization algorithms. IMA J. Numer. Anal. 2020, 41, 455–492. [Google Scholar] [CrossRef]
  36. Yang, W.; Chen, Y.; He, R.; Chang, Z.; Chen, Y. The bi-objective active-scan agile earth observation satellite scheduling problem: Modeling and solution approach. In Proceedings of the 2018 IEEE Congress on Evolutionary Computation (CEC), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–6. [Google Scholar]
Figure 1. Time-attitude adjacency graph of AEOSSP. The number indicates the task number, the black arrow indicates the task that can be selected in the current state, and the red arrow indicates the task that is actually selected.
Figure 2. Solution construction process. The numbers in the green squares represent the solution.
Figure 3. Three situations of EarliestImageCal. (a) shows that task OW is satisfied with the constraint, (b) demonstrates that task OW is not satisfied with the constraint, and (c) represents the situation where the linear approximation method can be used.
Figure 4. Solution construction process for the AEOSSP.
Figure 5. Problem feature extraction.
Figure 6. GDNN structure for the AEOSSP.
Figure 7. Instance with a task scale of 40.
Figure 8. GDNN training process.
Figure 9. AST of GDNN-PPO and high-quality competitors.
Figure 10. Gap between the solutions of GDNN-PPOs and the optimal solution.
Table 1. Parameter description.
Parameter | Description
n_tsk | The number of tasks
i, j | The index of tasks, i, j = 0, 1, ..., n_tsk; 0 denotes a virtual task
pri_i | The profit of task_i, pri_0 = 0
θ_{i,t} | The roll angle of the satellite to task_i at time t
φ_{i,t} | The pitch angle of the satellite to task_i at time t
ψ_{i,t} | The yaw angle of the satellite to task_i at time t
wb_i | The VTW start time of task_i, wb_0 = 0
we_i | The VTW end time of task_i, we_0 = 0
tb_i | The OW start time of task_i; tb_0 = 0 is the initial state time
te_i | The OW end time of task_i; te_0 = 0 is the initial state time
ct_i | The observation duration of task_i, ct_0 = 0
ba_i | The satellite attitude at tb_i, determined by the roll θ_{i,tb_i}, pitch φ_{i,tb_i}, and yaw ψ_{i,tb_i} angles; ba_0 is the initial attitude of the satellite
ea_i | The satellite attitude at te_i, determined by the roll θ_{i,te_i}, pitch φ_{i,te_i}, and yaw ψ_{i,te_i} angles; ea_0 is the initial attitude of the satellite
ρ_ij | The attitude transition angle between task_i and task_j
trans(ea_i, ba_j) | The attitude transition time between task_i and task_j
x_ij | Binary decision variable indicating whether task_i is the former task of task_j
Table 2. Parameter distribution of the instances.
Parameter | Distribution | Details
θ_side | U(−θ_max, θ_max) | θ_max = 45°
χ ¹ | U(λ·n_tsk + tw_max/2, T_plan − λ·n_tsk − tw_max/2) | tw_max = 300 s, T_plan = 5400 s
t_side | U(χ − λ·n_tsk, χ + λ·n_tsk) | λ = 12
ct | U(ct_min, ct_max) | ct_min = 5 s, ct_max = 20 s
pri | U(1, pri_max) | pri_max = 10
l_tw ² | U(tw_max/4, tw_max/2) | tw_max = 300 s
¹ The intermediate time of the scheduling period. ² The length of the VTW. The above parameters are generated as integers.
Table 3. Satellite capability parameters.
Parameter | Description | Value
θ_max | Maximum roll angle | 45°
φ_max | Maximum pitch angle | 45°
ψ_max | Maximum yaw angle | 90°
Egy | Initial power | 5000 units
ζ | Minimum power | 0.05 unit
θ_0 | Initial roll angle | 0°
φ_0 | Initial pitch angle | 0°
ψ_0 | Initial yaw angle | 0°
t_0 | Initial state time | 0 s
u_te | Power consumption for attitude transition | 2 units/s
u_ie | Power consumption for observation | 2 units/s
Table 4. Parameter settings of high-quality competitors.
Method | Parameters
GRILS | Maximum iteration number with no improvement: 300; StartGreed = 1; GreedRange = 0.2; GreedDecrease = 0.02; place of start removal S_d = 1; number of observations removed each time R_d = 1
SDE | Population size Pop = 50; iteration number N_iter = 200; initial crossover probability Cr = 0.5; initial mutation probability Mp = 0.2; encoding method: real-valued encoding; offspring selection: championship; elite solution retention ratio: 0.1
SGA | Population size Pop = 50; iteration number N_iter = 200; initial crossover probability Cr = 0.8; initial mutation probability Mp = 0.2; encoding method: integer encoding; offspring selection: roulette; elite solution retention ratio: 0.1
Table 5. Main parameters of the GDNN training.
Parameter | Description | Value
ϵ | Clipping (trim) function parameter | 0.1
c_1 | Mean square error factor | 0.5
c_2 | Entropy factor | 0.001
K | Batch size | 32
T_p | Parameter update step | 1024
k | Parameter update optimization times | 3
N | Number of training episodes | 50,000
lr | Learning rate | 0.0005
Table 6. Experimental results of different dimensional networks in instances with different scales.
Instance Scale | 32×2 ASP | 32×2 AST (s) | 64×3 ASP | 64×3 AST (s) | 128×4 ASP | 128×4 AST (s)
40 | 176.4 | 0.240 | 183.7 | 0.263 | 183.5 | 0.309
60 | 263.9 | 0.443 | 266.3 | 0.488 | 269.8 | 0.637
80 | 349.7 | 0.757 | 356.6 | 0.868 | 354.0 | 1.180
100 | 433.2 | 1.244 | 439.4 | 1.466 | 437.6 | 1.843
Bold indicates the best ASP in all algorithms.
Table 7. Experimental results of different numbers of embedding layers in instances with a task scale of 40.
Embedding Layers | 3 | 4 | 5 | 6
ASP | 186.3 | 188.6 | 188.3 | 186.2
AST (s) | 0.192 | 0.228 | 0.269 | 0.301
Bold indicates the best ASP and AST in all algorithms.
Table 8. Experimental results of construction heuristics in different instances.
Instance Scale | 40 | 60 | 80 | 100
GDNN-PPO ASP | 190.3 | 277.4 | 364.0 | 451.7
GDNN-PPO AST (s) | 0.283 | 0.581 | 1.175 | 1.671
GDNN-PPO PSP (%) | 0.00 | 0.00 | 0.00 | 0.00
GDNN-DQN ASP | 180.1 | 257.3 | 328.1 | 370.2
GDNN-DQN AST (s) | 0.206 | 0.392 | 0.659 | 1.134
GDNN-DQN PSP (%) | 5.61 | 7.82 | 10.94 | 22.01
STWA ASP | 115.4 | 172.1 | 227.9 | 284.5
STWA AST (s) | 0.004 | 0.006 | 0.008 | 0.010
STWA PSP (%) | 64.82 | 61.24 | 59.71 | 58.75
PTD ASP | 125.6 | 187.8 | 249.1 | 311.1
PTD AST (s) | 0.007 | 0.012 | 0.020 | 0.027
PTD PSP (%) | 51.49 | 47.74 | 46.16 | 45.21
RPID ASP | 121.2 | 181.6 | 240.7 | 301.0
RPID AST (s) | 0.007 | 0.013 | 0.020 | 0.028
RPID PSP (%) | 57.02 | 52.75 | 51.22 | 50.07
CDTD ASP | 99.6 | 149.7 | 197.2 | 246.6
CDTD AST (s) | 0.007 | 0.013 | 0.020 | 0.028
CDTD PSP (%) | 90.93 | 85.36 | 84.63 | 83.17
Bold indicates the best ASP in all algorithms.
Table 9. Experimental results of meta-heuristics in different instances.
Instance Scale | 40 | 60 | 80 | 100
GDNN-PPO ASP | 197.0 | 272.0 | 380.0 | 454.0
GDNN-PPO AST (s) | 0.270 | 0.568 | 1.162 | 1.676
GDNN-PPO PSP (%) | 0.00 | 0.00 | 0.00 | 0.00
GRILS ASP | 201.5 | 281.4 | 397.8 | 487.4
GRILS AST (s) | 73.452 | 224.986 | 370.964 | 600.308
GRILS PSP (%) | 2.23 | 3.33 | 4.47 | 6.86
SDE ASP | 159.0 | 215.7 | 294.0 | 348.8
SDE AST (s) | 58.760 | 162.508 | 342.463 | 618.514
SDE PSP (%) | 23.90 | 26.10 | 29.27 | 30.17
SGA ASP | 154.5 | 194.6 | 294.1 | 336.1
SGA AST (s) | 24.170 | 32.661 | 45.607 | 57.583
SGA PSP (%) | 27.54 | 39.80 | 29.19 | 35.09
Bold indicates the best ASP in all algorithms.
Table 10. Experimental results of scalability in different instances.
Task Scale | 40 | 60 | 80 | 100
GDNN-PPO-40 | 190.62 | 278.62 | 367.82 | 448.24
GDNN-PPO-60 | 188.90 | 277.58 | 366.74 | 449.14
GDNN-PPO-80 | 188.36 | 277.42 | 365.74 | 449.70
GDNN-PPO-100 | 186.96 | 277.42 | 366.00 | 447.02
Bold indicates the best ASP in all algorithms.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
