Article

An Actor–Critic-Based Hyper-Heuristic Autonomous Task Planning Algorithm for Supporting Spacecraft Adaptive Space Scientific Exploration

1 Key Laboratory of Electronics and Information Technology for Space System, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Aerospace 2025, 12(5), 379; https://doi.org/10.3390/aerospace12050379
Submission received: 13 March 2025 / Revised: 11 April 2025 / Accepted: 23 April 2025 / Published: 28 April 2025
(This article belongs to the Special Issue Intelligent Perception, Decision and Autonomous Control in Aerospace)

Abstract: Traditional spacecraft task planning has relied on ground control centers issuing commands through ground-to-space communication systems. However, as the number of deep space exploration missions grows, ground-to-space communication delays have become significant, degrading real-time command and control and increasing the risk of missed opportunities for scientific discovery. Adaptive Space Scientific Exploration requires spacecraft to make autonomous decisions and to complete both known and unknown scientific exploration missions without ground control. Based on this requirement, this paper proposes an actor–critic-based hyper-heuristic autonomous mission planning algorithm that performs mission planning and execution at different levels to support spacecraft Adaptive Space Scientific Exploration in deep space environments. At the low level of the hyper-heuristic algorithm, the particle swarm optimization, grey wolf optimization, differential evolution, and sine cosine optimization algorithms serve as the basic operators. At the high level, a reinforcement learning strategy based on the actor–critic model, combined with a deep neural network architecture, constitutes the framework for selecting among the low-level heuristic algorithms. Experimental results show that the algorithm meets the requirements of Adaptive Space Scientific Exploration and produces solutions with higher comprehensive evaluation scores in the tests. This study also designs an example application of the algorithm to a space engineering mission based on a space–ground collaborative control system to demonstrate its usability. The proposed method provides autonomous mission planning for spacecraft in the complex and ever-changing deep space environment, supports the further construction of spacecraft autonomous capabilities, and is of great significance for improving the efficiency of deep space exploration missions.

1. Introduction

Traditional spacecraft payload operation modes predominantly rely on ground control centers to issue commands through Earth–space communication systems, guiding spacecraft in task planning, orbital adjustments, data collection, and processing. This approach is highly efficient in Earth–Moon systems or near-Earth orbital missions, where communication delays are minimal, allowing ground control centers to monitor spacecraft status almost in real time and adjust mission plans accordingly. However, as space exploration progressively delves into the unknown realms of deep space, accompanied by an increasing number of deep space exploration projects, this mode of operation faces significant challenges. The extended distances of Earth–space communication result in considerable communication delays, adversely affecting real-time command control and increasing the risk of missing scientific discovery opportunities. Moreover, the uncertainty of deep space exploration environments, coupled with the often unknown nature of exploration targets, demands a higher level of spacecraft autonomy.
Adaptive Space Scientific Exploration (ASSE) refers to the capability of spacecraft to make autonomous decisions based on their limited capabilities, resources, and knowledge, without reliance on ground control centers, to fulfill both known and unknown scientific exploration tasks. This requires spacecraft to determine the necessary tasks and objectives based on telemetry data, health status, operational parameters, and the real-time conditions of deep space, ensuring high operational precision, robust performance, strong adaptability to the environment, and longevity. Implementing ASSE necessitates the development of a system architecture platform that supports intelligent capabilities for spacecraft, along with leveraging a variety of intelligent technologies to enhance autonomy.
Viewed through the lens of the capabilities of a spacecraft's integrated electronic systems, the ASSE paradigm requires several pivotal capabilities: the ability to discover and identify both unknown and known scientific targets, an autonomous mission planning capability for generating the requisite tasks or instructions, an autonomous task execution management capability, and a comprehensive self-management and monitoring capability. Notably, the spacecraft's autonomous mission planning capability, which builds on the outcomes of target identification and strongly influences the spacecraft's subsequent operational state, has emerged as a central research topic in spacecraft autonomous operation and control. Deploying such planning methods directly on spacecraft, rather than in terrestrial control systems, places heightened demands on the flexibility, robustness, and reliability of both the systems and the algorithms.
Within the domain of autonomous operation and control of spacecraft, Daniel D. Dvorak and his colleagues have pioneered a revolutionary operational paradigm termed goal-driven operation [1]. This paradigm marks a fundamental shift in operational underpinnings from executing a sequenced set of commands to a declarative specification of operational intents, thereby facilitating guidance via well-defined objectives. This method significantly augments operational robustness amidst uncertainties and amplifies the system’s autonomous decision-making capabilities through the lucid articulation of operational intents.
In this study, the essence of spacecraft task planning is fundamentally oriented toward the planning of “objectives”. During autonomous operation, spacecraft are required to complete numerous pending tasks, among which there are variations in priority, resource consumption, and scientific detection needs. The sequence in which tasks are executed significantly impacts the spacecraft’s operational status, which in turn directly affects the efficiency, quality, and stability of scientific detection. Therefore, the development of a planning algorithm that is both adaptable to the deep space environment and capable of effectively addressing these challenges is particularly crucial.
Confronted with the dynamic nature and unpredictability of space exploration, the limitations of existing planning methods are increasingly evident. These traditional approaches are largely based on the predictability of a pre-set environment, utilizing pre-programmed instructions and models. Researchers choose appropriate heuristic methods to optimize task sequences based on controlled environments. However, the effectiveness of these methods relies on accurate environmental predictions, making them ill-suited for unknown or changing objectives. The complexity of deep space exploration demands that planning algorithms possess a high degree of flexibility and adaptability to address unforeseen challenges. In short, a singular algorithm cannot guarantee superiority across all environments and instances in deep space due to a lack of necessary adaptive mechanisms. This discrepancy can lead to difficulties in estimating convergence speed and optimization efficiency, as well as adapting to the unknown environments of deep space, missing opportunities for scientific discovery, and sometimes even jeopardizing the success of missions. This situation underscores the urgent need to move beyond traditional methods and adopt more advanced planning strategies.
Hyper-heuristic algorithms offer a novel approach to tackling such issues by operating at a higher level of abstraction. This method manages and manipulates a series of low-level heuristics (LLH) to forge new heuristic solutions, applicable to a wide array of combinatorial optimization problems [2]. The process encompasses two principal layers: initially, at the problem’s lower level, algorithms build mathematical models grounded in the problem’s representation and characteristics, designing specific solutions within a predetermined meta-heuristic framework. At a higher heuristic layer, an “intelligent computation expert” role is established, employing an efficient management and manipulation mechanism to derive new heuristic solutions from the lower level’s algorithm library and characteristic information [3]. This design enables the intelligent computation expert to autonomously select the most fitting heuristic algorithm based on current environmental data, allowing, in theory, for adaptability to various environmental conditions if the lower-level algorithms are judiciously chosen.
Traditional hyper-heuristic algorithm research has primarily focused on methods based on Simple Random [4], Choice Function [5], Modified Choice Function [6], Tabu Search [7], Ant Colony [8], and reinforcement learning [9,10,11]. Recently, integrating hyper-heuristic algorithms based on reinforcement learning with neural network technologies has emerged as a growing research area. This new direction aims to develop more accurate and reliable “intelligent computation experts” through deep reinforcement learning methods [12], capitalizing on neural networks’ non-linearity, adaptability, robustness, and parallel information processing capabilities. Deep reinforcement learning utilizes the perceptual power of deep learning to comprehend the environment, combined with the decision-making mechanisms of reinforcement learning, to identify optimal behavioral strategies across various settings [13]. By amalgamating deep reinforcement learning methods, optimized neural network architectures, and diverse heuristic algorithms, this approach not only merges the strengths of multiple technologies but also crafts algorithms specifically for deep-space mission planning, thereby enhancing spacecraft’s planning capabilities in the face of complex and changing environments.
Based on this, the main contributions of this paper are as follows:
  • Relying on the philosophy of building autonomous capabilities onboard spacecraft and the “goal-driven” methodology, this study introduces a spacecraft task planning framework designed to meet the requirements of adaptive scientific exploration and conducts mathematical modeling of the planning issue.
  • Based on a mathematical model, we designed an Actor–Critic-based Hyper-heuristic Autonomous Task Planning Algorithm (AC-HATP) to support spacecraft Adaptive Space Scientific Exploration.
  • At the lower tier of hyper-heuristic algorithms, by considering three aspects, namely global search capabilities, quality of solution optimization, and speed of convergence, we selected and designed suitable heuristic algorithms, establishing an algorithmic library.
  • At the higher tier of hyper-heuristic algorithms, we employed a reinforcement learning strategy based on the actor–critic model, in conjunction with a deep neural network architecture, to construct the high-level heuristic selection framework.
  • Through designed experiments, our research validated that the algorithm meets the needs of adaptive scientific exploration. Compared with other algorithm types, it was demonstrated that our approach achieves faster convergence speeds and superior solution quality in addressing deep space exploration challenges.

2. Related Work

The problem of autonomous mission planning for spacecraft can be defined as an optimization issue of how to efficiently allocate a series of tasks to satellites within the constraints of limited satellite maneuverability and fixed time windows, aiming to maximize payload utilization efficiency and optimize target information collection [14]. This represents a variant of common planning optimization problems. Planning typically involves specifying a problem’s initial and target states, along with a description of actions, requiring the automatic discovery of a sequence of actions that allows the system to transition from the initial state to the target state [15]. In this paper, we categorize autonomous spacecraft mission planning methods into three main groups for discussion: traditional methods (including planning based on predicate logic, network graphs, and timelines), heuristic algorithm methods, and reinforcement learning methods.

2.1. Traditional Methods of Autonomous Task Planning

The earliest methods for addressing planning problems relied on strict linguistic and logical structures for problem description and resolution. At this stage, various planning representation methods existed, such as first-order logic [16] and situation calculus [17], which served as second-order logic languages depicting the dynamics of the world. Subsequently, researchers like Nilsson introduced the STRIPS planning description methodology, marking the preliminary formation of methodologies within the planning domain [18]. Building upon the foundation of STRIPS, the planning research domain gradually developed a mature description language: McDermott [19] formally proposed the Planning Domain Definition Language (PDDL), which has since undergone continuous refinement and development, resulting in multiple versions including PDDL 2.1 [20], PDDL 2.2 [21], and PDDL 3.0 [22], and even the advent of the PDDL+ version [23]. Owing to its outstanding features, PDDL has been widely applied in spacecraft mission planning, particularly in autonomous mission planning on satellites, for task description and modeling. For instance, researchers like ZHU Liying, in the domain of autonomous flight intelligent planning for small body exploration, through designing knowledge models based on PDDL, mathematical models based on CSP, and solving algorithms based on genetic strategies, effectively achieved efficient task management and simplified operational processes [24]. Researchers like Emma, by integrating task planning with execution monitoring, have enhanced the autonomous operational capabilities of space robots, especially through enhancing the robots' intelligent task processing with PDDL [25]. Li Xuan [26] used PDDL to model and validate inter-satellite transmission mission planning in a collaborative satellite network where microwave and laser links coexist. Researchers such as Ma Manhao have utilized PDDL to focus on the constraints between observation tasks, modeling from the aspects of constraints, activities, and planning objectives, thus constructing an applied mission planning model for Earth Observation Satellites (EOSs) [27]. Researchers like Chen [28], based on a deep analysis of the characteristics of imaging satellite mission planning issues, used PDDL to address duration constraints, complex resource constraints, and special external resource constraints, establishing a mission planning model for urban and rural satellites. Xue [29] established an autonomous mission planning model for satellites in emergency situations, achieved the definition of constraints based on PDDL, and constructed a model that comprehensively considers the constraints of satellite platforms and payloads.
In the practical planning process, especially when facing time-sensitive planning issues, relying solely on the descriptive method of predicate propositional logic makes it challenging to adequately describe factors such as time constraints. Conversely, the timeline model, with its straightforward and intuitive representation of time constraints, becomes an ideal choice for satellite mission planning with lower concurrency demands and simpler tasks. For example, researchers like Xu [30], in the study of autonomous mission planning for deep space probes, have adopted representation methods based on states and state timelines to describe the tasks of the probes and their constraint relationships. Researchers like Wang [31], through an object-oriented formal description method, categorized domain knowledge into four models, including the timeline model, and simplified the method of establishing constraints. NASA's ASPEN system [32], built around an iterative repair philosophy, incorporates a variety of reasoning mechanisms and offers a new type of mission planning method centered on the state timeline. The European Space Agency, based on the ASPI system, models the timeline pattern of scientific tasks, treats defects as the core object, and drives the planning process through collecting, selecting, and resolving defects [33].
Additionally, various methodologies have been adopted for solving spacecraft mission planning challenges. The SPIKE system in the United States, designed to cater to the servicing needs of the Hubble Space Telescope, employs an algorithm based on the Constraint Satisfaction Problem (CSP) for task planning [34]. Jiang [35] devised a task planning strategy based on constraint grouping. This method, which places a premium on action constraints, circumvents the issue of diminished constraint capacity as the problem size expands. Du [36] utilized colored Petri nets for system modeling, categorizing the model into a top-level model, a control model, a target imaging mission planning model, and an image transmission mission planning model, thereby applying this planning approach to imaging satellites. Liang [37] designed an autonomous mission planning method based on a priority approach. This method, which leverages timeline technology and takes task priorities into consideration, facilitates the effective planning of task sequences. Bucchioni [38] proposed an innovative rendezvous strategy in cis-lunar space, combining passive and active collision avoidance to ensure safety during the approach to the Moon's L2 point, filling a gap in the literature on autonomous guidance systems in the presence of third-body influences and significantly advancing the field of autonomous mission planning.

2.2. Task Planning Based on Heuristic and Metaheuristic Algorithms

Heuristic algorithms are strategies that rely on experience and intuition to find solutions, particularly suitable for scenarios where precise solutions cannot be obtained within a reasonable time frame. Although heuristic algorithms do not guarantee the optimal solution, they often provide a satisfactory solution within an acceptable timeframe. For example, NASA's DS-1 spacecraft utilized a planning-space-based heuristic algorithm for mission planning. This method demonstrates excellent scalability and partial orderliness in outcomes, thereby enhancing the flexibility of execution planning [39]. Xue [29] employed Relaxation-based Graph Planning (RGP), an Enhanced Hill-Climbing Method, and Greedy Best-First Search (GBFS) to segment satellite mission planning into sequence planning and time scheduling. Chang et al. [40] addressed the challenges in planning for optical video satellites with variable imaging durations, proposing a Simple Heuristic Greedy Algorithm (SHGA) to enhance planning performance. Zhao et al. [41] explored the scheduling of satellite observation missions, implementing a task clustering planning algorithm to improve the observational efficiency of agile satellites and using tabu search to generate local and global observation paths within the clustered regions. Jin et al. [42] introduced a heuristic estimation strategy and search algorithm to enhance planning efficiency on spacecraft, with experimental results showing superior performance compared to Europa2. Federici [43] solved the optimal design of an active space debris removal mission, formulated as a problem similar to the time-dependent orienteering problem, with the A* algorithm.
Metaheuristic algorithms (MHAs) represent a sophisticated optimization strategy, aimed at guiding and controlling heuristic search processes to identify the best possible solutions within a solution space. The primary advantage of these algorithms is their independence from specific domain knowledge, which endows them with significant versatility, allowing their application across a wide range of optimization challenges. Common metaheuristic algorithms include genetic algorithms, simulated annealing, and particle swarm optimization. Notably, these algorithms have been extensively explored for the autonomous task planning of spacecraft. For instance, Long [44] developed an autonomous management and collaboration architecture for multi-agent systems tailored to the complexity and variability of managing multi-satellite systems. They introduced a hybrid genetic algorithm with simulated annealing (H-GASA) to address autonomous mission planning challenges in multi-satellite cooperation. Xiao et al. [45] investigated a hybrid optimization algorithm that integrates tabu search and an enhanced ant colony optimization algorithm, designed to tackle the maintenance task planning of large-scale space solar power stations. Wang [46], considering time and resource constraints, proposed the concept of dynamic resources and devised an individual coding rule based on fixed-length integer sequence coding to reduce the search space. They introduced a genetic algorithm that combines multi-mode crossover and mutation, and designed a replanning algorithm framework based on rolling horizon replanning. Zhao and Chen [47], in the context of Earth observation satellite design, incorporated a two-generation competition mechanism and an optimal retention strategy into an improved genetic algorithm to address local multi-conflict observation tasks. Feng et al. [48] designed a payload mission planning algorithm based on genetic algorithms capable of generating a complete command sequence according to tasks and directives, thereby implementing an autonomous operation system architecture for spacecraft based on multi-agent systems.

2.3. Task Planning Based on Reinforcement Learning

Reinforcement learning (RL) is an algorithm that learns optimal behavioral strategies through a system of rewards and punishments. In the context of spacecraft task planning, RL algorithms refine actions through a process of defining functions and actions, utilizing feedback on the effects of these actions on the final outcome to achieve an optimal solution. Despite the inherent conflict between the exploratory nature of RL and the high reliability requirements of spacecraft, which has limited its application in the aerospace field, ongoing research in this area has led to the exploration of this artificial intelligence technique in spacecraft task planning and decision-making processes. Harris et al. [49] have applied deep reinforcement learning (DRL) to spacecraft decision-making challenges, addressing issues of problem modeling, dimensionality reduction, simplification using expert knowledge, sensitivity to hyperparameters, and robustness, and ensured safety by integrating appropriately designed control techniques. Hu et al. [9] proposed an end-to-end DRL-based step planner named SP-ResNet for global path planning of planetary rovers, employing a dual-branch residual network for action value estimation, validated on the real lunar terrain of the CE2TMap2015 dataset. Huang et al. [50] explored the scheduling of Earth observation satellite missions, adopting a deep deterministic policy gradient algorithm to address the problem of continuous-time satellite mission scheduling, with experimental results indicating superiority over traditional meta-heuristic optimization algorithms. Wei et al. [51] introduced a method based on deep reinforcement learning and parameter transfer (RLPT) for iteratively solving the Multi-Objective Agile Earth Observing Satellite Scheduling Problem (MO-AEOSSP), surpassing three classical multi-objective evolutionary algorithms (MOEAs) in terms of solution quality, distribution, and computational efficiency, demonstrating high universality and scalability. Zhao et al. [52] proposed a dual-phase neural combinatorial optimization method based on reinforcement learning for the scheduling of agile Earth observing satellites (AEOSs). Eddy and Kochenderfer [53] presented a semi-Markov decision process (SMDP) formulation for satellite mission scheduling that considers multiple operational goals and plans transitions between different functional modes. This method performed comparably to baseline methods in single-objective scenarios with faster speed and achieved higher scheduling rewards in multi-objective scenarios.

2.4. Summary of Related Work

The related work discussed above is summarized in Table 1.
In the complex and uncertain environment of deep space, traditional rule-based methods are limited in flexibility, making it challenging to meet the requirements for problem resolution. Heuristic and meta-heuristic algorithms, constrained by their generalization capabilities and environmental adaptability, require integration with a high-level architecture of hyper-heuristic algorithms, and the construction of a heuristic algorithm library at the lower level to cover as diverse a range of potential environmental states as possible. Regarding reinforcement learning methods, given the unique environmental constraints in deep space, using online reinforcement learning to train models is impractical. For offline reinforcement learning methods, the significant increase in environmental uncertainty may result in trained models that fail to meet practical needs, presenting challenges similar to those faced by meta-heuristic algorithms. Therefore, considering the autonomous learning capabilities of reinforcement learning, although it cannot be directly applied to spacecraft onboard task planning in deep space, it can serve as an upper-layer “expert system” within a hyper-heuristic algorithm framework, responsible for selecting appropriate algorithms. By designing a sufficiently rich algorithm library, and utilizing reinforcement learning for algorithm selection, the system can adapt to a broad and variable environment, thus enhancing the model’s adaptability. Moreover, the limited action space of this type of reinforcement learning significantly reduces the difficulty of training the model.
Consequently, this study proposes a hyper-heuristic algorithm framework with reinforcement learning at the upper layer and meta-heuristic algorithms at the base layer, aimed at enhancing the adaptability of spacecraft in the uncertain conditions of deep space.

3. Framework and Modeling for Spacecraft Onboard Autonomous Task Planning

According to the literature [54], spacecraft operating in deep space can have their intelligence levels classified into three categories: “automatic”, “autonomous”, and “self-governing”. In this classification, “automatic” spacecraft are capable of substituting manual operations with software, hardware, and algorithms, though their operations still depend on human intervention, such as receiving and executing commands. At the “autonomous” level, spacecraft simulate human operational processes and are able to independently carry out simple task executions and self-learning, such as executing commands in a pre-determined sequence. “Self-governing” spacecraft are capable of analyzing their current state and surrounding environment, and making rational decisions based on this analysis to more effectively achieve predefined objectives.
ASSE poses a challenge for spacecraft intelligence capabilities to transition from autonomous to self-governing operation, ensuring stable and continuous functioning in the complex environment of deep space. This requires comprehensive coordination and integration across three critical aspects: spacecraft architectural design, data description methods, and algorithm development. Firstly, it is essential to design a spacecraft architecture that supports self-governing capabilities, facilitates the operation and deployment of relevant algorithms, and controls the spacecraft based on the outcomes of these algorithms. Secondly, to enable effective data interchange between the architecture and algorithms, a suitable data description format must be designed. Finally, problem-specific algorithms need to be developed and deployed on the architecture using established data description formats.

3.1. Target-Driven and Task-Level Objective Commands

Traditional spacecraft operations primarily involve the method of data injection to transmit action commands to the spacecraft. This method is widely used due to its directness and reliability. However, with the increasing uncertainties of deep space exploration missions and extended communication delays, this approach can result in missed opportunities for scientific objectives, thus impacting the efficiency of the exploration. As the number of spacecraft increases and operational modes become more mature, some routine operations can transition from manual to automatic execution. Consequently, the scope of spacecraft operations should shift from specific “actions” to specific “objectives”, allowing the spacecraft to autonomously select and execute commands that align with the current objectives. This operational mode is referred to as “goal-driven”.
According to research by MAULLO [55], the concept of “goal-driven” operations involves shifting the basis of operations from a sequence of command instructions to declarative operational intentions, or goals, thereby reducing the workload of operators and allowing them to focus on “what” to do rather than “how” to do it. This method enhances the system’s autonomy and its ability to respond to unpredictable environments. By clearly defining operational intentions, the system can verify the successful achievement of objectives and, when necessary, employ alternative methods to achieve these goals [1].
Through the use of “goal-oriented” commands, spacecraft can encapsulate specific sequences of behavioral instructions, thereby concentrating on the objectives to be fulfilled rather than the operational details of the commands themselves. These commands do not have a fixed design framework; each organization and spacecraft manufacturer can customize the command format based on their specific requirements. In this study, given the emphasis on planning for spacecraft mission objectives, these are designated as “Task-Oriented Commands” (TOCs) [56].
This research utilizes TOCs as the fundamental unit of mission planning. When a spacecraft is required to manage multiple TOCs simultaneously, it must holistically assess the current resource information, the resource consumption associated with the objectives, and the environmental conditions in which the spacecraft operates. An optimal mission execution strategy is then selected, aiming to achieve maximum efficiency and minimal resource consumption in the shortest possible time and with the fewest iterations.
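For illustration only, a minimal Python sketch of how a TOC might be represented in the onboard planner's memory is given below. The field names (goal_id, priority, resource_demand, expected_benefit, time_window) are assumptions introduced here for readability and are not the command format defined in [56].

from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class TaskOrientedCommand:
    """Hypothetical in-memory view of a task-level objective command (TOC)."""
    goal_id: str                                  # identifier of the objective to achieve
    priority: int                                 # relative importance of the objective
    resource_demand: Dict[str, float]             # resource type -> expected consumption
    expected_benefit: float                       # anticipated scientific benefit of fulfilling the goal
    time_window: Optional[Tuple[float, float]] = None  # earliest/latest execution time, if constrained

# Example: an imaging objective that consumes power and onboard storage
toc = TaskOrientedCommand(goal_id="IMG-042", priority=2,
                          resource_demand={"power_Wh": 35.0, "storage_MB": 512.0},
                          expected_benefit=8.5, time_window=(0.0, 3600.0))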

3.2. Spacecraft Autonomous and Task Planning Framework

The task planning framework in this study is based on the intelligent flight software architecture for spacecraft proposed by Lyu [57]. This architecture incorporates the Spacecraft Onboard Interface Services (SOIS), selecting services according to the needs of the intelligent capabilities. The entire framework is segmented into the subnet layer, application support layer, and application layer, with the task planning module positioned at the upper echelon of the application layer. This study focuses on the design of task planning capabilities at the higher level of the application layer of the framework.
The autonomous task planning capabilities of the spacecraft comprise three main services: decision-making, planning, and scheduling. The relationship between these services and the overall architecture is depicted in Figure 1. The decision-making service involves the spacecraft generating and formulating task objectives based on the current environment and status, outputting several TOCs. The planning service is responsible for determining the execution sequence of various TOCs and generating a TOC execution sequence. The scheduling service entails decomposing each TOC into specific actions and commands executable by the spacecraft, ensuring that each TOC’s implementation effectively reaches the intended target state. Detailed descriptions of these services and their inputs and outputs are provided in Table 2. In summary, the task planning capabilities generate appropriate TOCs based on the spacecraft’s environment and status and plan their execution sequence. Subsequently, during implementation, these TOCs are broken down into concrete, executable commands, enabling the spacecraft to autonomously generate action commands in response to the current environment, thus fulfilling the requirements of ASSE.
This study aims to design a task planning service that optimizes the execution sequence of TOCs based on the onboard environment of the spacecraft, thereby enhancing the efficiency of task execution and the computational speed of the algorithm.

3.3. Mathematical Description of Spacecraft Autonomous Mission Planning Problem

This study explores the sequence planning problem for TOCs, with the objective of maximizing scientific exploration benefits within the shortest possible operational duration, thereby optimizing the cost–benefit ratio. The spacecraft is required to methodically execute each mission objective until all known targets are completed. This problem is a variant of the traveling salesman problem (TSP), a combinatorial optimization challenge that seeks the shortest route by which a salesman departs from a city, visits each other city exactly once, and returns to the starting city, such that the total path length (or cost) is minimized [58]. In this research, the total cost is defined in terms of the cost–benefit ratio, with additional consideration given to resources and environmental factors during the completion of each task.
Table 3 provides a comprehensive list of the symbols and their definitions used throughout the model.
The cost–benefit ratio c_ij can be expressed as
c_{ij} = \frac{V_{ij} + \sum_{n} e_{ijn}}{\sum_{n} \left( r_{in} + v_{in} t_{ij} \right)} \quad (1)
The mathematical model established in this study is as follows:
\max Z = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} c_{ij} x_{ijk} \quad (2)
subject to
\sum_{j=1}^{n} \sum_{k=1}^{n} x_{ijk} = 1, \quad \forall i \in T \quad (3)
\sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \left( r_{in} - e_{ijn} \right) x_{ijk} \le R_{\mathrm{total}}^{n}, \quad \forall n \quad (4)
\sum_{i=1}^{j-1} x_{ijk} \le 1, \quad \forall k, \; \forall j \in T \quad (5)
u_{i} - u_{j} + n x_{ijk} \le n - 1, \quad \forall i, j : 2 \le i \ne j \le n \quad (6)
\sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} x_{ijk} \ge P_{\min}, \quad \forall k \quad (7)
x_{ijk} \le S_{ij}, \quad \forall i, j \in T, \; \forall k \quad (8)
\sum_{j=1}^{n} x_{0jk} = 1, \quad \forall k \quad (9)
\sum_{i=1}^{n} x_{i,n+1,k} = 1, \quad \forall k \quad (10)
\sum_{i=1}^{n} \sum_{j=1}^{n} \left( r_{in} - e_{ijn} \right) x_{ijk} \ge 0, \quad \forall n, \; \forall k \quad (11)
\sum_{k=1}^{n} x_{ijk} \le W_{ijn}, \quad \forall i, j \in T, \; \forall n \quad (12)
t_{j} - t_{i} - \sum_{k=1}^{n} M \left( 1 - x_{ijk} \right) \le \tau_{\max}^{ij}, \quad \forall i, j \in T \quad (13)
\sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} v_{in} t_{ij} x_{ijk} \le R_{\mathrm{total}}^{n}, \quad \forall n \quad (14)
V_{ij} \le V_{\max}^{ij}, \quad \forall i, j \in T \quad (15)
The objective function of the model is given in Equation (2), with the primary aim of maximizing the total cost–benefit ratio during scientific exploration. Constraint set (3) ensures that each task i is scheduled only once within the entire sequence of tasks. Constraint set (4) guarantees that, for each resource n, the consumption of resources during task execution, minus any potential replenishment, does not exceed the total capacity of that resource. Constraint set (5) defines the order of task execution to form a closed loop, preventing temporal conflicts between tasks. This is achieved by assigning a specific position u_i to each task, thereby preventing the formation of multiple independent sub-cycles. Constraint set (6) is used to preclude the occurrence of sub-cycles in the solution, ensuring a complete operational loop rather than multiple fragmented cycles. Constraint set (7) ensures that the cost–benefit ratio of any executed sequence of tasks does not fall below a predefined minimum threshold. Constraint set (8) takes into account the need to avoid hazardous areas or maintain safe distances when executing tasks in deep space environments, and assesses the feasibility of moving from task i to task j. Constraint sets (9) and (10) ensure that tasks start with a specified initial task and conclude with a designated final task. Constraint set (11) guarantees that the residual amount of resources never falls below zero at any point during task execution. Constraint set (12) specifies that certain tasks may only utilize specific resources within designated time frames. Constraint set (13) stipulates the maximum time interval for task execution, requiring that the interval between specific tasks does not exceed the set maximum time limit to avoid missing exploration opportunities. Constraint set (14) notes that, for some resources, consumption may be linked to the duration of task execution, necessitating consideration of the consumption rate. Constraint set (15) stipulates that the anticipated benefit of each task does not exceed the maximum limit.
Based on the description provided, this mathematical model can be constructed to plan the TOC execution in deep space. This model stipulates that the execution of TOCs consumes relevant resources and yields corresponding benefits. The sequence in which tasks are executed impacts both the amount of resources consumed and the benefits realized. Consequently, the model incorporates an objective function designed to maximize the cost–benefit ratio, thereby facilitating the optimization of the TOC execution sequence for spacecraft operations in deep space.
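To make the objective concrete, the following Python sketch evaluates Equation (1) for one task transition and accumulates the simplified total of Equation (2) along a single candidate TOC sequence. It ignores the constraint sets (3)–(15) and uses toy data, so it is a reading aid under those stated assumptions rather than the solver used in this study.

import numpy as np

def cost_benefit(V_ij, e_ij, r_i, v_i, t_ij):
    """Equation (1): (benefit plus replenishment) over (resources consumed).

    V_ij : scalar benefit of executing task j after task i
    e_ij : per-resource replenishment gained on the transition (array over n)
    r_i  : per-resource consumption of task i (array over n)
    v_i  : per-resource time-dependent consumption rate (array over n)
    t_ij : transition/execution time from task i to task j
    """
    return (V_ij + e_ij.sum()) / (r_i + v_i * t_ij).sum()

def sequence_score(order, V, E, R, Vrate, T):
    """Sum c_ij over consecutive task pairs of an execution order (simplified Equation (2))."""
    return sum(cost_benefit(V[i, j], E[i, j], R[i], Vrate[i], T[i, j])
               for i, j in zip(order[:-1], order[1:]))

# Toy instance: 4 tasks, 2 resource types
rng = np.random.default_rng(0)
n_tasks, n_res = 4, 2
V = rng.uniform(1.0, 5.0, (n_tasks, n_tasks))         # benefits V_ij
T = rng.uniform(0.1, 1.0, (n_tasks, n_tasks))         # transition times t_ij
E = rng.uniform(0.0, 0.5, (n_tasks, n_tasks, n_res))  # replenishment e_ijn
R = rng.uniform(0.5, 2.0, (n_tasks, n_res))           # consumption r_in
Vrate = rng.uniform(0.1, 0.5, (n_tasks, n_res))       # consumption rates v_in
print(sequence_score([0, 2, 1, 3], V, E, R, Vrate, T))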

4. Methodology

4.1. Overview

The architecture of the hyper-heuristic autonomous task planning algorithm based on the actor–critic model is depicted in Figure 2. The algorithm is structured into two levels: a higher level and a lower level. The lower level comprises multiple meta-heuristic algorithms employed as core operators. These meta-heuristics are inspired by real-life phenomena and are designed to meet the operational needs of spacecraft in the uncertain environment of deep space, with the four algorithms providing complementary features. The higher-level algorithm is based on reinforcement learning. During operation, the higher level first selects an appropriate lower-level operator based on the current environmental conditions and TOC parameters. The selected lower-level operator then executes multiple times, altering the current state of the environment. The higher-level algorithm subsequently selects an operator based on the updated environmental state, repeating this cycle until a predetermined number of iterations is completed. Once the reinforcement learning model is fully trained, the spacecraft can, in theory, select the most suitable operator based on real-time environmental data, thereby reaching optimized solutions more swiftly. This design offers advantages in adaptability and flexibility over traditional single meta-heuristic algorithms. The high-level reinforcement learning strategy employs a policy-based actor–critic approach, utilizing deep neural networks to construct the actor and critic networks, which enhances the adaptability of operator selection and broadens the range of environments the planner can handle.
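The control flow described above can be summarized by the schematic Python loop below. The RandomPolicy stub stands in for the actor–critic policy of Section 4.3, and the operators are assumed to be callables wrapping the four low-level heuristics, so the interfaces and the episode structure are illustrative assumptions rather than the implementation evaluated in this paper.

import random

class RandomPolicy:
    """Placeholder for the high-level 'intelligent computation expert' (illustrative only)."""
    def select_operator(self, state):
        return random.randrange(4)      # a trained actor network would score the four operators here
    def observe(self, state):
        pass                            # a critic would evaluate the new state during training

def hyper_heuristic_plan(state, policy, operators, outer_steps=20, inner_iters=10):
    """Schematic two-level loop: the policy picks a low-level operator, the operator
    refines the current solution population, and the cycle repeats."""
    for _ in range(outer_steps):
        op = operators[policy.select_operator(state)]   # high-level choice among PSO / GWO / SCA / DE
        state = op(state, inner_iters)                  # low-level search modifies the environment state
        policy.observe(state)                           # feedback used to train the high-level policy
    return state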

4.2. Low-Level Heuristic Algorithm Selection and Design

(1)
Model mapping method
The TSP is intrinsically a discrete optimization problem, whereas the heuristic algorithms used in this paper operate on continuous-valued solutions. Within the framework of these heuristic algorithms, each computational instance involves data that effectively constitute a matrix. Let X denote the solution matrix, where X_i (for i = 1, 2, ..., m) represents the row vector of the i-th solution, and each column corresponds to a specific TOC, formally represented as
X = [X_1; X_2; \ldots; X_m] \quad (16)
Here, X_i = (x_i1, x_i2, ..., x_in) is the vector for the i-th solution. In heuristic algorithms, x_ij denotes the optimization value of the j-th task in solution i. However, in the context of the TSP, x_ij does not represent a specific numerical value. In this study, we interpret each solution through the relative magnitudes of its values, conceptualizing the entire solution as a sequence ordered from the largest to the smallest x_ij. The resolution strategy involves sorting all task optimization values, assigning a sequential factor to each to denote its position in the sequence, restoring the tasks to their pre-sorted order, and finally replacing the original task values by their sequential factors, which completes the construction of the solution.
The rationality of such mappings can be elucidated through mathematical principles.
Define S(x) as the operation of sorting vector x, arranging the elements x_i in ascending order based on their values, thereby generating a new sequence x':
x' = S(x) \quad (17)
During this process, each x_i is mapped to its respective position post-sorting. The sorting operation relies on comparisons among elements, and the sorting map S(x) ensures that the relative size relationships among the vector's elements are preserved after the mapping, satisfying transitivity (if x_a < x_b and x_b < x_c, then x_a < x_c). This guarantees the consistency and uniqueness of the sort. The process leverages sorting as an intermediary step, thus ensuring that the mapping from a continuous space to a discrete ordinal space is both consistent and effective.
Furthermore, this type of mapping must also possess uniqueness. Since the mapping is based on the relative sizes of elements in the original sequence, any difference in the original sequences ensures that at least one element will be indexed differently after mapping. Consequently, different original sequences will map to distinct sequences.
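A compact way to realize this continuous-to-permutation mapping is the ranked-value decoding sketched below; the use of NumPy's argsort and the convention that larger keys are executed earlier (matching the largest-to-smallest ordering described above) are the only assumptions.

import numpy as np

def decode_solution(keys):
    """Map a continuous row vector X_i to a task execution order and rank factors.

    Sorting preserves the relative size relationships among elements (the map S(x)),
    so distinct key vectors decode to well-defined permutations of the TOCs."""
    order = np.argsort(-keys)                    # task indices from largest to smallest key
    ranks = np.empty(len(keys), dtype=int)
    ranks[order] = np.arange(1, len(keys) + 1)   # sequential factor: position of each task in the sequence
    return order, ranks

# Example: continuous keys for 5 TOCs
keys = np.array([0.4, 0.9, 0.1, 0.7, 0.5])
order, ranks = decode_solution(keys)   # order = [1, 3, 4, 0, 2], ranks = [4, 1, 5, 2, 3]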
(2)
Low-level heuristic operator selection
The unit of high-level selection is an entire low-level algorithm, so the overall optimization quality of the hyper-heuristic depends on which low-level algorithms are available. In the deep space environment, where conditions are uncertain, selecting operators with different emphases is crucial for improving the adaptability and flexibility of the algorithm. At the same time, to ensure the convergence of the model, a balanced selection of algorithms is essential. Therefore, this research uses the following three dimensions to select pertinent operators:
1. Global Search Capability:
Global search capability refers to an optimization algorithm’s ability to extensively explore the entire search space. This capability enables the algorithm to thoroughly probe the search space, thereby preventing it from merely settling into local optima and, ultimately, facilitating the discovery of global optima. In mathematical models, global search is often achieved by introducing randomness and diversity [59].
2. Quality of Solution Optimization:
A high-quality solution is not just a locally optimal solution but rather the best or near-best solution within the context of the optimization problem. An effective optimization algorithm should be capable of providing sufficiently high-quality solutions [60]. The quality of solutions is evaluated through a fitness function, which should be designed to differentiate between solutions of varying qualities and guide the algorithm towards developing higher-quality solutions.
3. Convergence Speed:
Convergence speed refers to the number of iterations or time required for an algorithm to meet its stopping criteria. Algorithms that converge quickly can find satisfactory solutions more rapidly, which directly impacts the efficiency of the optimization process [61]. Rapid convergence is demonstrated by the algorithm’s ability to quickly reduce the solution space and swiftly adjust solutions towards the optimal direction.
Consequently, this research selects four algorithms, each endowed with specific characteristics, to ensure that the chosen set exhibits both flexibility and excellent adaptability. The four algorithms selected for this study are particle swarm optimization (PSO), the grey wolf optimizer (GWO), the sine cosine algorithm (SCA), and the differential evolution (DE) algorithm. The foundational principles, features, and corresponding characteristics of these algorithms are detailed in Table 4.
Among the four algorithms in the table, the particle swarm optimization (PSO) algorithm adjusts its position based on both its own historical best position and the global best position of the swarm. This information-sharing mechanism enhances the algorithm’s global search capability. Each particle explores a wide area in the search space and, through communication with other particles, avoids getting trapped in local optima, thus improving the global search ability of the algorithm. The grey wolf optimizer (GWO) algorithm simulates the hunting behavior of grey wolves by tracking the prey’s position and adjusting according to the prey’s dynamics, allowing the algorithm to quickly converge to the optimal solution. The sine cosine algorithm (SCA) is based on the periodic characteristics of sine and cosine functions, which allows it to extensively explore the search space in the early stages of the algorithm, enhancing global search capability. The differential evolution (DE) algorithm generates new candidate solutions through differential operations and combines multiple solutions to create new ones, thus avoiding premature convergence to local optima and enhancing its global search capability. Through this global exploration mechanism, DE is able to find the optimal solution in complex optimization problems, achieving a high quality of solution optimization.
In the following sections, these four algorithms will be described in detail.
1. Particle Swarm Optimization (PSO)
Particle swarm optimization (PSO) is a swarm intelligence-based optimization technique, initially proposed by Kennedy and Eberhart [62]. Inspired by the social behaviors of bird flocks, it simulates the foraging process of birds, enabling particles (i.e., solutions) within the algorithm to seek the optimal solution based on both individual and collective experiences in the solution space. The movement of particles is guided by both their historical best positions and the global best position, aiming to enhance global search efficiency through collaborative efforts.
PSO involves four key quantities: particle position, particle velocity, individual best position, and global best position. Within an n-dimensional search space, the position of a particle is denoted as X_i = (x_i1, x_i2, ..., x_in), which corresponds to a potential task execution sequence. The velocity of a particle dictates the direction and magnitude of its movement in the search space and is represented as V_i = (v_i1, v_i2, ..., v_in). The individual best position (pbest) refers to the optimal location each particle has identified during the search, formulated as P_i = (p_i1, p_i2, ..., p_in). The global best position (gbest) represents the optimal position discovered by the entire swarm during the search process, expressed as G = (g_1, g_2, ..., g_n).
The velocity and position of the particles are updated according to the following formula:
V_{id}^{\mathrm{new}} = w V_{id} + c_{1} r_{1} \left( P_{id} - X_{id} \right) + c_{2} r_{2} \left( G_{d} - X_{id} \right) \quad (18)
where V_id is the velocity of particle i in dimension d, w is the inertia weight, c_1 and c_2 are learning factors, and r_1 and r_2 are random numbers between 0 and 1.
The formula for updating the position is as follows:
X_{id}^{\mathrm{new}} = X_{id} + V_{id}^{\mathrm{new}} \quad (19)
where X_id is the position of particle i in dimension d.
The PSO algorithm first initializes the positions and velocities of the particle swarm. Once the algorithm commences, it calculates the fitness value of each particle and then updates the individual and global best positions. Thereafter, the velocity and position of each particle are adjusted according to Formulas (18) and (19). This process is repeated until the stopping criteria are met. With the fitness function F(X_i) given by Equation (2), the pseudo-code of the algorithm is shown in Algorithm 1.
Algorithm 1: Particle Swarm Optimization (PSO) with Cost–Benefit Ratio
Initialize particle positions X_i and velocities V_i based on tasks T
while termination criteria not met do
    for each particle i do
        Calculate fitness F(X_i) using the cost–benefit ratio c_ij
        if F(X_i) is better than F(P_i) then
            Update P_i with X_i
        end if
        if F(X_i) is better than F(G) then
            Update global best G with X_i
        end if
        Update V_i and X_i based on P_i and G
    end for
end while
return the global best solution G
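As a concrete reading of Equations (18) and (19), the sketch below performs one vectorized PSO update in Python; the parameter values for w, c1, and c2 are common defaults assumed here for illustration, not the settings tuned in this study.

import numpy as np

def pso_step(X, V, P, G, w=0.7, c1=1.5, c2=1.5, rng=np.random.default_rng()):
    """One particle swarm update applying Equations (18) and (19) to all particles.

    X : (m, n) particle positions (continuous keys, one row per candidate TOC sequence)
    V : (m, n) particle velocities
    P : (m, n) personal best positions
    G : (n,)   global best position
    """
    r1 = rng.random(X.shape)
    r2 = rng.random(X.shape)
    V_new = w * V + c1 * r1 * (P - X) + c2 * r2 * (G - X)   # Equation (18)
    X_new = X + V_new                                       # Equation (19)
    return X_new, V_new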
2. Grey Wolf Optimizer (GWO)
The grey wolf optimizer (GWO) was proposed by Mirjalili [63], inspired by the social hierarchy and hunting behaviors of grey wolves. The algorithm emulates the strategies of tracking, encircling, and capturing prey employed by wolf packs during the search process. It divides the wolf pack into leaders (alpha, beta, and delta wolves) and followers, with the leaders guiding and the followers updating their positions, thereby facilitating effective global and local searches to find optimal solutions.
The core of the GWO is to emulate the hunting mechanisms of wolf packs. The algorithm designates three lead wolves, identified as α, β, and δ. Initially, the distances between the pack and the prey are calculated as follows:
D_{\alpha} = \left| C \cdot X_{\alpha} - X_{i} \right|, \quad D_{\beta} = \left| C \cdot X_{\beta} - X_{i} \right|, \quad D_{\delta} = \left| C \cdot X_{\delta} - X_{i} \right| \quad (20)
where C is a coefficient vector determined by the formula C = 2 \cdot \mathrm{rand}, with rand generating a vector of random numbers, each within the interval [0, 1].
Subsequent to this, the position update of the “wolf pack” is performed using Formula (21):
X_{i}^{\mathrm{new}} = \left( X_{\alpha} - A \cdot D_{\alpha} \right) + \left( X_{\beta} - A \cdot D_{\beta} \right) + \left( X_{\delta} - A \cdot D_{\delta} \right) \quad (21)
where Formula (21) models the pack's behavior of encircling and hunting the prey. Here, A is computed using A = 2a \cdot \mathrm{rand} - a, with a being a parameter that decreases linearly from 2 to 0.
Finally, the position update is completed using Formula (22):
X_{i} = \frac{X_{i}^{\mathrm{new}}}{3} \quad (22)
In this formula, X_i represents the updated position of wolf i, and X_i^new is the newly calculated intermediate value.
The GWO algorithm starts by evaluating fitness, then simulates the hunting process of the wolf pack, updates the positions of the wolves and the global optimum solution, and repeats these steps until the termination conditions are met. With the fitness function F(X_i) given by Equation (2), the pseudo-code of the algorithm is shown in Algorithm 2.
Algorithm 2: Grey Wolf Optimizer (GWO) with Cost–Benefit Ratio
Initialize wolf positions X_i based on tasks T
Identify alpha, beta, and delta wolves based on F(X_i)
while termination criteria not met do
    for each wolf i do
        Update the position X_i towards alpha, beta, and delta using D_α, D_β, D_δ
        Calculate fitness F(X_i) using the cost–benefit ratio c_ij
    end for
    Update alpha, beta, and delta positions based on the best F(X_i) values
end while
return the position of the alpha wolf
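A minimal NumPy rendering of the update in Equations (20)–(22) is given below; handling all wolves in one vectorized call and passing the linearly decaying parameter a from the caller are implementation assumptions made for illustration.

import numpy as np

def gwo_step(X, X_alpha, X_beta, X_delta, a, rng=np.random.default_rng()):
    """One grey wolf position update for the whole pack (Equations (20)-(22)).

    X                        : (m, n) current wolf positions
    X_alpha, X_beta, X_delta : (n,) positions of the three lead wolves
    a                        : scalar that decreases linearly from 2 to 0 over the run
    """
    def pull_towards(leader):
        C = 2.0 * rng.random(X.shape)            # coefficient vector C = 2 * rand
        A = 2.0 * a * rng.random(X.shape) - a    # coefficient vector A = 2a * rand - a
        D = np.abs(C * leader - X)               # distance to the leader, Equation (20)
        return leader - A * D                    # candidate position guided by this leader
    # Equations (21)-(22): sum the three leader-guided candidates and divide by 3
    return (pull_towards(X_alpha) + pull_towards(X_beta) + pull_towards(X_delta)) / 3.0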
3. Sine Cosine Algorithm (SCA)
The sine cosine algorithm (SCA) was developed by Mirjalili [64], utilizing the mathematical sine and cosine functions to update the positions of solutions. By dynamically adjusting the search direction and step size, the algorithm strikes a balance between global exploration and local search. This method is particularly well-suited for solving complex multimodal optimization problems, as it flexibly adjusts the search paths through the sine and cosine rules, thus avoiding local optima.
In the SCA algorithm, the position of each solution is updated in every iteration according to Formulas (23) and (24):
X_{i}^{\mathrm{new}} = X_{i} + r_{1} \cdot \sin(r_{2}) \cdot \left| r_{3} P - X_{i} \right| \quad (23)
X_{i}^{\mathrm{new}} = X_{i} + r_{1} \cdot \cos(r_{2}) \cdot \left| r_{3} P - X_{i} \right| \quad (24)
In these formulas, X_i represents the current position of the solution, and X_i^new is the position of the solution after it has been updated. The parameters r_1, r_2, and r_3 are randomly generated to adjust the search trajectory of the solution: r_1 controls the step size, r_2 determines whether a sine or cosine function is used for the update, and r_3 dictates the direction of the search. P denotes the position of the optimal solution in the current iteration. With the fitness function F(X_i) given by Equation (2), the pseudo-code of the algorithm is shown in Algorithm 3.
Algorithm 3: Sine Cosine Algorithm (SCA) with Cost–Benefit Ratio
Initialize solutions X_i for all tasks T
Calculate fitness F(X_i) for all X_i using c_ij
Identify the best solution X_best
while not converged do
    for each solution X_i do
        for each dimension d do
            Generate r_1, r_2, r_3 randomly
            if rand() < 0.5 then
                X_i,d^new = X_i,d + r_1 * sin(r_2) * |r_3 * X_best,d - X_i,d|
            else
                X_i,d^new = X_i,d + r_1 * cos(r_2) * |r_3 * X_best,d - X_i,d|
            end if
        end for
        Update X_i if X_i^new improves the fitness
    end for
    Update X_best if better solutions are found
end while
return X_best
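The following sketch applies Equations (23) and (24) element-wise with NumPy; the linearly decaying step-size parameter r1 and the random per-element choice between the sine and cosine branches follow common SCA practice and are stated here as assumptions rather than taken from the paper.

import numpy as np

def sca_step(X, X_best, t, t_max, r1_max=2.0, rng=np.random.default_rng()):
    """One sine cosine update of all solutions towards the current best (Equations (23)-(24))."""
    r1 = r1_max * (1.0 - t / t_max)              # step-size control, assumed to decay linearly
    r2 = 2.0 * np.pi * rng.random(X.shape)       # argument of the sine/cosine term
    r3 = 2.0 * rng.random(X.shape)               # random weight on the destination X_best
    use_sin = rng.random(X.shape) < 0.5          # per-element choice between the two branches
    step = r1 * np.where(use_sin, np.sin(r2), np.cos(r2)) * np.abs(r3 * X_best - X)
    return X + step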
4. Differential Evolution Algorithm (DE)
The differential evolution (DE) algorithm is a global optimization algorithm mainly used to solve optimization problems on continuous parameter spaces. Its principle is based on simple but effective genotype mutation, crossover, and selection operations on individuals in a population to explore the solution space and find the optimal solution [65].
In the mutation step, three distinct individuals a, b, and c are randomly selected from the current population and used to generate a new candidate solution v_i. The mutation vector v_i is given by
v_{i} = a + F \cdot \left( b - c \right) \quad (25)
where F is a positive scaling factor, usually between 0.5 and 1.0. This factor controls the strength of the perturbation applied to the solution vector a by the difference vector (b - c).
In the crossover step, the algorithm combines the mutation vector v_i and the target individual x_i to generate the trial vector u_i. For each dimension j, the j-th component of the trial vector u_i is determined as follows:
u_{i,j} = \begin{cases} v_{i,j}, & \text{if } \mathrm{rand}_{j} \le CR \text{ or } j = \mathrm{rand}(1, D) \\ x_{i,j}, & \text{otherwise} \end{cases} \quad (26)
where CR is the crossover probability, which determines the acceptance probability of the mutation vector component in each dimension; rand_j is a random number uniformly distributed in the range [0, 1]; and rand(1, D) ensures that at least one dimension is taken from v_i to introduce new genetic information, where D is the dimension of the problem.
In the selection step, the fitness of the target individual x_i is directly compared with that of the trial individual u_i:
x_{i}^{\mathrm{new}} = \begin{cases} u_{i}, & \text{if } f(u_{i}) \ge f(x_{i}) \\ x_{i}, & \text{otherwise} \end{cases} \quad (27)
If the fitness of the trial vector u_i is better than (or equal to) that of the current individual x_i (here, a higher value, since the objective in Equation (2) is maximized), then u_i replaces x_i in the next-generation population. With the fitness function F(X_i) given by Equation (2), the pseudo-code of the algorithm is shown in Algorithm 4.
Algorithm 4: Differential Evolution Algorithm (DE)
Initialize population vectors X_{i,g} for i = 1 to NP
Evaluate the fitness F(X_{i,g}) of each individual X_{i,g}
Identify the best individual X_best
while not converged do
    for each individual X_i in the population do
        Select random individuals a, b, c from the population, with a ≠ b ≠ c ≠ i
        Generate the donor vector V_{i,g+1} = X_{a,g} + F · (X_{b,g} − X_{c,g})
        Initialize the trial vector U_{i,g+1} as an empty vector
        for each dimension j do
            if rand(j) ≤ CR or j = rand(1, D) then
                U_{i,g+1,j} = V_{i,g+1,j}
            else
                U_{i,g+1,j} = X_{i,g,j}
            end if
        end for
        Evaluate the fitness F(U_{i,g+1})
        if F(U_{i,g+1}) ≤ F(X_{i,g}) then
            X_{i,g+1} = U_{i,g+1}
        else
            X_{i,g+1} = X_{i,g}
        end if
    end for
    Update X_best if a better solution is found
end while
return X_best
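For reference, Algorithm 4 can be realized in a few lines of NumPy, as in the sketch below. The population size, control parameters F and CR, iteration budget, and box bounds are illustrative assumptions rather than the settings used in this paper, and the fitness function is again treated as a black-box cost to be minimized.

import numpy as np

def differential_evolution(fitness, D, NP=30, F=0.7, CR=0.9, max_gen=200, bounds=(0.0, 100.0)):
    """Minimal DE/rand/1/bin loop mirroring Algorithm 4 (minimization)."""
    lo, hi = bounds
    X = np.random.uniform(lo, hi, size=(NP, D))            # initial population
    fit = np.array([fitness(x) for x in X])

    for _ in range(max_gen):
        for i in range(NP):
            # Mutation: pick three distinct individuals a, b, c, all different from i.
            a, b, c = np.random.choice([j for j in range(NP) if j != i], 3, replace=False)
            v = X[a] + F * (X[b] - X[c])
            # Binomial crossover: inherit at least one component from v.
            j_rand = np.random.randint(D)
            mask = np.random.rand(D) <= CR
            mask[j_rand] = True
            u = np.clip(np.where(mask, v, X[i]), lo, hi)
            # Selection: greedy replacement of the target vector.
            fu = fitness(u)
            if fu <= fit[i]:
                X[i], fit[i] = u, fu

    best = int(np.argmin(fit))
    return X[best], fit[best]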

4.3. High-Level Algorithm Based on Actor–Critic Reinforcement Learning

(1)
Overview
Because the lower-level operators start from a variety of uncertain initial states, a “smart computing expert” is needed at the higher level to assess the current environment and select an appropriate algorithm for that environment and state. Once an algorithm is chosen, it is executed with the configured parameters to activate the corresponding lower-level operator; this execution in turn modifies the environment and state, prompting the selection of a suitable operator under the new conditions, and so on. The effectiveness of the entire algorithmic process therefore depends critically on the “smart computing expert’s” ability to select and apply the appropriate algorithms accurately.
During the operator selection phase, it is essential to consider relevant design methodologies from reinforcement learning, such as methods for describing states, definitions of reward functions, criteria for action definition and selection, network architecture design, and other training and strategy designs.
Overall, the algorithm describes the spacecraft’s relevant resources and attributes as state features, utilizing the deep neural network to enhance the model’s parameterization and adaptability to complex and diverse environments. Moreover, action selection is based on the ε-Greedy Strategy, and dynamic reward functions are defined across four dimensions: global search capabilities, solution optimization quality, algorithm convergence speed, and types of applicable problems. This approach aims to differentiate reward functions in the early and late phases of training, thereby enhancing the model’s training effectiveness.
(2)
State
The design of the state significantly affects the computational accuracy of hyper-heuristic algorithms. In this study, we regard several solutions optimized by heuristic algorithms as part of the state, which are integrated with other elements (such as resource consumption, cost, and current resources) to form a comprehensive state representation matrix. According to Equation (16), the current solution matrix is denoted as X , thereby defining the state space S as follows:
$$S = \{X, R, C, P, W\}$$
Here, the resource information $R$ includes all details pertaining to the resources required for task execution $r_i^n$, the total available resources $R_{total}^n$, and the resources accrued after completing tasks $e_{ij}^n$. The cost and benefit information $C$ encompasses the cost–benefit ratio $c_{ij}$ from task $i$ to task $j$ and the benefit $V_{ij}$ obtained after completing task $j$. $P$ represents the threshold for the minimum cost–benefit ratio $P_{min}$, and $W$ indicates whether time-related constraints are present.
Given the variations in dimensions and shapes of matrices formed by these constraints, methods such as normalization, padding, and feature extraction are necessary to process these matrices. This allows the extracted features to be input as a cohesive state into the model. Such processed state inputs enable the trained model to adapt to diverse tasks and constraints effectively.
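A minimal sketch of this normalization-and-padding step is given below. The fixed target shape, the channel layout (one channel per state component), and min–max normalization are assumptions made for illustration, since the paper does not specify these details.

import numpy as np

def build_state(X, R, C, P, W, target_shape=(5, 64, 64)):
    """Pad and normalize the heterogeneous matrices (X, R, C, P, W) into a single
    fixed-size tensor so that the high-level networks receive a uniform input."""
    _, H, W_max = target_shape
    channels = []
    for M in (X, R, C, np.atleast_2d(P), np.atleast_2d(W)):
        M = np.asarray(M, dtype=np.float32)
        span = M.max() - M.min()
        M = (M - M.min()) / span if span > 0 else np.zeros_like(M)   # min-max normalization
        padded = np.zeros((H, W_max), dtype=np.float32)              # zero-pad (or crop)
        h, w = min(M.shape[0], H), min(M.shape[1], W_max)
        padded[:h, :w] = M[:h, :w]
        channels.append(padded)
    return np.stack(channels)                                        # shape (5, H, W)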
(3)
Action selection
In this paper, the reinforcement learning algorithm is defined with an action set $A = \{a_1, a_2, a_3, a_4\}$. The action set comprises the four heuristic algorithm operators. Based on the current environmental conditions, the algorithm selects the appropriate operator (i.e., heuristic algorithm) and executes the corresponding code according to pre-set parameters.
During model training, the early action selection strategy has a significant impact on the performance of the algorithm since the initial probability distribution is random or directly specified. For instance, if high probability actions are consistently chosen early on, the algorithm may overly rely on these known optimal actions, thereby causing the model to converge on local optima as it struggles to explore more advantageous strategies.
The action selection strategy employed in this study is the ε -Greedy Strategy, which effectively addresses the trade-off between exploration and exploitation. The basic strategy is as follows:
$$a_t = \begin{cases} \text{random}(A), & \text{if } \text{rand}() < \epsilon \\ \arg\max_a Q(s_t, a), & \text{otherwise} \end{cases}$$
where $a_t$ is the action chosen at time $t$. The algorithm defines an exploration rate $\epsilon$; if a uniform random number satisfies $\text{rand}() < \epsilon$, the exploration mechanism is activated, otherwise the action estimated to yield the highest expected reward is greedily selected, facilitating exploitation.
Regarding the adjustment of the exploration rate, given the instability of the initial probability distribution, a higher exploration rate is warranted initially. As training progresses to ensure stability in the decision-making process, the exploration rate should gradually decrease. The adjustment formula for the exploration rate is
$$\epsilon_t = \epsilon_{min} + (\epsilon_{max} - \epsilon_{min}) \times e^{-\lambda t}$$
Here, it is necessary to define the initial exploration rate ϵ m a x , the minimum exploration rate ϵ m i n , and the decay rate λ . ϵ t represents the exploration rate at time t . Through this method, the exploration rate starts at a higher value and gradually decreases over time to a lower value.
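The exploration schedule and ε-greedy selection can be sketched as follows; the values of ε_max, ε_min, and λ are placeholders rather than the settings used in this study.

import numpy as np

def epsilon_at(t, eps_max=0.9, eps_min=0.05, lam=1e-3):
    """Exponentially decaying exploration rate: eps_t = eps_min + (eps_max - eps_min) * exp(-lam * t)."""
    return eps_min + (eps_max - eps_min) * np.exp(-lam * t)

def select_action(q_values, t):
    """Epsilon-greedy choice over the four low-level operators."""
    if np.random.rand() < epsilon_at(t):
        return np.random.randint(len(q_values))   # explore: random operator
    return int(np.argmax(q_values))               # exploit: best estimated operator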
(4)
Actor–Critic network structure design
The actor–critic (AC) method is a sophisticated reinforcement learning algorithm that amalgamates the advantages of policy gradient methods with those of value function optimization techniques [66]. In the upper layer of our study’s algorithm, we have adopted this method as the primary reinforcement learning framework.
In this approach, there are two interconnected network structures: the actor network (Figure 3) and the critic network (Figure 4). The actor network (A) receives the state of the environment as input and outputs the probabilities of selecting each possible action. Its primary objective is to select actions based on the current policy, with the aim of maximizing expected rewards through policy learning. The critic network (C) also inputs the state of the environment but outputs an estimate of the current state’s value, assessing the value of states or state–action pairs by learning the value function of actions [67].
Initially, the actor network acts within the environment according to the current policy. Subsequently, the critic network evaluates the effectiveness of this action and computes the temporal difference (TD) error. The actor network then updates its strategy based on feedback from the critic network. Simultaneously, the critic network updates its estimate of the value function based on the TD error. The pseudo-code for the algorithm is provided in Algorithm 5.
Algorithm 5: Actor-Critic Based Metaheuristic Algorithm
Initialize policy network parameters θ_π and value network parameters θ_v
Initialize environment and state s
for each episode do
    Reset environment and observe initial state s
    while not done do
        Select action a according to the policy π(a|s, θ_π)
        Execute action a in the environment
        Observe reward r and new state s′
        Compute advantage estimate A(s, a) = r + γ V(s′, θ_v) − V(s, θ_v)
        Update policy: θ_π ← θ_π + α ∇_{θ_π} log π(a|s, θ_π) · A(s, a)
        Update value: θ_v ← θ_v + β (r + γ V(s′, θ_v) − V(s, θ_v)) ∇_{θ_v} V(s, θ_v)
        s ← s′
    end while
    if end of evaluation period then
        Evaluate the policy
    end if
end for
For the actor network, the action probability distribution is defined as follows:
$$\pi(a \mid S; \theta_\pi) = (1 - \epsilon_t)\, \mathrm{softmax}\big(f_\pi(S; \theta_\pi)\big) + \epsilon_t \frac{1}{|A|}$$
Here, a represents the action, S is the state, and θ π are the parameters of the actor network, with f π denoting its function. This formula delineates the probability of selecting action a in state S . Initially, f π computes scores for all actions, which are subsequently converted into a probability distribution using the softmax function, ensuring that the sum of all action probabilities equals one and that the selection probability is positively correlated with the scores. Moreover, the exploration rate ϵ t ensures a degree of random exploration.
In this study, we designed the architecture of the actor network, which is tailored in our setting to output four operations (operators) from a complex input matrix (the state $S$). The entire network consists of two convolutional layers, two pooling layers, and two fully connected layers, culminating in a Softmax output. The network architecture can be written as
$$f_\pi(S; \theta_\pi) = \mathrm{Softmax}\Big(FC_2\big(\mathrm{Activation}\big(FC_1\big(\mathrm{Pool}_2\big(\mathrm{Conv}_2\big(\mathrm{Pool}_1\big(\mathrm{Conv}_1(S)\big)\big)\big)\big)\big)\big)\Big)$$
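A possible PyTorch realization of $f_\pi$ with this Conv–Pool–Conv–Pool–FC–FC–Softmax layout is sketched below; the channel counts, kernel sizes, hidden width, and input resolution are illustrative assumptions, since the paper does not report them.

import torch
import torch.nn as nn

class ActorNet(nn.Module):
    """Maps the state tensor to selection probabilities over the four operators."""
    def __init__(self, in_channels=5, n_actions=4, input_hw=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),  # Conv1
            nn.MaxPool2d(2),                                                  # Pool1
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),           # Conv2
            nn.MaxPool2d(2),                                                  # Pool2
        )
        flat = 32 * (input_hw // 4) * (input_hw // 4)
        self.fc1 = nn.Linear(flat, 128)            # FC1 + activation
        self.fc2 = nn.Linear(128, n_actions)       # FC2

    def forward(self, s):                          # s: (batch, channels, H, W)
        x = self.features(s).flatten(1)
        x = torch.relu(self.fc1(x))
        return torch.softmax(self.fc2(x), dim=-1)  # Softmax output: action probabilities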
The loss function for the actor network is defined in two parts. The first part consists of the expected negative log probability multiplied by the action’s advantage function A s , a , aimed at guiding the policy to enhance reward values. The second part involves the entropy of the policy to encourage exploration of new actions, defined as
$$L(\theta_\pi) = -\mathbb{E}_{S \sim \rho^\pi,\, a \sim \pi}\big[\log \pi(a \mid S; \theta_\pi)\, A(S, a)\big] - \beta H\big(\pi(S; \theta_\pi)\big)$$
The network parameters $\theta_\pi$ are updated by descending the gradient of this loss (equivalently, ascending the gradient of the expected return). Here, $\beta$ is the coefficient of the entropy regularization term, and $H(\pi(S; \theta_\pi))$ represents the entropy of the policy; high entropy implies greater randomness in action selection. The advantage function $A(S, a)$ assesses how much better an action is than average in a given state $S$ and is defined as
$$A(S, a) = r + \gamma V(S') - V(S)$$
Here, $r$ is the immediate reward obtained after executing action $a$, $\gamma$ is the discount rate for future rewards, and $V(S')$ is the estimated value function for the next state $S'$. This methodology allows the advantage function to be approximated through the value function estimated by the critic network, simplifying the computation.
For the critic network, the value function is defined as
$$V(S; \theta_v) = f_v(S; \theta_v)$$
where θ v are the parameters of the critic network, and f v is a function composed of two convolutional layers and three fully connected layers. This simplified network version is chosen due to the simplicity of the output from the policy function.
The loss function of the critic network calculates the error between the network’s value estimation and the actual rewards:
$$L(\theta_v) = \mathbb{E}_{s \sim \rho^\pi}\Big[\big(V(s; \theta_v) - R_t\big)^2\Big]$$
where R t is the actual return starting from state s . The network parameters θ v are updated through gradient descent.
Based on the described methods, an adaptable actor–critic network can be constructed, providing ample justification for action selection.
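Putting the pieces together, one actor–critic update step (advantage estimate, policy loss with entropy regularization, and squared-TD-error value loss) can be sketched in PyTorch as follows; the discount factor, entropy coefficient, and the use of the TD target as the return estimate $R_t$ are illustrative assumptions.

import torch

def ac_update(actor, critic, opt_actor, opt_critic, s, a, r, s_next, gamma=0.99, beta=0.01):
    """One update of the actor and critic networks from a batch of transitions.
    s, s_next: state tensors; a: action indices; r: rewards (all batched)."""
    # TD target and advantage A(s, a) = r + γ V(s') − V(s), treated as constants.
    with torch.no_grad():
        td_target = r + gamma * critic(s_next).squeeze(-1)
        advantage = td_target - critic(s).squeeze(-1)

    # Actor: negative log-probability weighted by the advantage, minus an entropy bonus.
    dist = torch.distributions.Categorical(actor(s))
    actor_loss = -(dist.log_prob(a) * advantage).mean() - beta * dist.entropy().mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # Critic: regress V(s) toward the TD target (squared TD error).
    critic_loss = (critic(s).squeeze(-1) - td_target).pow(2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    return actor_loss.item(), critic_loss.item()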
(5)
Optimization Metrics
Before defining the reward function, it is essential to clarify the key performance metrics that the algorithm aims to improve, as the design of the reward function in high-level reinforcement learning is closely tied to these metrics. In this study, the defined optimization metrics are directly proportional to the level of improvement achieved; thus, higher improvements correspond to higher reward values. Consequently, the goal of the algorithm is to enhance these metrics by obtaining high reward values.
This research establishes four critical metrics: global search capability, quality of solution optimization, algorithm convergence speed, and applicability to problem types. For each metric, both stepwise and overall rewards must be considered within the reward function.
Global search index Δ : The Global Search Index is a metric used to quantify the diversity of solutions in heuristic algorithms by evaluating the ratio of distinct solutions generated during a specified iteration period to the theoretical maximum number of possible solutions. This metric reflects the algorithm’s ability to explore the search space globally, indicating how well it maintains diversity throughout the search process. The mathematical expression is
$$\Delta_{global} = \frac{N_{unique}}{I \times S}$$
where N unique   denotes the number of distinct solutions, I represents the number of iterations, and S indicates the size of the solution space.
Quality of solution optimization ( η ): This metric assesses the improvement of a solution compared to the initial state. The stepwise reward is defined below:
$$\eta_{quality} = 1 - \frac{P(X)}{P_{prev}(X)}$$
where P(X) is the path length of the current solution X . The overall reward is calculated based on the specific improvement over the initial solution.
$$\eta = P(X_{before}) - P(X_{after})$$
Weighted Convergence Speed ($\tau$): This metric quantifies the average number of steps required for the algorithm to converge to the current optimal solution. It measures the speed and stability of the algorithm by assigning a weight to each fitness improvement.
$$\tau = \frac{\sum_{i=1}^{I} w_i \, \Delta f_i}{I}$$
Here, $w_i$ is the weight of the $i$-th fitness improvement, set according to the actual situation; $\Delta f_i$ is the fitness improvement after the $i$-th iteration; and $I$ is the total number of iterations required to reach the highest fitness value.
With this metric, an algorithm that reaches a high fitness level quickly in the early stages and then remains unchanged receives a higher $\tau$ value, because early improvements carry greater weight. Additionally, if the algorithm is still making small improvements as it approaches its final fitness, these improvements are factored into the overall evaluation, although they do not significantly affect the metric.
Algorithm Composite Evaluation Index ($\xi$): This metric aims to quantify the overall performance of an algorithm. In principle, the algorithm that achieves the best fitness is the superior one. In aerospace applications, however, real-time performance and optimization efficiency must also be considered: the algorithm should reach the best possible result in as short a time as possible while ensuring adequate coverage of the solution space and still achieving a good fitness level. Combining these needs, this study defines the Algorithm Composite Evaluation Index, which aggregates the above three metrics to comprehensively evaluate an algorithm's performance in aerospace-related scenarios.
The index is used to compare the combined performance of the above three metrics across multiple algorithms on the same instance, so the index must be computed for all algorithms simultaneously for the values to be comparable. A higher index indicates that the algorithm performed relatively better on that instance (i.e., converged faster and produced better results). The index is calculated as follows:
$$\xi = w_1 \frac{\tau}{\tau_{max}} + w_2 \left(e^{-0.01\,(\eta_{max} - \eta)}\right)^{\frac{1}{3}} + w_3\, \Delta^{\frac{1}{3}}$$
In the above formula, $\tau_{max}$ and $\eta_{max}$ refer to the best values of the corresponding metrics achieved by any algorithm in the same period, and the remaining terms serve to normalize the metrics. According to the importance ranking of the three metrics, this study sets $w_1 : w_2 : w_3 = 3 : 5 : 2$; that is, the algorithm's fitness is considered most important, followed by the speed of convergence, and lastly the extent of the algorithm's coverage of the solution space. The comprehensive evaluation of the algorithms in the remainder of this paper is based on this index.
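The three base metrics can be computed directly from the search history, as in the sketch below; the data structures (a list of solution vectors and per-iteration fitness improvements) are illustrative assumptions.

import numpy as np

def global_search_index(solutions, iterations, solution_space_size):
    """Δ_global = N_unique / (I × S): fraction of distinct solutions explored."""
    n_unique = len({tuple(s) for s in solutions})
    return n_unique / (iterations * solution_space_size)

def quality_improvement(p_before, p_after):
    """η = P(X_before) − P(X_after): overall reduction in the solution cost."""
    return p_before - p_after

def weighted_convergence_speed(fitness_improvements, weights):
    """τ = Σ_i w_i Δf_i / I, with larger weights on early improvements."""
    f = np.asarray(fitness_improvements, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float((w * f).sum() / len(f))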
(6)
Reward
In this study, the designed algorithm aims to meet the flexibility requirements in diverse environments, thus imposing stringent demands on maintaining consistent performance under various conditions. The selection of the four underlying algorithms must possess significant advantages to ensure the efficiency and effectiveness of the overarching meta-heuristic algorithm. Initially, the goal is to surpass the performance of individual algorithms through optimization based on a relative evaluation of higher-level decisions. As training progresses, the model theoretically should select the optimal solution that exceeds the performance of each standalone algorithm, rendering simple relative evaluations inadequate. Hence, it is necessary to assess the merits and demerits of algorithms and decisions from an absolute perspective to guide the training direction.
Consequently, the reward function in this research is divided into two main parts: the absolute factor R absolute   and the relative factor R relative   . As training time increases and training effects improve, the proportion of the absolute factor gradually increases, while that of the relative factor correspondingly decreases. The expression is given by
$$R_t = \frac{1}{1 + e^{-k(t - t_0)}}\, R_{absolute}(t) + \left(1 - \frac{1}{1 + e^{-k(t - t_0)}}\right) R_{relative}$$
where $k$ controls how quickly the weighting shifts and $t_0$ is the point in time at which the weights of the absolute and relative factors are equal.
In the reward function, the metrics are the four optimization indicators previously mentioned, calculated differently depending on whether the perspective is absolute or relative. For the absolute factor, the focus is on the absolute values of the optimization metrics and the changes before and after algorithm implementation. The formula is as follows:
$$R_{absolute}(t) = \sum_{i} w_i \sum_{k=0}^{T-t} \gamma^{k}\, \tilde{R}_i(t+k)$$
Here, $\tilde{R}_i$ combines the normalized step reward $R_{step,i}(t)$ and global reward $R_{global,i}$ at time $t$ for metric $i$, with $w_i$ being the weight of metric $i$ and $\sum_i w_i = 1$. $T$ is the total number of iterations, and $\gamma$ is the discount factor used to adjust the weight of future rewards. The set of metrics is $i \in \{\Delta, \eta, \tau, \xi\}$.
For the relative factor, it is only necessary to compare the rankings of the meta-heuristic algorithm against the other four standalone heuristic algorithms. The higher the ranking on a given metric, the greater the reward obtained. To encourage higher rankings, it is stipulated that rewards increase exponentially with rank, as specified in the relative reward $R_{relative}$:
$$R_{relative} = r \sum_{i \in \{\Delta, \eta, \tau, \xi\}} w_i\, e^{\,N + 1 - rank_i}$$
Here, $rank_i$ is the ranking position of the algorithm on metric $i$, $N$ is the total number of algorithms, and $r$ is the raw reward value obtained through the optimization indicators associated with the absolute factor. Each metric $i$ has an assigned weight and ranking.
After integrating both absolute and relative factors, the training of the algorithm focuses on surpassing the performance of individual underlying algorithms and striving for better solutions, thereby enhancing the overall performance while optimizing the four specified metrics.
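A sketch of this time-dependent blending and of the rank-based relative reward is given below; k, t_0, and the raw reward r are placeholder values.

import numpy as np

def blended_reward(t, r_absolute, r_relative, k=0.05, t0=200):
    """R_t = σ(k(t − t0)) · R_absolute + (1 − σ(k(t − t0))) · R_relative,
    so the absolute factor gradually dominates as training progresses."""
    sigma = 1.0 / (1.0 + np.exp(-k * (t - t0)))
    return sigma * r_absolute + (1.0 - sigma) * r_relative

def relative_reward(ranks, weights, n_algorithms, r=1.0):
    """R_relative = r · Σ_i w_i · e^(N + 1 − rank_i): better ranks earn exponentially more."""
    return r * sum(w * np.exp(n_algorithms + 1 - rk) for w, rk in zip(weights, ranks))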
(7)
Training strategy
Building upon the aforementioned description, this study further incorporates the following training strategies. Firstly, a dynamic learning rate adjustment is utilized. Within the optimizer, the learning rate, η , is updated after each epoch via a scheduler according to the formula
$$\eta_{t+1} = \eta_t \cdot \gamma^{\lambda}$$
Here, η t + 1 represents the learning rate for the upcoming epoch, η t is the learning rate of the current epoch, γ is a decay rate (typically less than 1), and λ controls the rate of decrease in the learning rate. By progressively reducing the learning rate, the algorithm can converge more effectively, while also minimizing parameter fluctuations and overfitting in the later stages of training. An exponential decay strategy is employed to adjust the learning rate.
Secondly, for the actor and critic networks, the Adam optimizer is employed:
$$\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
where $\theta$ denotes the network parameters, $\eta$ is the learning rate, $\hat{m}_t$ is the bias-corrected estimate of the first-order moment, $\hat{v}_t$ is the bias-corrected estimate of the second-order moment, and $\epsilon$ is a small constant added to ensure numerical stability. The Adam optimizer is utilized to adjust the model parameters, allowing the learning rate to be adapted individually for each parameter.
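In PyTorch, this setup corresponds roughly to the following configuration; the initial learning rate and decay factor are illustrative, and torch's ExponentialLR implements the special case $\eta_{t+1} = \eta_t \cdot \gamma$ of the decay formula above.

import torch

actor = ActorNet()                                   # actor network sketched in Section 4.3
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3, eps=1e-8)
scheduler = torch.optim.lr_scheduler.ExponentialLR(opt_actor, gamma=0.95)

for epoch in range(100):
    # ... run training episodes and call the actor-critic update here ...
    scheduler.step()                                 # decay the learning rate after each epoch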

5. Experiment

To validate the effectiveness and performance of the proposed algorithm in real-world environments, this study has designed several sets of experiments. Initially, the validity and training process of the model are analyzed by examining changes in the reward function values and anticipated model changes through the training procedure. Subsequently, by integrating the four operators at the base of the hyper-heuristic algorithm, the effectiveness of the higher-level algorithm in selecting these operators is assessed. Finally, by comparing with other heuristic algorithms and common algorithms, the comprehensive effectiveness of this algorithm on relevant evaluation metrics against other operators and algorithms is verified.

5.1. Parameter and Environment Settings

Due to the unique nature of this problem, there are no standard datasets available for testing. Therefore, the datasets used in this paper consist of task instances randomly generated according to the actual engineering requirements. We define two parameters for the spacecraft mission objectives with value ranges of (0,100), and the spacecraft’s initial position and state are set as default values. Subsequently, based on the task sequence, the spacecraft calculates and generates an optimized command sequence. The relevant parameters for the instances are defined in the following Table 5.
In the proposed hyper-heuristic algorithm, the determination of relevant parameters influences the algorithm’s performance. Based on multiple previous tests and taking into account the experience from related research, we have identified the relevant parameters for the high-level aspects of the hyper-heuristic algorithm and the parameters for the lower-level operators. These parameters are shown in the Table 6.
In this study, all algorithms were coded using PyTorch 2.0.0 and Python 3.9.18 and implemented on a personal computer with an Intel(R) Core(TM) Ultra 5 125H processor (Intel Corporation, Santa Clara, CA, USA) running at 1.2 GHz with 32 GB RAM for training and principle testing. During the training process, CUDA and an NVIDIA GeForce RTX 4060 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA) were used for computational acceleration.

5.2. Model Training

In this study, the training scenario is as follows: the task parameters are defined based on a TOC command structure, where each task (TOC) represents a motion planning action for a gimbal system, characterized by two key parameters: Azimuth (horizontal direction, ranging from 0 to 180 degrees) and Elevation (vertical direction, ranging from 0 to 180 degrees). Task instances are generated according to the rules in Table 4, producing between 30 and 70 TOCs, with parameters such as time window, resource consumption, and resource replenishment randomly set within reasonable ranges. For example, the time window is set to 0 or 1 to simulate task timing constraints, and resource consumption is uniformly distributed to ensure that the total consumption does not exceed the resource limit. The environment simulation assumes that both the spacecraft and target positions are randomly distributed within a two-dimensional space, ranging from (0,0) to (100,100).
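An instance generator matching these rules might look like the following sketch; the field names and the resource-cost distribution are assumptions made for illustration.

import numpy as np

def generate_instance(rng=np.random.default_rng()):
    """Randomly generate one training instance of 30-70 gimbal-pointing TOCs."""
    n_tasks = int(rng.integers(30, 71))
    tasks = {
        "azimuth":       rng.uniform(0, 180, n_tasks),   # horizontal angle, degrees
        "elevation":     rng.uniform(0, 180, n_tasks),   # vertical angle, degrees
        "time_window":   rng.integers(0, 2, n_tasks),    # 0/1 timing-constraint flag
        "resource_cost": rng.uniform(1, 10, n_tasks),    # uniformly distributed consumption
    }
    spacecraft_pos = rng.uniform(0, 100, size=2)         # positions inside (0,0)-(100,100)
    target_pos = rng.uniform(0, 100, size=(n_tasks, 2))
    return tasks, spacecraft_pos, target_pos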
The training process of the model described in this study was based on CUDA and utilized the aforementioned GPU acceleration. During training, the algorithm’s reward function was defined in accordance with the earlier discussion on rewards, and continuous monitoring and recording were conducted. The study carried out 20 training sessions, each with a sufficient number of iterations. The trend graphs of the average, upper, and lower bounds of the reward function during these trainings are shown in Figure 5.
Here, the red line represents the change in the reward function for one of the training sessions, the blue line represents the average change in the reward function, and the blue shaded band represents a sliding window of the highest and lowest reward values over the multiple training sessions described above. The value on the red line is the specific reward obtained when the algorithm completes one planning run at the current number of iterations.
As shown in the figure, the reward score is low in the initial phase of training and gradually increases as training deepens. At the same time, the hyper-heuristic algorithm gradually outperforms each of the four underlying heuristics used alone and obtains higher scores. The reward values fluctuate slightly because the reward function shifts during the training iterations from rewarding performance relative to the four individual algorithms toward rewarding absolute (global) performance, although the values ultimately remain within a stable interval. Overall, the figure shows that the reward value gradually improves, demonstrating that the reinforcement learning component effectively improves the evaluation results on the relevant indicators and confirming the effectiveness of the algorithm.
Meanwhile, the operator-selection scheme over the four underlying algorithms was recorded during each training run. Figure 6 shows stacked area plots of the proportion of times each operator was selected over the iteration cycle, both before and after model training.
In the figure, the first two panels show the distribution of the model's action selections before training: the choices are essentially random, and each action is selected with roughly equal frequency. The last eight panels show four runs of the trained model, in which the choice of operators is clearly differentiated according to the problem. For example, in Figure 6c,d the model tends to use the grey wolf optimization and particle swarm algorithms to speed up the iterations at the beginning of the computation, whereas in Figure 6e–h it tends to use the differential evolution algorithm to iterate quickly. In Figure 6i,j, the model uses GWO and DE to speed up the iterations in the early stages but switches to PSO a number of times in the later stages to try to obtain more solutions. This illustrates the algorithm's ability to select suitable operators and achieve better results when faced with different particle states.

5.3. Comparative Experiments with AC-HATP and Operators

To validate the optimization performance of the algorithm, this section of the experiment first compares the algorithm implemented in this study with four underlying operators against relevant optimization metrics, to demonstrate the degree of optimization in the actual results by high-level selection, as well as the extent to which the advantages and disadvantages of several algorithms are combined.
Firstly, we compare the optimal fitness achieved by each algorithm. In Table 7 below, each instance corresponds to a set of tasks, and the number of tasks per instance can be read from the table. Each set of instances underwent 20 experiments, with the results indicating the final fitness value of each experiment. We report the average, minimum, and standard deviation. The minimum value shows the lowest fitness the algorithm can achieve, reflecting its best-case performance, although it does not represent the overall level of the algorithm. The average value reflects the overall level of the algorithm. The standard deviation indicates the stability of the algorithm; a smaller standard deviation means that the solutions obtained by the algorithm are more consistent.
By examining the table, it can be observed that, in terms of final fitness, the GWO and DE algorithms perform similarly to AC-HATP, all achieving good convergence results. PSO and SCA, however, show slightly inferior performance in terms of fitness. Additionally, in most instances, AC-HATP performs slightly better than DE, with specific instances showing AC-HATP’s optimal fitness performance significantly stronger than both GWO and DE (for example, Task Case 13 and Task Case 16). However, there are instances wherein DE’s optimal fitness surpasses the algorithm presented in this paper (such as Task Case 6). Also, the standard deviation of this paper’s algorithm is slightly larger than that of DE, suggesting the potential to achieve superior solutions in some cases. This result confirms that AC-HATP can address the issue wherein GWO, despite its fast convergence speed, may only find local optima in certain cases and can also harness the advantages of the differential evolution algorithm to achieve superior solutions. That is, AC-HATP can integrate additional features from other algorithms to achieve overall superior fitness performance.
Next, a comparison of the convergence times of the underlying operators and the algorithm of this paper is presented. In Table 8 below, each set of instances underwent 20 experiments, with the results reflecting the convergence time defined in the aforementioned metrics. We report the average, minimum, and standard deviation. The minimum value allows for an observation of the smallest values achieved by the algorithm, the average value reflects the overall average convergence time of the algorithm, and the standard deviation indicates the stability of the algorithm, with a smaller standard deviation implying greater stability.
By observing the table, it can be found that PSO generally struggles to achieve faster convergence speeds, while DE and AC-HATP can achieve better convergence speeds, significantly outperforming other algorithms. Additionally, the overall convergence speed of AC-HATP is superior to both DE and GWO. SCA has slower convergence speeds, although it can also achieve relatively fast convergences in some cases. Overall, the convergence speed of the algorithm designed in this study is superior to that of the other operators.
We have selected Task Case 8 from several test runs to plot the convergence curves of the algorithm, as shown in Figure 7.
Combining the above figures, it can be seen that the algorithm designed in this study achieves a fast convergence speed in the initial stages, and the final convergence results are maintained at a good level. Considering the practical engineering requirements of aerospace, the algorithm needs to obtain superior solutions in a short period of time, which demonstrates that the algorithm designed in this study can support practical applications.
Next, a comparison of the diversity of solutions between the underlying operators and the algorithm of this study is conducted. In Table 9 below, each set of instances underwent 20 experiments, with the results reflecting the diversity index of solutions (ranging from 0 to 1) as defined in the previous metrics. We report the average values in Table 9.
By observing the table, it can be seen that PSO and SCA generally achieve a sufficient number of solutions, while GWO and DE obtain a limited number of solutions. This is also why these two algorithms are prone to falling into local optima. AC-HATP also manages to obtain a considerable number of solutions, but overall fewer than PSO and SCA.
From the above comparisons, the following conclusions can be drawn. The AC-HATP algorithm designed in this study achieves fitness levels almost as good as DE, and its diversity of solutions is higher than that of DE, making it less likely to fall into local optima and more capable of obtaining superior solutions in complex and diverse scenarios. Additionally, AC-HATP achieves excellent convergence times, converging in a shorter period. Therefore, AC-HATP combines the strengths of the four underlying operators and, overall, obtains superior solutions in a shorter time while maintaining diversity. These experiments also fully demonstrate the effectiveness of the high-level reinforcement learning algorithm in operator selection.

5.4. Comparative Experiments with Other Algorithms

To enhance the credibility of this algorithm and its overall level under various conditions, this study selected two heuristic algorithms, two meta-heuristic algorithms, and two reinforcement learning-based hyper-heuristic algorithms [68,69] to compare with the algorithm presented in this paper.
The comparison of optimal fitness is shown in Table 10. Each instance corresponds to a set of tasks, and the number of tasks for each instance can be read from the table. Each set of instances underwent 20 experiments, with the results representing the final fitness value of each experiment. We report the average, minimum, and standard deviation in the table below.
Based on the table above, it can be observed that, for the instances of this problem, GA and WDO struggle to achieve excellent solutions. SA shows some advantage in solving small-scale problems, but this advantage diminishes for larger-scale solutions. Although TSA has relatively strong stability, its fitness function results are poor and do not demonstrate its advantages in this problem. At the same time, this paper uses hyper-heuristic algorithms DMAB and SLMAB for comparative experiments, and the results prove that these two algorithms can also achieve good fitness. However, relatively speaking, SLMAB has a larger standard deviation, and both DMAB and SLMAB show some disparities in performance on large-scale problems, though these disparities are not significant. The experimental results also show that the AC-HATP proposed in this paper can achieve effects similar to typical hyper-heuristic algorithms.
Next, a comparison of algorithm convergence time between these comparison algorithms and the algorithm of this paper is conducted. In Table 11 below, each set of instances underwent 20 experiments, with the results representing the convergence time as defined in the aforementioned metrics. We report the average, minimum, and standard deviation in the table below.
From the comprehensive analysis of the tables and data, it is evident that GA has a fast algorithm convergence speed, but its overall standard deviation is large, indicating that the algorithm’s convergence is not stable. The TSA algorithm has a generally slow convergence speed and a small variance, suggesting weaker downward convergence capabilities. SA, DMAB, and SLMAB have sufficient convergence speeds, although slightly lower than that of the algorithm discussed in this paper. The experimental results prove that the algorithm of this study maintains good performance in terms of convergence speed, and the variance is within an acceptable range, indicating strong stability of the algorithm.
Finally, based on the Algorithm Composite Evaluation Index defined earlier, this study calculates the index for the algorithms mentioned above. Table 12 shows the index for each algorithm across 16 instances, where each index value is the average of 20 experiments.
Based on the table above, it can be concluded that, from a comprehensive perspective, the algorithm discussed in this paper achieved good scores most of the time, with the highest scores in 13 out of 16 instances. The DE algorithm also showed good advantages, but its overall score was slightly lower than that of the algorithm in this study due to the lower diversity index of its solutions. The other algorithms had slightly lower overall scores. Therefore, the experimental results prove that the algorithm proposed in this study achieves better results in the Algorithm Composite Evaluation Index, thereby exhibiting better adaptability in diverse environments.

6. Applications

The algorithm proposed in this study requires engineering deployment based on practical conditions when addressing different scientific problems and mission objectives.
Before the algorithm can be applied, an engineering requirements analysis must first be conducted to clearly define the scientific mission objectives and expected performance metrics. This process begins by defining the relevant scientific tasks and designing corresponding TOCs. The final scientific mission objectives are then represented using mathematical methods. Next, corresponding payload, telemetry parameters, and engineering parameters must be designed, along with the related telecommand commands. Finally, the expected performance metrics, such as the minimum data collection volume and maximum resource consumption, should be specified. Subsequently, data preparation and standardization must be completed, which includes designing the TOCs and associated constraints, defining resource constraints, and designing the reinforcement learning reward function as well as the parameters and weights in the neural network.
Based on the prepared data, and in accordance with the results of the requirements analysis and data preparation, an experimental environment is created on the ground, including telemetry parameters, mission-level objective commands (TOCs), and reward functions. The model is then trained and the neural network parameters are optimized through offline reinforcement learning and self-training on ground-based equipment. If the test results meet the expected performance metrics defined in the requirements analysis, the trained model parameters and related code can be deployed onto the spacecraft for offline application.
To verify the usability and deployment of the algorithm in actual engineering projects, this study designs relevant scenarios based on the improved Space and Ground Cooperative Management Control System (SCMCS) [70], as shown in Figure 8, constructs mission-level instructions, and validates the application of the algorithm in actual engineering projects, thereby continuously building the intelligence capabilities of spacecraft.
This system is primarily used to meet the comprehensive management and control requirements of spacecraft. Within the system, onboard simulation equipment and the payload manager are connected via the 1553B bus. The payload manager obtains data related to digital payloads through an RS422 interface and controls the execution of related operations by the digital payloads.
In this architecture, as described in the cited literature on TOCs, the conversion from TOCs to primitive-level commands has already been implemented. This study focuses on optimizing the execution sequence of the TOCs. After the spacecraft has acquired several targets, the order in which these targets are executed affects the observational efficiency of the spacecraft. Faced with multiple TOCs, the algorithm designed in this study produces a more appropriate execution order; the subsequent decomposition of the TOCs is then carried out to achieve autonomous mission planning for the spacecraft.
Based on Table 13, this study sets the spacecraft’s gimbal path planning as the TOCs, and sets two parameters: azimuth and elevation angles. All planned mission-level commands are based on the TOC format shown in the table below.
Based on the instance generation rules described above, we have set and generated 30 targets for the spacecraft. According to the mission objectives, we input these targets into the spacecraft via data injection, and the algorithm is executed based on the relevant parameters discussed previously. The parameters of one such execution of the algorithm are shown in Table 14.
The convergence curves and visualization of the results of the above runs are shown in Figure 9 below.
The set of figures presented in this work consists of twelve plots, organized into four groups labeled from a to l. Each group includes three plots: the first represents the convergence curve, the second shows the distribution of the initial task-level instruction set, and the third illustrates the sequence of task-level instructions at the conclusion. In the second and third plots of each group, the x and y axes correspond to the values of two parameters of the task-level instructions. The convergence curve reflects the cost–benefit ratio, demonstrating that the algorithm is able to significantly reduce the execution resource cost–benefit ratio between tasks. This illustrates the algorithm’s effectiveness in optimizing task sequences and improving resource efficiency.
Based on the presented figures and tables, the algorithm demonstrates its efficiency in obtaining high-quality solutions within an acceptable time frame, with execution times consistently ranging from 19.8 to 20.7 s across various instances. The convergence time also stabilizes quickly, ranging from 2.5 to 4.8 s, reflecting the algorithm’s ability to reach a solution within a reasonable duration. Notably, the fitness values show substantial improvements, decreasing from over 2000 in the initial state to between 619 and 820 in the final results, indicating successful optimization. The solution diversity index remains stable across instances, ranging from 0.2993 to 0.3177, suggesting that the algorithm effectively explores diverse solution spaces without compromising efficiency. Furthermore, the memory usage, ranging from 8963 KB to 9284 KB, remains within reasonable limits, supporting the algorithm’s feasibility for typical mission constraints. These results highlight the algorithm’s capability to optimize task planning, runtime, and resource usage effectively, making it suitable for practical deployment in space missions.

7. Conclusions and Future Outlook

This study, based on the actual needs of adaptive scientific exploration, designed a scheduling strategy for spacecraft mission-level command execution based on the concept of spacecraft TOCs. The algorithm effectively enhances the autonomy of the spacecraft and achieves good solutions in various adaptive environments, meeting the operational needs of spacecraft in deep space.
In this study, we designed relevant experiments to verify the degree of optimization, effectiveness, and robustness of the algorithm. The experiments prove that the algorithm performs better overall compared to related independent operators and achieves good performance in a variety of complex environmental conditions. This indicates that this study has significant implications for supporting the continuous development of spacecraft autonomy.
The optimization algorithm discussed in this study still faces certain challenges in practical applications that need to be addressed further. On the one hand, since reinforcement learning faces unpredictable environments and limited training opportunities in deep space missions, the models and methods in this study cannot support real-time learning and must rely on offline reinforcement learning. This leads to limited adaptability of policies to novel scenarios and potentially suboptimal decisions due to insufficient coverage of training data. On the other hand, as the number of TOCs increases, how to balance the algorithm’s planning time and memory usage with the degree of optimization remains to be further resolved and researched. Additionally, the autonomous generation of spacecraft TOCs is also a challenge that needs to be addressed, which is crucial for further enhancing the intelligence capabilities of spacecraft. This will be a key focus for future research in this area.

Author Contributions

Conceptualization, L.L.; Methodology, J.Z.; Software, J.Z.; Validation, J.Z.; Writing—original draft, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by: China's Beijing Science and Technology Program, cultivated by the Space Science Laboratory of Beijing Huairou Comprehensive National Science Center under grant Z201100003520006, and the Strategic Priority Research Program (Class A) of the Chinese Academy of Sciences—Space Science (Phase II): Space Science Program Overall under grant XDA15060000. The APC was funded by the National Space Science Center of CAS.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study and due to the requirements of the author’s institution. Requests to access the datasets should be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dvorak, D.D.; Ingham, M.D.; Morris, J.R.; Gersh, J.R. Goal-based operations: An overview. J. Aerosp. Comput. Inf. Commun. 2009, 6, 123–141. [Google Scholar] [CrossRef]
  2. Turky, A.; Sabar, N.R.; Dunstall, S.; Song, A. Hyper-heuristic local search for combinatorial optimization problems. Knowl.-Based Syst. 2020, 205, 106264. [Google Scholar] [CrossRef]
  3. Pillay, N.; Qu, R. Assessing hyper-heuristic performance. J. Oper. Res. Soc. 2021, 72, 2503–2516. [Google Scholar] [CrossRef]
  4. Asta, S.; Özcan, E.; Curtois, T. A tensor based hyper-heuristic for nurse rostering. Knowl.-Based Syst. 2016, 98, 185–199. [Google Scholar] [CrossRef]
  5. Pour, S.M.; Drake, J.H.; Burke, E.K. A choice function hyper-heuristic framework for the allocation of maintenance tasks in Danish railways. Comput. Oper. Res. 2018, 93, 15–26. [Google Scholar] [CrossRef]
  6. Choong, S.S.; Wong, L.P.; Lim, C.P. An artificial bee colony algorithm with a modified choice function for the traveling salesman problem. Swarm Evol. Comput. 2019, 44, 622–635. [Google Scholar] [CrossRef]
  7. Lamghari, A.; Dimitrakopoulos, R. Hyper-heuristic approaches for strategic mine planning under uncertainty. Comput. Oper. Res. 2020, 115, 104590. [Google Scholar] [CrossRef]
  8. Singh, E.; Pillay, N. A study of ant-based pheromone spaces for generation constructive hyper-heuristics. Swarm Evol. Comput. 2022, 72, 101095. [Google Scholar] [CrossRef]
  9. Hu, R.J.; Zhang, Y.L. Fast path planning for long-range planetary roving based on a hierarchical framework and deep reinforcement learning. Aerospace 2022, 9, 101. [Google Scholar] [CrossRef]
  10. Kallestad, J.; Hasibi, R.; Hemmati, A.; Sörensen, K. A general deep reinforcement learning hyperheuristic framework for solving combinatorial optimization problems. Eur. J. Oper. Res. 2023, 309, 446–468. [Google Scholar] [CrossRef]
  11. Qin, W.; Zhuang, Z.L.; Huang, Z.Z.; Huang, H. A novel reinforcement learning-based hyper-heuristic for heterogeneous vehicle routing problem. Comput. Ind. Eng. 2021, 156, 107252. [Google Scholar] [CrossRef]
  12. Panzer, M.; Bender, B.; Gronau, N. A deep reinforcement learning based hyper-heuristic for modular production control. Int. J. Prod. Res. 2024, 62, 2747–2768. [Google Scholar] [CrossRef]
  13. Tu, C.; Bai, R.; Aickelin, U.; Zhang, Y.; Du, H. A deep reinforcement learning hyper-heuristic with feature fusion for online packing problems. Expert Syst. Appl. 2023, 230, 120568. [Google Scholar] [CrossRef]
  14. Chen, K.W.; Bei, A.N.; Wang, Y.J.; Zhang, H. Modeling of imaging satellite mission planning based on PDDL. Ordnance Ind. Autom. 2018, 27, 41–44. [Google Scholar]
  15. Chen, A.X.; Jiang, Y.F.; Cai, X.L. Research on the Formal Representation of Planning Problem. Comput. Sci. 2008, 35, 105–110. [Google Scholar]
  16. Green, C. Theorem proving by resolution as a basis for question-answering systems. Mach. Intell. 1969, 4, 183–205. [Google Scholar]
  17. McCarthy, J. Situations, Actions, and Causal Laws; Comtex Scientific: New York, NY, USA, 1963; pp. 410–417. [Google Scholar]
  18. Fikes, R.E.; Nilsson, N.J. STRIPS: A new approach to the application of theorem proving to problem solving. Artif. Intell. 1971, 2, 189–208. [Google Scholar] [CrossRef]
  19. Ghallab, M.; Howe, A.; Knoblock, C.; McDermott, D.; Ram, A.; Veloso, M.; Weld, D.; Wilkins, D. PDDL—The Planning Domain Definition Language—Version 1.2; Technical Report CVC TR-98-003/DCS TR-1165; Yale Center for Computational Vision and Control, Yale University: New Haven, CT, USA, 1998. [Google Scholar]
  20. Fox, M.; Long, D. PDDL2.1: An extension to PDDL for expressing temporal planning domains. J. Artif. Intell. Res. 2003, 20, 61–124. [Google Scholar] [CrossRef]
  21. Edelkamp, S.; Hoffmann, J. PDDL2.2: The Language for the Classical Part of the Fourth International Planning Competition; Technical Report 195; Institut für Informatik, Albert-Ludwigs-Universität Freiburg: Freiburg, Germany, 2004. [Google Scholar]
  22. Gerevini, A.; Long, D. Plan Constraints and Preferences in PDDL3: The Language of the Fifth International Planning Competition; University of Brescia Italy: Brescia, Italy, 2005. [Google Scholar]
  23. Batusov, V.; Soutchanski, M. A logical semantics for PDDL+. Proc. Int. Conf. Autom. Plan. Sched. 2019, 29, 40–48. [Google Scholar] [CrossRef]
  24. Zhu, L.Y.; Ye, Z.L.; Li, Y.Q.; Fu, Z.; Xu, Y. Modeling of Autonomous Flight Mission Intelligent Planning for Small Body Exploration. J. Deep. Space Explor. 2019, 6, 463–469. [Google Scholar]
  25. Zemler, E.; Azimi, S.; Chang, K.; Morris, R.A.; Frank, J. Integrating task planning with robust execution for autonomous robotic manipulation in space. In Proceedings of the ICAPS Workshop on Planning and Robotics, Nancy, France, 19–30 October 2020. [Google Scholar]
  26. Li, X.; Li, C.G.; Guo, X.Y.; Zhi, Q. A Modeling Method for Inter-Satellite Transmission Tasks Planning in Collaborative Network based on PDDL. In Proceedings of the 2019 14th IEEE International Conference on Electronic Measurement & Instruments (ICEMI) 2019, Changsha, China, 1–3 November 2019; pp. 1460–1467. [Google Scholar]
  27. Ma, M.H.; Zhu, J.H.; Fan, Z.L.; Luo, X. A Model of Earth Observing Satellite Application Task Describing. J. Natl. Univ. Def. Technol. 2011, 33, 89–94. [Google Scholar]
  28. Chen, J.Y.; Zhang, C.; Li, Y.B. Multi-star cooperative task planning based on hyper-heuristic algorithm. J. China Acad. Electron. Inf. Technol. 2018, 13, 254–259. [Google Scholar]
  29. Xue, Z.J.; Yang, Z.; Li, J.; Zhao, B. Autonomous Mission Planning of Satellite for Emergency. Command. Control Simul. 2015, 37, 24–30. [Google Scholar]
  30. Xu, W.M. Autonomous Mission Planning Method and System Design of Deep Space Explorer. Master’s Thesis, Harbin Institute of Technology, Harbin, China, 2006. [Google Scholar]
  31. Wang, X.H. Study on Autonomous Mission Planning Technology for Deep Space Explorer Under Dynamic Uncertain Environment. Master’s Thesis, Nanjing University of Aeronautics and Astronautics, Nanjing, China, 2017. [Google Scholar]
  32. Chien, S.; Rabideau, G.; Knight, R.; Sherwood, R.; Engelhardt, B.; Mutz, D.; Estlin, T.; Smith, B.; Fisher, F.; Barrett, T.; et al. Aspen-automated planning and scheduling for space mission operations. In Proceedings of the Space Ops, Cape Town, South Africa, 18–22 May 2000; p. 82. [Google Scholar]
  33. Fratini, S.; Cesta, A. The APSI framework: A platform for timeline synthesis. In Proceedings of the Workshop on Planning and Scheduling with Timelines, Sao Paulo, Brazil, 25–29 June 2012; pp. 8–15. [Google Scholar]
  34. Johnston, M.D. Spike: Ai scheduling for nasa’s hubble space telescope. In Proceedings of the Sixth Conference on Artificial Intelligence for Applications, Santa Barbara, CA, USA, 5–9 May 1990; IEEE Computer Society: Los Alamitos, CA, USA, 1990; pp. 184–185. [Google Scholar]
  35. Jiang, X.; Xu, R.; Zhu, S.Y. Research on Task Planning Problems for Deep Space Exploration Based on Constraint Satisfaction. J. Deep. Space Explor. 2018, 5, 262–268. [Google Scholar]
  36. Du, J.W. Modeling mission planning for imaging satellite based on colored Petri nets. Comput. Appl. Softw. 2012, 29, 324–328. [Google Scholar]
  37. Liang, J.; Zhu, Y.H.; Luo, Y.Z.; Zhang, J.-C.; Zhu, H. A precedence-rule-based heuristic for satellite onboard activity planning. Acta Astronaut. 2021, 178, 757–772. [Google Scholar] [CrossRef]
  38. Bucchioni, G.; De Benedetti, M.; D’Onofrio, F.; Innocenti, M. Fully safe rendezvous strategy in cis-lunar space: Passive and active collision avoidance. J. Astronaut. Sci. 2022, 69, 1319–1346. [Google Scholar] [CrossRef]
  39. Muscettola, N. HSTS: Integrating Planning and Scheduling; The Robotics Institute, Carnegie Mellon University: Pittsburgh, PA, USA, 1993. [Google Scholar]
  40. Chang, Z.X.; Chen, Y.N.; Yang, W.Y.; Zhou, Z. Mission planning problem for optical video satellite imaging with variable image duration: A greedy algorithm based on heuristic knowledge. Adv. Space Res. 2020, 66, 2597–2609. [Google Scholar] [CrossRef]
  41. Zhao, Y.B.; Du, B.; Li, S. Agile satellite mission planning via task clustering and double-layer tabu algorithm. Comput. Model. Eng. Sci. 2020, 122, 235–257. [Google Scholar] [CrossRef]
  42. Jin, H.; Xu, R.; Cui, P.Y.; Zhu, S.; Jiang, H.; Zhou, F. Heuristic search via graphical structure in temporal interval-based planning for deep space exploration. Acta Astronaut. 2020, 166, 400–412. [Google Scholar] [CrossRef]
  43. Federici, L.; Zavoli, A.; Colasurdo, G. On the use of A* search for active debris removal mission planning. J. Space Saf. Eng. 2021, 8, 245–255. [Google Scholar] [CrossRef]
  44. Long, J.; Wu, S.; Han, X.; Wang, Y.; Liu, L. Autonomous task planning method for multi-satellite system based on a hybrid genetic algorithm. Aerospace 2023, 10, 70. [Google Scholar] [CrossRef]
  45. Xiao, P.; Ju, H.; Li, Q.; Xu, H. Task planning of space maintenance robot using modified clustering method. IEEE Access 2020, 8, 45618–45626. [Google Scholar] [CrossRef]
46. Wang, F.R. Research on Autonomous Mission Planning Method of Microsatellite Based on Improved Genetic Algorithm. Master's Thesis, Harbin Institute of Technology, Harbin, China, 2017.
47. Zhao, P.; Chen, Z.M. An adapted genetic algorithm applied to satellite autonomous task scheduling. Chin. Space Sci. Technol. 2016, 36, 47–54.
48. Feng, X.E.; Li, Y.Q.; Yang, C.; He, X.; Xu, Y.; Zhu, L. Structural design and autonomous mission planning method of deep space exploration spacecraft for autonomous operation. Control Theory Appl. 2019, 36, 2035–2041.
49. Harris, A.; Valade, T.; Teil, T.; Schaub, H. Generation of spacecraft operations procedures using deep reinforcement learning. J. Spacecr. Rocket. 2022, 59, 611–626.
50. Huang, Y.; Mu, Z.; Wu, S.; Cui, B.; Duan, Y. Revising the observation satellite scheduling problem based on deep reinforcement learning. Remote Sens. 2021, 13, 2377.
51. Wei, L.N.; Chen, Y.N.; Chen, M.; Chen, Y. Deep reinforcement learning and parameter transfer based approach for the multi-objective agile earth observation satellite scheduling problem. Appl. Soft Comput. 2021, 110, 107607.
52. Zhao, X.X.; Wang, Z.K.; Zheng, G.T. Two-phase neural combinatorial optimization with reinforcement learning for agile satellite scheduling. J. Aerosp. Inf. Syst. 2020, 17, 346–357.
53. Eddy, D.; Kochenderfer, M. Markov decision processes for multi-objective satellite task planning. In Proceedings of the 2020 IEEE Aerospace Conference, Big Sky, MT, USA, 7–14 March 2020; pp. 1–12.
54. Truszkowski, W.; Hallock, H.; Rouff, C.; Karlin, J.; Rash, J.; Hinchey, M.; Sterritt, R. Autonomous and Autonomic Systems: With Applications to NASA Intelligent Spacecraft Operations and Exploration Systems; Springer Science & Business Media: London, UK, 2009.
55. Maullo, M.J.; Calo, S.B. Policy management: An architecture and approach. In Proceedings of the 1993 IEEE 1st International Workshop on Systems Management, Los Angeles, CA, USA, 14–16 April 1993; pp. 13–26.
56. Zhang, J.W.; Lyu, L.Q. A Spacecraft Onboard Autonomous Task Scheduling Method Based on Hierarchical Task Network-Timeline. Aerospace 2024, 11, 350.
57. Lyu, L.Q. Design and Application Study of Intelligent Flight Software Architecture on Spacecraft. Ph.D. Thesis, University of Chinese Academy of Sciences (National Space Science Center of Chinese Academy of Sciences), Beijing, China, 2019.
58. Menger, K.; Dierker, E.; Sigmund, K.; Dawson, J.W. Ergebnisse eines Mathematischen Kolloquiums; Springer: Vienna, Austria, 1998.
59. Gai, W.D.; Qu, C.Z.; Liu, J.; Zhang, J. An improved grey wolf algorithm for global optimization. In Proceedings of the 2018 Chinese Control and Decision Conference (CCDC), Shenyang, China, 9–11 June 2018; pp. 2494–2498.
60. Floudas, C.A.; Gounaris, C.E. An overview of advances in global optimization during 2003–2008. Lect. Glob. Optim. 2009, 55, 105–154.
61. Lee, C.Y.; Zhuo, G.L. A hybrid whale optimization algorithm for global optimization. Mathematics 2021, 9, 1477.
62. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN'95-International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948.
63. Mirjalili, S.; Mirjalili, S.M.; Lewis, A. Grey wolf optimizer. Adv. Eng. Softw. 2014, 69, 46–61.
64. Mirjalili, S. SCA: A sine cosine algorithm for solving optimization problems. Knowl.-Based Syst. 2016, 96, 120–133.
65. Storn, R.; Price, K. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 1997, 11, 341–359.
66. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
67. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
68. DaCosta, L.; Fialho, A.; Schoenauer, M.; Sebag, M. Adaptive operator selection with dynamic multi-armed bandits. In Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, Atlanta, GA, USA, 12–16 July 2008; pp. 913–920.
69. Fialho, Á.; Da Costa, L.; Schoenauer, M.; Sebag, M. Analyzing bandit-based adaptive operator selection mechanisms. Ann. Math. Artif. Intell. 2010, 60, 25–64.
70. Lu, G.Y.; Lyu, L.Q.; Zhang, J.W. Design of Data Injection Tool Based on CCSDS RASDS Information Object Modeling Method. Spacecr. Eng. 2023, 32, 90–96.
Figure 1. Interaction diagram of task planning capabilities and overall architecture.
Figure 2. Schematic and data flow diagram of an actor–critic-based hyper-heuristic autonomous task planning algorithm.
Figure 3. Schematic representation of the structure and shape of the actor network in the actor–critic method.
Figure 4. Schematic representation of the structure and shape of the critic network in the actor–critic method.
Figure 5. Change in reward function during reinforcement learning training.
Figure 6. Schematic folded and stacked plots of operator choices before (1 session) and after (4 sessions) training.
Figure 7. Plot of four iterations of run case for test case 8.
Figure 8. The basic structure of the ground collaborative management and control system.
Figure 9. Convergence plot after running the algorithm on the example.
Table 1. Summary of research in autonomous mission planning methods.

| Category | Methods | Features | Applications | Limitations |
|---|---|---|---|---|
| Traditional Methods | First-order Logic; STRIPS; Situation Calculus; PDDL Variants | Logical rigor and strict syntactic structures; supports complex problem descriptions; allows detailed domain modeling | Deep Space 1 (DS1); Cassini–Huygens mission; Mars Rover missions | Inflexible in dynamic, unpredictable environments typical of deep space; too rigid for complex scenarios |
| Heuristic Algorithms | RGP; GBFS; SHGA | Employs intuitive solution paths; facilitates quick convergence to satisfactory solutions; scalable to large problem sizes | Hubble Space Telescope Servicing Missions; Earth Observing-1 (EO-1); Autonomous Nano Satellite Guardian Evaluating Local Space (ANGELS) | Suboptimal in complex, multi-variable environments; often fails to find the global optimum, limited by specific heuristic rules |
| Meta-Heuristic Algorithms | GA; SA; PSO; H-GASA | Capable of exploring large search spaces; adaptable to varying problem constraints; can find near-optimal solutions with sufficient computational resources | Swarm satellite systems; DARPA's Orbital Express; Galaxy 15 satellite reactivation | May require extensive computation; can struggle with convergence in highly complex environments; generalization across different tasks can be poor |
| Reinforcement Learning | DRL; DQN; DDPG; RLPT; SMDP | Continuous learning from environment interaction; adjusts strategies based on reward feedback; suitable for dynamic adaptation | Lunar Gateway (NASA's planned space station in lunar orbit); SPHERES satellites on the ISS; Mars Sample Return Rover | Limited by the need for large amounts of training data; impractical for online training in deep space; can be overly sensitive to hyperparameters and initial conditions |
Table 2. Summary of autonomous spacecraft task planning processes.

| Name | Decision-Making | Planning | Scheduling |
|---|---|---|---|
| Input | Spacecraft's current environment and status | Unordered set of task-level objective commands without timestamps | Task-level objective commands |
| Output | Task-level objective commands and their parameters | Timestamped sequence of task-level objective commands | Schedule-level commands; primitive-level commands |
| Problem Category | Decision problem | Optimization problem | Decomposition problem |
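To make the data flow between these three stages concrete, the sketch below chains them as plain Python functions whose inputs and outputs mirror Table 2. All type names and function signatures here are hypothetical illustrations, not the paper's onboard software interfaces, and the planner body is a trivial placeholder for the hyper-heuristic optimization described in the main text.

```python
# Illustrative sketch only: hypothetical types/functions that mirror the inputs and
# outputs listed in Table 2; not the paper's actual onboard software interfaces.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TOC:                                  # task-level objective command
    name: str
    params: dict
    timestamp: Optional[float] = None       # assigned by the planning stage

@dataclass
class SpacecraftState:                      # decision-making input
    position: tuple
    resources: dict
    detected_events: List[str] = field(default_factory=list)

def decide(state: SpacecraftState) -> List[TOC]:
    """Decision-making: map the current environment/status to an unordered TOC set."""
    return [TOC("observe_event", {"target": e}) for e in state.detected_events]

def plan(tocs: List[TOC], start: float = 0.0, step: float = 60.0) -> List[TOC]:
    """Planning: order the TOCs and assign timestamps (here a naive placeholder for
    the optimization problem solved by the hyper-heuristic algorithm)."""
    for i, toc in enumerate(sorted(tocs, key=lambda t: t.name)):
        toc.timestamp = start + i * step
    return sorted(tocs, key=lambda t: t.timestamp)

def schedule(planned: List[TOC]) -> List[str]:
    """Scheduling: decompose each timestamped TOC into lower-level commands."""
    return [f"{toc.timestamp:.0f}s {toc.name} {toc.params}" for toc in planned]
```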
Table 3. Summary of symbols and definitions for the task sequence planning model.

| Term | Definition |
|---|---|
| $T$ | Represents the set of TOCs, $T = \{t_1, t_2, \ldots, t_n\}$ |
| $x_{ijk}$ | Equals 1 if task $j$ is executed at position $k$ following task $i$, otherwise 0 |
| $d_{ij}^{m}$ | The cost in the $m$-th dimension (e.g., time, fuel consumption) of transitioning from task $i$ to task $j$ |
| $r_{i}^{n}$ | The amount of the $n$-th type of resource required to perform task $i$ |
| $R_{\mathrm{total}}^{n}$ | Total quantity of the $n$-th type of resource |
| $e_{ij}^{n}$ | The amount of the $n$-th type of resource obtained after completing task $j$ through task $i$ |
| $c_{ij}$ | Cost–benefit ratio from task $i$ to task $j$ |
| $P_{\min}$ | Minimum cost–benefit ratio threshold |
| $W_{ij}^{n}$ | Time window available for resource $n$ between tasks $i$ and $j$; equals 1 if within the window, otherwise 0 |
| $\tau_{\max}^{ij}$ | Maximum allowable time interval between tasks $i$ and $j$ |
| $M$ | A sufficiently large number ensuring that certain constraints are inactive when $x_{ijk} = 0$ |
| $v_{i}^{n}$ | The consumption rate of the $n$-th type of resource during the execution of task $i$ |
| $u_{i}$ | The relative position of task $i$ during task execution |
| $t_{ij}$ | The time required to transition from task $i$ to task $j$ |
| $V_{ij}$ | The profit obtained after executing task $i$ and then task $j$ |
| $S_{ij}$ | Task feasibility constraint parameter, representing the constraint condition for whether a task is executable |
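To illustrate how these symbols combine in the task sequence planning model, the block below writes out two representative constraints in the notation of Table 3: a resource-capacity constraint and a big-M time-interval linking constraint. These are assumed, illustrative forms; the paper's exact formulation is given with the model in the main text.

```latex
% Illustrative constraints in the notation of Table 3 (assumed forms, not the
% paper's verified model).
% Resource capacity: total consumption of resource n over all selected transitions
% must not exceed the available quantity.
\sum_{i}\sum_{j}\sum_{k} r_i^{\,n}\, x_{ijk} \le R_{\mathrm{total}}^{\,n}, \qquad \forall n
% Big-M linking: if task j follows task i at position k (x_{ijk} = 1), the transition
% time must respect the maximum allowable interval; otherwise M deactivates the bound.
t_{ij} \le \tau_{\max}^{ij} + M\,(1 - x_{ijk}), \qquad \forall i, j, k
```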
Table 4. Comparative overview of selected optimization algorithms.

| No. | Algorithm Name | Principle of the Algorithm | Feature of the Algorithm | Explanation of the Feature |
|---|---|---|---|---|
| 1 | PSO | Simulates the foraging behavior of bird flocks, moving through the search space via collaboration and information sharing to find the optimal solution. | Global Search Capability | Particles in the particle swarm optimization algorithm explore the space randomly, with stochastic parameters ensuring that different particles explore different areas. |
| 2 | GWO | Emulates the social hierarchy and group hunting behaviors of grey wolves, simulating the processes of tracking, encircling, and capturing prey in the search space to find the optimal solution. | Fast Convergence Rate | Utilizes the leader-and-follower mechanism along with the strategy of encircling prey to ensure a fast convergence rate in the search space. |
| 3 | SCA | Adjusts the search path of solutions using sine and cosine rules. | Global Search Capability | Leverages the properties of the mathematical sine and cosine functions to quickly adjust the direction and position of solutions, improving the global search capability. |
| 4 | DE | Simulates the evolutionary process of biological populations, finding optimal solutions through iterative mutation, crossover, and selection operations. | High Quality of Solution Optimization | Efficiently adapts to diverse optimization landscapes, consistently delivering high-quality solutions even in complex problem spaces. |
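For readers unfamiliar with the low-level operators, the following is a minimal, generic particle swarm optimization sketch for a continuous objective, using the standard inertia-weight velocity update with coefficients c1 and c2 (the values echo Table 6). It is illustrative only: the paper applies PSO and the other operators to the discrete task-sequencing problem, which additionally requires an encoding/decoding step not shown here.

```python
import numpy as np

def pso_minimize(f, dim, bounds, n_particles=200, iters=1000,
                 w=0.9, c1=2.0, c2=1.0, seed=0):
    """Minimal generic PSO for a continuous objective f: R^dim -> R (illustration only)."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_particles, dim))      # positions
    v = rng.uniform(-0.2, 0.2, size=(n_particles, dim))   # velocities (cf. Table 6)
    pbest = x.copy()
    pbest_val = np.apply_along_axis(f, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()

    for _ in range(iters):
        r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
        # Standard update: inertia + cognitive pull toward pbest + social pull toward gbest.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        val = np.apply_along_axis(f, 1, x)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], val[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Example: minimize the sphere function in 5 dimensions.
best_x, best_f = pso_minimize(lambda z: float(np.sum(z**2)), dim=5, bounds=(-10, 10))
```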
Table 5. Environment parameter settings.

| Property | Value |
|---|---|
| Spatial Range | (0,0) to (100,100) |
| Spacecraft Position Range | (0,0) to (100,100) |
| Max TOC Capacity | 100 |
| Number of Generated TOCs | (0,100) |
Table 6. Hyper-heuristic algorithm and low-level operator parameters.

| Object | Property | Value |
|---|---|---|
| RL strategy | Action Dimension | 4 |
| | Actor Learning Rate | 2 × 10⁻⁴ |
| | Critic Learning Rate | 3 × 10⁻³ |
| | Hidden Dimension | 120 |
| | Discount Factor (γ) | 0.95 |
| | Entropy Beta | 0.05 |
| | Epsilon Start | 1.5 |
| | Epsilon End | 0.01 |
| | Epsilon Decay | 500 |
| | Num Episodes | 10,000 |
| | Optimizer | Adam |
| | Actor Scheduler Gamma | 0.9 |
| | Critic Scheduler Gamma | 0.9 |
| SCA | Scale | 200 |
| | Growth Rate | 0.1 |
| | Competition Rate | 0.1–0.25 |
| | Iterations | 1000 |
| PSO | Scale | 200 |
| | C1 | 2.0 |
| | C2 | 1.0 |
| | W | Typically between 0.5 and 1.0 |
| | Velocity | −0.2 to 0.2 |
| | Iterations | 1000 |
| | w | 0.9 |
| DE | Scale | 200 |
| | Differential Weight | 0.5 |
| | Crossover Probability | 0.9 |
| | Iterations | 1000 |
| GWO | Scale | 200 |
| | A | Linearly decreasing from 2 to 0 |
| | C | Random values between 0 and 2 |
| | Iterations | 1000 |
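The reinforcement-learning entries in Table 6 can be read together as a training configuration. The sketch below shows one common interpretation: Epsilon Start/End/Decay drive an exponentially annealed exploration rate, and the actor and critic each get an Adam optimizer with an exponential learning-rate scheduler using the listed gammas. The schedule form, network shapes, and state dimension are assumptions for illustration, not the paper's verified implementation.

```python
import math
import torch

# Values taken from Table 6, interpreted with the usual conventions for these names.
EPS_START, EPS_END, EPS_DECAY = 1.5, 0.01, 500
ACTOR_LR, CRITIC_LR = 2e-4, 3e-3
SCHEDULER_GAMMA = 0.9

def epsilon(episode: int) -> float:
    """Common exponential annealing of the exploration rate; capped at 1.0 because it
    is used as the probability of picking a random low-level operator."""
    eps = EPS_END + (EPS_START - EPS_END) * math.exp(-episode / EPS_DECAY)
    return min(eps, 1.0)

# Hypothetical actor/critic modules: action dimension 4 comes from Table 6,
# the state dimension 8 is a placeholder.
actor = torch.nn.Sequential(torch.nn.Linear(8, 120), torch.nn.ReLU(), torch.nn.Linear(120, 4))
critic = torch.nn.Sequential(torch.nn.Linear(8, 120), torch.nn.ReLU(), torch.nn.Linear(120, 1))

actor_opt = torch.optim.Adam(actor.parameters(), lr=ACTOR_LR)
critic_opt = torch.optim.Adam(critic.parameters(), lr=CRITIC_LR)
actor_sched = torch.optim.lr_scheduler.ExponentialLR(actor_opt, gamma=SCHEDULER_GAMMA)
critic_sched = torch.optim.lr_scheduler.ExponentialLR(critic_opt, gamma=SCHEDULER_GAMMA)

print(epsilon(0), epsilon(500), epsilon(5000))   # ~1.0, ~0.56, ~0.01
```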
Table 7. Values of the fitness function for the four underlying operators and the algorithm of this paper tested in 16 instances. Each row lists the Instance ID, the number of tasks in the instance, and then the Avg., Min., and Std. values for PSO, GWO, SCA, DE, and AC-HATP in turn.
Task Case1101951934.732001937.341951934.291931930.681931930.72
Task Case21427024412.0726624416.1226624415.362692448.992642449.13
Task Case31832129223.9132129223.1935030428.42992921.982982923.25
Task Case42247140745.6444840337.5754248936.14033839.0840338312.06
Task Case52656342170.7649441052.8966255853.0841638715.141038720.06
Task Case63060550656.4953545061.2575567645.7343641233.3843941535.25
Task Case734731564134.958644090.93105585998.8342841124.6742541141.12
Task Case838900679141.3371559289.13122799588.8958849956.8859349972.23
Task Case9421017774145.8876957098.641277107786.1163053463.8461653275.86
Task Case10461224931232.78948707175.211582146872.170664090.3701633102.5
Task Case115012641038144.3887597144.491654148199.7968959169.2265558480.36
Task Case125416541099287.041235785340.971938176587.5879709123.98863702159.56
Task Case135818151370260.311143891133.382099193191.08959768136.3933745122.5
Task Case146222071670289.411431982347.482446226795.821172949131.41103932153.2
Task Case156619301545201.281191745367.682220204572.181190999134.941186993155.36
Task Case167023201629245.5415051126251.422549234987.7914241068178.0413851033186.46
Table 8. Weighted convergence speed (WCS) of the four underlying operators and the algorithm of this paper tested in 16 instances. Each row lists the Instance ID, the number of tasks in the instance, and then the Avg., Max., and Std. WCS values for PSO, GWO, SCA, DE, and AC-HATP in turn.
Task Case1100.00010.00020.00010.007818.70910.21490.005615.67610.15530.009813.07670.23220.010417.16450.4554
Task Case2140.00030.00050.00010.032446.77650.84290.016946.16570.48440.028250.89210.64740.036648.37440.2364
Task Case3180.00010.00020.00010.024325.11330.5670.020731.51070.5460.023332.64950.42820.022536.66330.3665
Task Case4220.00080.0010.00030.036362.14550.9270.017826.47110.39280.031734.25780.60990.032445.58430.5364
Task Case5260.00070.00090.00020.029669.42110.84480.025549.64330.7440.048172.32221.03560.036863.3310.8223
Task Case6300.00020.00030.00010.034958.99270.89130.015830.02020.39710.04347.69870.78760.036162.33720.5746
Task Case7340.00030.00040.00010.04764.0981.05870.029348.11680.84510.0599113.49061.35270.051172.26610.3661
Task Case8380.00050.00070.00020.034337.88410.73750.024784.39450.88280.043569.8390.86950.039556.13530.4366
Task Case9420.00070.00080.00030.032788.37891.03440.029458.61440.77440.04554.00820.80370.043856.63320.8735
Task Case10460.00030.00050.00010.044586.41281.11920.028141.19670.6910.056897.90281.33350.063375.40030.8614
Task Case11500.00010.00020.00010.033549.17170.85690.03887.6881.22550.052852.08910.960.046269.43870.6339
Task Case12540.00030.00050.00010.028544.13030.78580.035984.12541.10720.043380.9511.01530.045876.64360.7268
Task Case13580.00050.00070.00020.042177.94411.16030.033374.67410.99050.056484.48541.32020.049586.63950.5244
Task Case14620.00060.00070.00030.03478.18471.06990.032873.48010.94860.046374.2911.05930.049772.36640.7353
Task Case15660.00020.00030.00010.041280.93671.19750.028855.80440.8020.037246.28110.66290.044796.13530.9562
Task Case16700.00040.00050.00020.034565.9410.89350.0395104.45151.32690.046950.95320.93810.048972.10541.0254
Table 9. Diversity indices of solutions tested by the four underlying operators and the algorithm of this paper in 16 instances.

| Instance ID | Number of Tasks in Instance | PSO-Avg. | GWO-Avg. | SCA-Avg. | DE-Avg. | AC-HATP-Avg. |
|---|---|---|---|---|---|---|
| Task Case1 | 10 | 0.0502 | 0.1141 | 0.302 | 0.0057 | 0.1336 |
| Task Case2 | 14 | 0.1834 | 0.3873 | 0.7741 | 0.0078 | 0.1845 |
| Task Case3 | 18 | 0.8787 | 0.5562 | 0.877 | 0.0152 | 0.2259 |
| Task Case4 | 22 | 0.898 | 0.6594 | 0.9249 | 0.0188 | 0.2746 |
| Task Case5 | 26 | 0.9926 | 0.773 | 0.9449 | 0.0232 | 0.3103 |
| Task Case6 | 30 | 1 | 0.8372 | 0.9586 | 0.0271 | 0.3065 |
| Task Case7 | 34 | 1 | 0.8662 | 0.9709 | 0.032 | 0.3564 |
| Task Case8 | 38 | 1 | 0.8814 | 0.9765 | 0.0262 | 0.3362 |
| Task Case9 | 42 | 1 | 0.9149 | 0.9812 | 0.0279 | 0.3523 |
| Task Case10 | 46 | 1 | 0.9407 | 0.9844 | 0.0303 | 0.3198 |
| Task Case11 | 50 | 1 | 0.9602 | 0.9881 | 0.0315 | 0.3664 |
| Task Case12 | 54 | 1 | 0.9577 | 0.99 | 0.0293 | 0.3342 |
| Task Case13 | 58 | 1 | 0.9631 | 0.991 | 0.0295 | 0.3462 |
| Task Case14 | 62 | 1 | 0.9681 | 0.9925 | 0.0289 | 0.3321 |
| Task Case15 | 66 | 1 | 0.982 | 0.9935 | 0.0262 | 0.3558 |
| Task Case16 | 70 | 1 | 0.9794 | 0.9946 | 0.0255 | 0.3226 |
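Table 9 reports a solution diversity index for each algorithm. As a generic illustration of how such an index can be computed over a candidate population, the sketch below uses the mean pairwise distance normalized by the maximum pairwise distance, so that 0 means all candidates are identical. This is an assumed convention for illustration and not necessarily the definition used in the paper.

```python
import numpy as np

def diversity_index(population: np.ndarray) -> float:
    """Generic population diversity: mean pairwise Euclidean distance divided by the
    largest pairwise distance (0 = all candidates identical). One common convention;
    the exact index used in Table 9 is defined in the main text and may differ."""
    diffs = population[:, None, :] - population[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    max_d = dists.max()
    if max_d == 0.0:
        return 0.0
    n = len(population)
    # Sum counts each ordered pair once (diagonal contributes zero).
    return float(dists.sum() / (n * (n - 1)) / max_d)

rng = np.random.default_rng(0)
print(diversity_index(rng.uniform(0, 100, size=(200, 10))))  # prints a value in (0, 1]
```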
Table 10. Fitness function values of the six algorithms and the algorithm in this paper tested in 16 instances. Each row lists the Instance ID, the number of tasks in the instance, and then the Avg., Min., and Std. values for GA, SA, WDO, TSA, DMAB, SLMAB, and AC-HATP in turn.
Task Case110227 204 9.03196 193 3.57222 203 9.72142041.121931930.851931930.801931930.72
Task Case214401 335 21.82302 278 10.43367 322 17.4139826413.2626524412.2426524410.252642449.13
Task Case318546 480 27.68431 392 17.62519 441 36.5357852310.543012927.763002944.532982923.25
Task Case422774 719 28.96648 599 21.65735 670 25.077527094.3441238320.1340838514.6340338312.06
Task Case526974 862 39.83809 741 28.39954 869 42.2799380313.2541639221.6642139419.6841038720.06
Task Case6301014 882 43.6893 857 27.29997 933 38.66102295622.3644642339.4445343637.1843941535.25
Task Case7341340 1246 43.321160 1070 38.451298 1172 51.561428120132.1143741139.2444041145.5642541141.12
Task Case8381527 1450 32.271373 1336 23.171532 1454 44.091639146619.8760152256.3961251980.1359349972.23
Task Case9421569 1503 39.741401 1305 41.661555 1457 36.821663158529.6362954682.2562755987.7561653275.86
Task Case10461895 1839 37.211727 1660 33.171903 1766 64.291934189631.56715665110.68723667119.32701633102.5
Task Case11501950 1849 53.021769 1665 40.11856 1709 53.642033199442.2267160181.1366759675.5265558480.36
Task Case12542252 2138 42.692044 1908 62.172170 2034 56.112278215636.26878726113.5891731183.21863702159.56
Task Case13582390 2256 65.732226 2154 32.522387 2212 66.652456233456.4796177892.2955751135.33933745122.5
Task Case14622822 2709 53.252615 2541 38.662744 2611 57.292874283623.561128940155.761145958159.361103932153.2
Task Case15662480 2372 44.452321 2259 31.742397 2270 53.482503246824.661203993160.2312061001140.111186993155.36
Task Case16702804 2644 56.972625 2438 60.242788 2565 81.52866278656.3314051047194.3714111065196.7313851033186.46
Table 11. Weighted convergence speed (WCS) for six algorithms and the algorithm in this paper tested in 16 instances. Each row lists the Instance ID, the number of tasks in the instance, and then the Avg. and Std. WCS values for GA, SA, WDO, TSA, DMAB, SLMAB, and AC-HATP in turn.
Task Case1100.0120.54140.00660.16220.00250.10230.00720.12440.00930.50330.00910.46290.01040.4554
Task Case2140.01690.77640.01730.42390.01140.41140.02330.56620.02720.43310.02690.25910.03660.2364
Task Case3180.02221.01730.01770.48710.00760.30590.01560.25560.01830.42610.02030.40080.02250.3665
Task Case4220.03691.39350.01970.46480.01960.67460.01130.53110.02940.56490.02770.51470.03240.5364
Task Case5260.0411.43870.0270.71550.01320.44210.01520.32050.03110.79470.03050.93610.03680.8223
Task Case6300.03771.41420.02040.54530.01650.59290.01540.43310.02740.70140.02890.65050.03610.5746
Task Case7340.03231.59120.03850.98920.02040.8710.01980.79230.04130.50650.03310.49620.05110.3661
Task Case8380.05252.35080.03621.22450.01040.46220.01220.31120.03770.56520.04250.53290.03950.4366
Task Case9420.03861.39960.02740.67010.02320.83220.0210.21950.03840.96660.03280.94960.04380.8735
Task Case10460.04351.24140.04071.26350.01070.39350.01310.26930.05690.85530.05440.91430.06330.8614
Task Case11500.0752.71750.03410.81460.03371.3040.02490.63740.04230.61410.04560.62940.04620.6339
Task Case12540.04651.97020.0421.10920.0271.15010.02351.05480.04960.80560.04770.79980.04580.7268
Task Case13580.06282.53160.03830.90560.02571.15720.02641.10220.05350.61170.04830.63540.04950.5244
Task Case14620.05221.95230.03760.85410.02641.08320.03310.93210.04870.89610.04920.79650.04970.7353
Task Case15660.03351.03490.04561.26450.02820.8810.02630.85960.04230.95640.03990.93220.04470.9562
Task Case16700.05671.76810.04851.20160.01650.64160.03110.91080.04750.41090.04140.36430.04891.0254
Table 12. Algorithm Composite Evaluation Index for six algorithms and the algorithm in this paper tested in 16 instances. Each row lists the Instance ID, the number of tasks in the instance, and then the composite index for PSO, GWO, SCA, DE, GA, SA, WDO, TSA, DMAB, SLMAB, and AC-HATP in turn; the final row reports the number of instances in which each algorithm attained the maximum index.
Task Case1100.9641.0281.0271.0431.0381.0270.9641.0031.0471.0491.055
Task Case2141.0821.1671.1541.1360.9931.0951.011.0071.151.1541.167
Task Case3181.271.3161.2861.321.0711.1711.0421.0131.3191.3291.328
Task Case4221.3311.4061.271.4411.0571.1351.04711.4391.4431.451
Task Case5261.5221.6541.3991.791.0591.21.0240.9511.7881.7791.8
Task Case6301.5021.6431.3071.8071.0531.1341.0210.9871.7861.7751.802
Task Case7342.0252.3841.4572.8251.0981.3071.1250.9852.7982.7872.836
Task Case8382.122.6031.5142.9861.1581.3081.0890.9762.9462.9122.971
Task Case9421.9162.5211.4842.931.1311.2981.11712.9352.942.977
Task Case10462.0542.7861.4253.631.0661.2341.0110.9733.5973.5653.653
Task Case11502.193.3181.4674.1261.1321.2921.1860.9874.214.2294.285
Task Case12541.8712.9641.4234.3861.0781.281.121.0034.3944.3314.466
Task Case13581.9063.9911.4374.891.1171.2611.0710.9984.884.9125.033
Task Case14621.964.6071.5466.1381.1051.3031.141.026.4466.3266.627
Task Case15661.7723.9881.3443.9881.0591.2341.1251.0183.9343.9214.008
Task Case16701.7214.2081.3914.61.1151.2921.071.0154.74.6684.804
Number of Maximums000200000113
Table 13. Format and parameters of the TOCs.

| Command Name | Command Meaning | ID | Number of Parameters | Parameter 1 | Value | Parameter 2 | Value |
|---|---|---|---|---|---|---|---|
| Rotary table load scheduling | Execute the rotary table command. | 0x65 | 2 | Azimuth | 0–180 | Pitch angle | 0–180 |
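As an illustration of how a TOC such as the one in Table 13 might be packed into a byte stream for onboard exchange, the snippet below encodes the command ID and its two parameters in a simple fixed layout. The byte widths and ordering are assumptions for illustration; the mission's actual command format is defined elsewhere.

```python
import struct

def encode_toc(cmd_id: int, params: list[int]) -> bytes:
    """Pack a task-level objective command as: ID (1 byte), parameter count (1 byte),
    then each parameter as an unsigned 16-bit big-endian value. Illustrative layout only."""
    for p in params:
        if not 0 <= p <= 0xFFFF:
            raise ValueError(f"parameter out of range: {p}")
    return struct.pack(f">BB{len(params)}H", cmd_id, len(params), *params)

# Rotary table load scheduling (Table 13): ID 0x65, azimuth and pitch angle in 0-180.
frame = encode_toc(0x65, [90, 45])
print(frame.hex())   # -> "6502005a002d"
```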
Table 14. Table of results after running the algorithm on the instance.

| Instance | Instance 1 | Instance 2 | Instance 3 | Instance 4 |
|---|---|---|---|---|
| Total Run Time (s) | 20.4743 | 20.0304 | 20.6862 | 19.8063 |
| Convergence Run Time (s) | 4.8613 | 3.7557 | 4.1372 | 2.4757 |
| Initial Fitness Value | 1963 | 2189 | 2023 | 2045 |
| Execution Result Fitness Value | 772 | 820 | 619 | 750 |
| Solution Diversity Index | 0.3177 | 0.3032 | 0.2993 | 0.3013 |
| Memory Usage (KB) | 9284 | 8963 | 9032 | 9114 |