Reinforcement Learning-Based Formation Pinning and Shape Transformation for Swarms

Abstract: Swarm models are important because they capture the collective behavior of self-organized systems. The Boids model is a fundamental framework for studying emergent behavior in swarm systems. It simulates the emergent behavior of autonomous agents through rules such as alignment, cohesion, and repulsion to imitate natural flocking movements. However, traditional Boids models often lack pinning and cannot adapt quickly to a dynamic environment. To address this limitation, we introduce reinforcement learning into the Boids framework to solve the problems of disorder and the lack of pinning. The aim of this approach is to enable drone swarms to adapt quickly and effectively to dynamic external environments. We propose a method based on the Q-learning network to improve the cohesion and repulsion parameters in the Boids model, achieving continuous obstacle avoidance and maximum spatial coverage in the simulation scenario. Additionally, we introduce a virtual leader to provide pinning and coordination stability, reflecting the leadership and coordination seen in drone swarms. To validate the method, we demonstrate the model's capabilities through empirical experiments with drone swarms and show the practicality of the RL-Boids framework.


Introduction
In recent years, swarm intelligence has emerged as a promising approach to solving complex problems by harnessing the collective behavior of multiple autonomous agents [1][2][3][4][5]. Swarms consist of many relatively simple individuals that interact with each other and with their environment to achieve self-organized behavior and accomplish tasks collectively [6,7]. Formation adjustment in response to dynamic task requirements or environmental conditions is crucial for maintaining efficiency and adaptability. Based on the Boids model [8], Vicsek investigates the conditions for reaching an agreement on the direction of motion of individuals in a swarm from a statistical mechanics perspective. He also proposes a model that simulates and explains swarming, transport, and phase transition in a non-equilibrium system (the Vicsek model) and finally implements a swarm of 30 UAVs in outdoor flight [9]. Reference [10] proposes a predictive model that describes swarming based on the Boids model; it merges the local principles of the potential field model into an objective function and optimizes these principles with knowledge of the dynamics and environment of the agents.
One of the critical challenges in swarm systems is to optimize their size dynamically, based on the environmental conditions and task requirements. Traditional approaches to swarm transformation often rely on predefined rules or centralized control, which limits their flexibility and adaptability. Reinforcement learning, by contrast, has demonstrated its power as an approach for training agents to make intelligent decisions in complex and dynamic environments.
Reinforcement learning offers a promising avenue for addressing this challenge by enabling swarm agents to learn and adapt their behaviors autonomously. Through interactions with the environment, agents acquire knowledge and refine their decision-making processes based on feedback received as rewards or penalties. This iterative learning process empowers the agents to navigate intricate scenarios, optimize their swarm size, and efficiently fulfill given tasks.
Reinforcement learning algorithms based on dynamic programming are commonly divided into three types: (1) Value iteration. Ref. [11] adopts Q-learning networks to implement behavioral decision making in robots to improve the analysis and prediction capabilities of agents. However, Q-learning networks cannot solve the problem of discrete actions in continuous states. Therefore, Ref. [12] proposes a framework for adaptive formation control of multiple robots based on Deep Q-learning networks; (2) Policy iteration. Policy iteration can solve the problem of continuous states and continuous actions. Ref. [13] uses a robust policy gradient algorithm to optimize a fully decentralized sensor-level collision avoidance policy; (3) The Actor-Critic algorithm. Because policy gradient methods rely on the Monte Carlo method, their gradient estimates have a large variance, which motivated the Actor-Critic family of algorithms. Ref. [14] develops a MAPPO-based distributed formation and obstacle avoidance approach in which agents use only their local and relative information to make motion decisions. Moreover, Ref. [15] uses a two-stage training scheme of imitation learning and reinforcement learning and proposes a fused reward function to guide the agents. One fascinating application of RL is simulating collective behavior, where groups of autonomous agents interact to achieve coordinated objectives [16][17][18].
In addition to reinforcement learning, we also introduce a virtual leader. Multi-robot control often uses the virtual leader approach due to its high robustness [19][20][21]. Olfati-Saber [22] has significantly contributed to advancing scalable flocking algorithms via a comprehensive theoretical and computational framework. Within this framework, there are three distinct flocking algorithms. The first algorithm adheres to the three fundamental rules outlined by Reynolds, while the third algorithm incorporates obstacle avoidance capabilities. Central to this framework is the second algorithm, designed as the primary flocking mechanism for agents navigating open space. In this second algorithm, the objective is to ensure that the agent group follows the trajectory of a virtual leader. To that end, a navigational feedback mechanism is integrated into each agent. Crucially, it is presupposed that every agent within the group possesses information about the virtual leader, thereby qualifying as an informed agent. This assumption is pivotal, ensuring that the entire group maintains cohesiveness and converges to the same velocity. Each agent responds to the virtual leader as it does to its actual neighbors. The purpose of the virtual leader is to introduce the task of directing, gathering, or manipulating the behavior of agents.
Therefore, this paper proposes a method to select the cohesive and repulsive parameters in the Boids model based on the Q-learning network. Our prescribed states and actions, together with special environmental interaction rules, can filter the optimal combination of cohesion and repulsion parameters. By changing the parameters of cohesion and repulsion in the algorithm, we can achieve a simulation scenario with continuous obstacle avoidance and maximum coverage of space. At the same time, we introduce virtual leaders to facilitate stable coordination among intelligent agents and trigger the expansion and contraction of formations [23,24].
To tackle the problem, we also conduct experiments using meta-heuristic approaches such as genetic algorithms [25], ant colony optimization [26], and particle swarm optimization [27]. They achieve results similar to those of RL in the experiments. However, they cannot handle a dynamic environment, and their time complexity is high because they need a large number of iterations. In some real-time planning scenarios, the environment can change rapidly. Although RL is computationally expensive in training (known as learning offline), once the policy is well trained, decisions can be made quickly (known as planning online).
Experimental results demonstrate that the RL-Boids method exhibits good performance. Reinforcement learning offers a promising avenue for allowing Boids to adapt autonomously and make intelligent decisions. By allowing Boids to learn and optimize their behaviors via interactions with the environment, we achieve more robust and flexible collective behaviors.
The rest of this paper is organized as follows. In Section 2, we describe the Boids model and formulate the problem. In Section 3, we introduce the basic principles of the Q-learning network. In Section 4, we outline simulation scenarios, train parameters with Q-learning, and present the simulation results. In Section 5, we apply the algorithm in real robot experiments and confirm its feasibility with three drones. Finally, in Section 6, we draw some conclusions and discuss prospects.

Model Description and Problem Formulation

Creation of Boids Model
To facilitate easy referencing, Table 1 provides the definitions of the essential symbols. In this paper, we consider a swarm of $n$ agents labelled as $i, j \in \{1, \cdots, n\}$. The position, velocity, and control input of the i-th agent are denoted by $p_i, v_i, u_i \in \mathbb{R}^3$, respectively. Let $d_{ij} = \|p_j - p_i\|$ denote the central distance between agent $i$ and agent $j$, where $\|\cdot\|$ denotes the Euclidean norm. We model the swarm with an undirected perceptual graph $G = (V, E)$, where the set of vertices $V = \{1, \cdots, n\}$ represents the agents and the set of edges $E \subseteq V \times V$ indicates that the agents $(i, j) \in E$ can communicate with each other. $N_i = \{j \in V \mid (i, j) \in E\}$ denotes the neighbor set of agent $i$ in $G$, and $|\cdot|$ denotes the cardinality of a set.
We define the set of neighbors using the topological range, that is, the set $N_i$ contains the $|N_i|$ nearest neighbors in the vicinity of agent $i$. This choice conveniently keeps the cardinality of the set of neighbors constant, and the above definition holds for biological swarms [28].
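A topological neighbor set can be computed by ranking the other agents by distance and keeping a fixed number of them. The sketch below is only an illustration of that idea: the function name `topological_neighbors` and the parameter `m` (the fixed neighbor count $|N_i|$) are hypothetical and not taken from the paper.

```python
import numpy as np

def topological_neighbors(positions: np.ndarray, m: int) -> np.ndarray:
    """Return, for every agent, the indices of its m nearest agents.

    positions has shape (n, 3); m is an assumed fixed neighbor count.
    """
    # Pairwise offsets p_j - p_i and distances d_ij, shape (n, n).
    diff = positions[None, :, :] - positions[:, None, :]
    dist = np.linalg.norm(diff, axis=-1)
    # An agent is not its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Sort each row by distance and keep the m closest indices.
    return np.argsort(dist, axis=1)[:, :m]

# Four agents on a line (illustrative coordinates).
positions = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [3.0, 0.0, 0.0],
                      [7.0, 0.0, 0.0]])
neighbors = topological_neighbors(positions, m=2)
```

Every row of `neighbors` has exactly `m` entries regardless of how far apart the agents are, which is the constant-cardinality property mentioned above.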
In the simulation, the dynamics of the agents are reproduced in discrete time. Let $p_i(k), v_i(k), u_i(k) \in \mathbb{R}^3$ be the position, velocity, and control input of the i-th agent at the moment $k = t/T$, respectively. We simplify the model to the rules of repulsion and cohesion. By altering the cohesion and repulsion parameters, the agents in the swarm can be forced to converge to different points, which produces the expansion and contraction of the formation. Set the communication distance among the agents as $d_{com}$ and the boundary distance between repulsion and cohesion as $d_0$ ($d_{com} > d_0$). Set the cohesive parameter as $k_{coh}$ and the repulsive parameter as $k_{rep}$. Three cases follow:
• Cohesion: when $d_0 < d_{ij}(k) < d_{com}$, cohesion occurs between agent $i$ and neighbor $j$, and the effect of neighbor $j$ on the velocity of agent $i$ is governed by $k_{coh}$.
• Repulsion: when $d_{ij}(k) < d_0$, repulsion occurs between agent $i$ and neighbor $j$, and the effect of neighbor $j$ on the velocity of agent $i$ is governed by $k_{rep}$.
• No interaction: when $d_{ij}(k) > d_{com}$, there is no communication between agent $i$ and neighbor $j$.
The control input $u_i(k)$ of any agent $i$ collects the effects of its neighbors, and the position $p_i(k+1)$ of the agent at the next moment is obtained from $p_i(k)$ and this control input.
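To make the three distance regimes concrete, the following sketch implements one discrete-time step under explicitly assumed interaction terms. The spring-like forms (proportional to the deviation of $d_{ij}$ from $d_0$), the summation over all agents within range, and the update $p_i(k+1) = p_i(k) + T u_i(k)$ are assumptions made for illustration; they are not the paper's equations. The name `boids_step` is hypothetical.

```python
import numpy as np

def boids_step(p, k_coh, k_rep, d0, d_com, T):
    """One assumed cohesion/repulsion step for positions p of shape (n, 3)."""
    n = len(p)
    u = np.zeros_like(p)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            offset = p[j] - p[i]
            d_ij = np.linalg.norm(offset)
            # Coincident agents or agents beyond d_com do not interact.
            if d_ij == 0.0 or d_ij >= d_com:
                continue
            direction = offset / d_ij
            if d_ij > d0:
                # Cohesion regime: pull agent i towards neighbor j (assumed form).
                u[i] += k_coh * (d_ij - d0) * direction
            elif d_ij < d0:
                # Repulsion regime: push agent i away from neighbor j (assumed form).
                u[i] -= k_rep * (d0 - d_ij) * direction
    # Assumed first-order position update with sampling period T.
    return p + T * u, u

p_far = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
p_next_far, _ = boids_step(p_far, k_coh=0.5, k_rep=1.0, d0=2.0, d_com=5.0, T=1.0)

p_close = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
p_next_close, _ = boids_step(p_close, k_coh=0.5, k_rep=1.0, d0=2.0, d_com=5.0, T=1.0)
```

With these assumed terms, raising `k_coh` contracts the formation and raising `k_rep` expands it, which is the lever the learning method adjusts.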

Virtual Leader-Based Pinning Algorithm
At the same time, we employ a virtual leader-based clustering pinning algorithm, which plays a crucial role in achieving coordinated behavior among multiple drones. The virtual leader is positioned within the space and migrates along a predetermined trajectory line. It serves as a reference point for the cluster, enabling coordination and synchronization among individual units. The migration of the virtual leader serves a dual purpose: it provides both forward pinning and restriction for the formation of the swarm simultaneously (as depicted in Figure 1). This dynamic interaction with the virtual leader facilitates on-the-fly decision making for drones, enhancing their ability to navigate and perform tasks effectively.
Moreover, the pinning algorithm not only calculates the optimal position of the virtual leader but also dynamically adapts the positions of all agents. The algorithm considers multiple factors, including environmental conditions, obstacles, and desired objectives, to ensure the efficient operation and adaptability of the cluster in response to changing circumstances. By continuously computing and adjusting these positions, the algorithm optimizes the overall performance of the cluster, thereby enhancing its ability to accomplish complex tasks. This dynamic approach enables the swarm to navigate challenging environments and respond to evolving situations.

• Initialization: Set the initial positions and velocities of all drones in the swarm, ensuring a suitable distribution across the desired workspace, and define the desired trajectory or path for the swarm to follow, considering any specific objectives or constraints. We also need to determine the parameters for communication and coordination among the drones, such as the range and frequency of wireless communication and the method for exchanging information between drones. These parameters facilitate practical cooperation and synchronization within the swarm.

• Virtual Leader Update: Assume that $N$ drones form a swarm, where the position of each drone is represented by $(x_i, y_i)$ and the position of the virtual leader is represented by $(x_v, y_v)$. The pinning algorithm calculates the position of the virtual leader as a weighted average of the coordinates of all drones [29]. By continuously updating the position of the virtual leader, the other drones can adjust their behavior based on the motion of the virtual leader, enabling coordinated movement within the swarm.

• Communication and Coordination: We need an information exchange that allows drones to adjust their trajectories and align with the virtual leader. One commonly used algorithm is the Distributed Average Consensus algorithm. This algorithm enables the drones to converge towards a common trajectory by iteratively updating their own trajectory based on the information received from neighbors. Each drone maintains a local estimate of the desired trajectory, denoted as $x_i(k)$, where $i$ represents the index of the drone and $k$ denotes the iteration step. The update equation for each drone's local estimate can be expressed as:

$$x_i(k+1) = x_i(k) + \sum_{j \in N_i} w_{ij} \left( x_j(k) - x_i(k) \right)$$

Here, $N_i$ represents the set of neighboring drones of drone $i$, $w_{ij}$ represents the weight associated with the communication link between drone $i$ and drone $j$, and the term $(x_j(k) - x_i(k))$ represents the difference between the trajectory estimates of drone $j$ and drone $i$.
The weights $w_{ij}$ can be determined based on different criteria, such as distance, connectivity strength, or predefined values. Common weight assignment strategies include uniform weights, distance-based weights, and dynamically adjusted weights. The consensus algorithm iteratively updates each drone's local estimate by considering the differences between its estimate and the estimates of its neighbors. This iterative process allows the drones to converge towards a common trajectory.
• Trajectory Adjustment: Let $P_d$ be the position of the drone and $P_{vl}$ be the position of the virtual leader. The desired direction points from $P_d$ towards $P_{vl}$. Each drone also tries to avoid collisions with neighboring drones while coordinating its movement: let $P_n$ be the position of a neighboring drone; each neighbor contributes a repulsion term directed away from $P_n$, and the total repulsion is the sum of these terms over all neighboring drones. The overall direction combines the desired direction towards the virtual leader and the total repulsion as a weighted sum, where the weights $w_1$ and $w_2$ can be adjusted according to the relative importance of aligning with the virtual leader versus avoiding collisions.
By following this algorithm, the pinned drones can move in a coordinated manner, effectively pulling or moving objects as a team. The virtual leader acts as a central point of reference, guiding the group towards the desired trajectory.
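The three update steps of the pinning algorithm can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the uniform default weights for the leader position are an assumption, and the inverse-distance form of the neighbor repulsion is an assumed stand-in for the paper's repulsion term.

```python
import numpy as np

def virtual_leader(positions, weights=None):
    """Weighted average of drone coordinates; uniform weights if none are given."""
    return np.average(positions, axis=0, weights=weights)

def consensus_step(x, neighbors, w):
    """One distributed average consensus update of the local estimates x."""
    x_next = x.copy()
    for i, N_i in neighbors.items():
        x_next[i] = x[i] + sum(w[i, j] * (x[j] - x[i]) for j in N_i)
    return x_next

def steering_direction(p_d, p_vl, neighbor_positions, w1, w2):
    """Weighted sum of the pull towards the leader and the neighbor repulsion."""
    to_leader = p_vl - p_d
    norm = np.linalg.norm(to_leader)
    desired = to_leader / norm if norm > 0 else np.zeros_like(p_d)
    repulsion = np.zeros_like(p_d)
    for p_n in neighbor_positions:
        away = p_d - p_n
        # Assumed form: repulsion decays with the inverse of the distance.
        repulsion += away / np.dot(away, away)
    return w1 * desired + w2 * repulsion

leader = virtual_leader(np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]))

x = np.array([0.0, 4.0])
w = np.array([[0.0, 0.5], [0.5, 0.0]])
x_next = consensus_step(x, {0: [1], 1: [0]}, w)

direction = steering_direction(np.array([0.0, 0.0]), np.array([2.0, 0.0]),
                               [np.array([0.0, 1.0])], w1=1.0, w2=0.5)
```

In the two-drone consensus example both estimates meet at the midpoint after a single step because each link weight is 0.5; smaller weights would converge more gradually.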

Value-Based Reinforcement Learning Methods
In general, the process of reinforcement learning involves two objects: the environment and the agents. It also contains four basic elements, which are states, actions, rewards, and state transfer functions. To build a model for the swarm system and train it using Q-learning, we need to define the state space, action space, and reward function:
• State Space: The state space represents the current state of the swarm system, which includes relevant information about the environment and the swarm's configuration. It can include parameters such as the positions and velocities of individual drones, the position and velocity of the virtual leader, the distances between drones, and any other relevant variables.
• Action Space: The action space represents the available actions that the swarm system can take at each state. In this case, the action space consists of different combinations of the cohesion parameter $k_{coh}$ and the repulsion parameter $k_{rep}$. Each combination represents a potential configuration for the swarm system.
• Reward Function: The reward function quantifies the desirability or quality of a particular state-action pair. It provides feedback to the swarm system on the goodness or badness of its actions.
Using Q-learning, the swarm system can learn to select the optimal cohesion and repulsion parameters based on the current state and learn from the feedback provided by reward signals. Q-learning is a reinforcement learning algorithm that uses a Q-table to store and update the expected rewards for each state-action pair. Through an iterative process of exploration and exploitation, the swarm system learns the optimal policy that maximizes the cumulative rewards over time.
To evaluate the performance of the proposed method, extensive simulation experiments can be conducted. The swarm system can be trained using Q-learning with various combinations of parameters and reward functions. Performance metrics, such as the coverage efficiency and collision avoidance rate, can be measured and compared across different parameter settings. By analyzing the results, the optimal values of $k_{coh}$ and $k_{rep}$ can be determined, which maximize the performance of the swarm system in achieving its objectives.
In reinforcement learning, the interaction between the agents and the environment can be summarized as follows: (1) The environment generates its state $s \in S$ at the current moment and passes it to the agents, where $S$ represents the state space. The state $s$ describes the features of the current environment and contains the information needed to support decision making by the agents; (2) the agents generate actions $a \in A$ based on the state $s$ and their decision-making strategy $\pi(a|s)$, where $A$ denotes the action space of the agents. Based on the states and actions, the environment generates its state at the next moment $s'$ according to the state transfer function $P(s'|s, a)$. At the same time, the agents receive a reward signal $r$ from the environment, reflecting the gain obtained from their decisions. This complete interaction process is known as a Markov Decision Process (MDP); its defining characteristic is that the state $s'$ is determined only by the state and action of the previous moment, independent of earlier states and actions. The goal of reinforcement learning is to maximize the cumulative rewards of the interaction between the agents and the environment [30,31]

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right]$$

where $\gamma$ is the discount factor. To achieve the goal, it is necessary to define the state-value function $V^{\pi}(s)$ to evaluate the goodness of the current state, which is expressed as the expected future cumulative reward from the current state $s$ under strategy $\pi$

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s \right] \tag{12}$$

The larger the expectation, the more advantageous the current state is for completing the task. Similarly, considering the action $a$ performed in the current state according to policy $\pi$, the above state-value function can be written in the form of a state-action-value function as follows

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s, a_t = a \right] \tag{13}$$

The value function-based reinforcement learning approach estimates values (12) and (13) based on interaction experience. Decisions are made based on the value function estimates. For example, given an action-value function estimate $Q(s, a)$, the $\epsilon$-greedy algorithm can be adopted to obtain an implicit strategy that selects the action with the largest action-value function most of the time

$$a = \begin{cases} \arg\max_{a' \in A} Q(s, a') & \text{with probability } 1 - \epsilon \\ \text{a uniformly random action in } A & \text{with probability } \epsilon \end{cases}$$

To obtain an estimate of the value function, the state-value function can be written in the following form

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ r_t + \gamma V^{\pi}(s_{t+1}) \,\middle|\, s_t = s \right]$$

Correspondingly, the state-action-value function can be written as

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \,\middle|\, s_t = s, a_t = a \right]$$

The above equation is known as the Bellman equation and expresses the relationship between the value of the state (state-action) and that of the subsequent state (state-action). Assuming that there exists an optimal value function $V^{*}(s)$, the Bellman equation in the form of the state-value function can be written as the Bellman optimality equation

$$V^{*}(s) = \max_{a \in A} \mathbb{E}\left[ r_t + \gamma V^{*}(s_{t+1}) \,\middle|\, s_t = s, a_t = a \right]$$

Similarly, the Bellman optimality equation in the form of a state-action-value function is

$$Q^{*}(s, a) = \mathbb{E}\left[ r_t + \gamma \max_{a' \in A} Q^{*}(s_{t+1}, a') \,\middle|\, s_t = s, a_t = a \right]$$

The optimal strategy can be solved from the Bellman optimality equation for the state-action-value function, which is expressed as follows

$$\pi^{*}(s) = \arg\max_{a \in A} Q^{*}(s, a)$$

To obtain the optimal value function estimate, the idea of dynamic programming is usually used. Each new estimate of the value function is computed from the current estimate, thus transforming the Bellman optimality equation into an iterative update of the form

$$V_{k+1}(s) = \max_{a \in A} \mathbb{E}\left[ r_t + \gamma V_{k}(s_{t+1}) \,\middle|\, s_t = s, a_t = a \right]$$

Correspondingly, for the state-action-value function, the updated expression is

$$Q_{k+1}(s, a) = \mathbb{E}\left[ r_t + \gamma \max_{a' \in A} Q_{k}(s_{t+1}, a') \,\middle|\, s_t = s, a_t = a \right]$$

Theoretically, it can be shown that $Q(s, a)$ eventually converges to the optimal form after the above value function iterations [32]. Researchers have proposed a class of Temporal Difference (TD) methods that use the Markov property to update the value function estimate at each new moment of state entry. The iterative update strategy for this class of methods is shown in the following equation

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right]$$

where $\alpha$ is the learning rate and $r_t + \gamma V(s_{t+1}) - V(s_t)$ is referred to as the TD error for a single-step decision. The Q-learning algorithm takes the TD objective $r_t + \gamma \max_{a \in A} Q(s_{t+1}, a)$ as an estimate of the optimal state-action function. The Q-learning procedure is shown in Algorithm 1, where actions are selected with the $\epsilon$-greedy strategy.
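The iterative form of the Bellman optimality equation can be checked by hand on a very small problem. The sketch below uses a made-up deterministic two-state MDP (the transition table, rewards, and discount factor are illustrative assumptions, not data from the paper) and repeatedly applies the state-action-value update until it settles.

```python
import numpy as np

# Hypothetical deterministic MDP: (state, action) -> (next_state, reward).
transitions = {
    (0, 0): (0, 0.0),  # state 0, stay
    (0, 1): (1, 1.0),  # state 0, move to state 1
    (1, 0): (1, 2.0),  # state 1, stay
    (1, 1): (0, 0.0),  # state 1, move to state 0
}
gamma = 0.5

Q = np.zeros((2, 2))
for _ in range(100):
    Q_new = np.zeros_like(Q)
    for (s, a), (s_next, r) in transitions.items():
        # Q_{k+1}(s, a) = r + gamma * max_a' Q_k(s', a') for deterministic transitions.
        Q_new[s, a] = r + gamma * Q[s_next].max()
    Q = Q_new

V = Q.max(axis=1)          # optimal state values
policy = Q.argmax(axis=1)  # greedy policy with respect to Q
```

Solving the fixed point by hand gives $V^{*}(1) = 2 / (1 - 0.5) = 4$ and $V^{*}(0) = 1 + 0.5 \cdot 4 = 3$, which the iteration reproduces; the greedy policy moves from state 0 to state 1 and then stays there.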

Algorithm 1 Q-learning algorithm
Input: Environment ($E$); Action space ($A$); Initial state ($x_0$); Discount factor ($\gamma$); Learning rate ($\alpha$).
Output: Policy $\pi$.
1: $Q(x, a) = 0$, $\pi(x, a) = 1/|A(x)|$;
2: $x = x_0$;
3: for $t = 0, 1, \ldots$ do
4: $a = \pi^{\epsilon}(x)$;
5: $r, x'$ = the reward and the next state obtained by performing action $a$ in $E$;
6: $Q(x, a) = Q(x, a) + \alpha \left( r + \gamma \max_{a'} Q(x', a') - Q(x, a) \right)$;
7: $\pi(x) = \arg\max_{a'} Q(x, a')$;
8: $x = x'$;
9: end for
The pseudocode above outlines the Q-learning algorithm, which is a model-free reinforcement learning algorithm used to learn the value of an action in a particular state. Below is a detailed explanation of the pseudocode.
Inputs:
• Environment ($E$): The setting in which the agent operates.
• Action space ($A$): All possible actions the agent can take.
• Initial state $x_0$: The starting point of the agent in the environment.
• Discount factor $\gamma$: The degree to which future rewards are diminished compared to immediate rewards.
• Learning rate $\alpha$: How much new information overrides old information.
Output:
• Policy $\pi$: A strategy that the agent follows, mapping states to the best action to perform in that state.
Algorithm Steps:
1. Initialization: $Q(x, a) = 0$: The Q-value for all state-action pairs is initialized to zero. $\pi(x, a) = 1/|A(x)|$: The policy is initialized to a uniform distribution, where each action is equally probable if there are $|A(x)|$ actions possible in state $x$.
2. Set initial state: $x = x_0$: The agent starts at the initial state $x_0$.
3. Learning Loop: This loop continues indefinitely, iterating over each time step $t$.
• Action Selection: $a = \pi^{\epsilon}(x)$: An action $a$ is chosen using the $\epsilon$-greedy policy derived from the current policy $\pi$. This selects the best action most of the time, but occasionally a random action to explore the environment.
• Interaction: The agent performs action $a$, receives a reward $r$, and transitions to a new state $x'$.
• Action Value Update: $Q(x, a) = Q(x, a) + \alpha \left( r + \gamma \max_{a'} Q(x', a') - Q(x, a) \right)$: The Q-value of the current state-action pair $(x, a)$ is updated using the Bellman equation, incorporating the learning rate $\alpha$, the received reward $r$, the discount factor $\gamma$, and the maximum Q-value of the subsequent state $x'$.
• Policy Update: $\pi(x) = \arg\max_{a'} Q(x, a')$: The policy is updated to choose the action with the highest Q-value for state $x$.
• State Transition: $x = x'$: Update the state to the new state $x'$.
The loop represents the process of the agent interacting with the environment, receiving feedback in the form of rewards, and updating its policy to maximize those rewards over time. As the algorithm proceeds, the Q-values converge towards the optimal values and the policy $\pi$ converges towards the optimal policy, balancing exploration and exploitation.
For some simple tasks where the state and action spaces are low-dimensional discrete spaces, $Q(s, a)$ can be represented as a two-dimensional table. The rows of the table represent states and the columns represent actions, and each cell stores the value of the corresponding state-action pair. Based on the idea of dynamic programming, the values in the table can converge to optimal values through the value iteration described above. In each decision, the action with the highest value in the current state is selected. However, for more complex tasks with higher-dimensional state and action spaces, the table structure is no longer applicable. Such tasks require a parametric model, such as a neural network, to represent the value function; the model parameters are then adjusted with optimization methods such as gradient descent, with the objective of minimizing the TD error, so that the model gradually approximates the optimal value function.
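The steps of the loop can be traced in a small tabular example. The sketch below runs Q-learning on a hypothetical five-state corridor (the environment, its reward of 1 at the right end, the episode structure, and all hyperparameter values are assumptions chosen for illustration; the paper's swarm environment is not reproduced here). Ties in the greedy choice are broken at random so that early exploration is not biased towards one action.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)   # action 0 = left, action 1 = right
GOAL = N_STATES - 1

def step(x, a):
    """Toy environment: move along the corridor, reward 1 on reaching the goal."""
    x_next = max(0, x - 1) if a == 0 else min(GOAL, x + 1)
    r = 1.0 if x_next == GOAL else 0.0
    return r, x_next

def greedy(Q, x, rng):
    """Action with the highest Q-value in state x, ties broken at random."""
    best = max(Q[x])
    return rng.choice([a for a in ACTIONS if Q[x][a] == best])

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]  # Q(x, a) = 0
    for _ in range(episodes):
        x = 0                                               # x = x_0
        for _ in range(100):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = greedy(Q, x, rng)
            r, x_next = step(x, a)
            # Move Q(x, a) towards the TD objective r + gamma * max_a' Q(x', a').
            Q[x][a] += alpha * (r + gamma * max(Q[x_next]) - Q[x][a])
            x = x_next
            if x == GOAL:
                break
    # Greedy policy read from the learned table.
    policy = [max(ACTIONS, key=lambda a: Q[x][a]) for x in range(N_STATES)]
    return Q, policy

Q, policy = q_learning()
```

After training, the greedy policy moves right in every non-terminal state, and the learned value of the last step before the goal approaches the reward of 1.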
In this paper, we can describe the simulation scenario and the different combinations of parameters as the state space S and the action space A, respectively.
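One way to lay out such a Q-table is to index rows by scenario and columns by parameter combination. The sketch below is illustrative only: the scenario labels and the candidate values of $k_{coh}$ and $k_{rep}$ are hypothetical placeholders, not the values used in the paper.

```python
from itertools import product
import numpy as np

# Hypothetical candidate parameter values and scenario labels.
k_coh_values = [0.1, 0.5, 1.0]
k_rep_values = [0.5, 1.0, 2.0]
scenarios = ["open_space", "single_obstacle", "narrow_passage"]

# Action space A: every (k_coh, k_rep) combination.
actions = list(product(k_coh_values, k_rep_values))

# Q-table with one row per state (scenario) and one column per action.
q_table = np.zeros((len(scenarios), len(actions)))

def best_parameters(scenario):
    """Return the (k_coh, k_rep) pair with the highest Q-value for a scenario."""
    s = scenarios.index(scenario)
    return actions[int(np.argmax(q_table[s]))]

# Example: pretend training found that (0.5, 2.0) works best in a narrow passage.
q_table[scenarios.index("narrow_passage"), actions.index((0.5, 2.0))] = 1.0
chosen = best_parameters("narrow_passage")
```

Reading the table row by row then yields one parameter pair per scenario, which can be passed to the Boids model.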

Simulation Scenarios and Results
In this paper, our primary focus is on studying a swarm system consisting of 32 agents. Our investigation centers on the dynamic manipulation of the cohesion and repulsion parameters. By dynamically adjusting these parameters, we create a versatile simulation scenario that enables continuous obstacle avoidance and maximizes spatial coverage. The effectiveness of this approach is depicted in Figure 2, where the swarm navigates through obstacles while efficiently covering the available space.
To facilitate the learning process and decision making of the agents, we employ a Q-table as a critical component of our methodology. The Q-table, as illustrated in Figure 3a, is a tabular representation of the state-action pairs and their corresponding Q-values. It enables the agents to make informed decisions based on the knowledge and experience accumulated during the learning process. The reward function in our approach is designed to incentivize desirable behaviors and penalize undesirable ones. Specifically, we define two components of the reward function:

• Obstacle avoidance reward: This component rewards agents for successfully navigating past obstacles. The reward is computed as the sum of the distances among agents when they successfully avoid obstacles. By encouraging agents to maintain a safe distance from obstacles, this component promotes effective obstacle avoidance behavior. Conversely, if agents fail to surmount obstacles, a penalty of −1 is incurred, discouraging collisions or unsuccessful navigation attempts.

• Diffusion reward: This component focuses on maintaining a safe distance from walls and adhering to the expansion criteria. When agents maintain a safe distance from walls and meet the expansion criteria, the reward is calculated as the cumulative inter-agent distance. This encourages agents to spread out evenly within the swarm and avoid clustering near walls. On the other hand, if agents make contact with walls or do not meet the expansion criteria, a penalty of −1 is imposed, discouraging undesired behavior such as wall collisions or failure to achieve the desired expansion.
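The two reward components can be sketched as follows. The function names are hypothetical, the success conditions are passed in as booleans because the text does not specify how they are detected, and counting each pair of agents once in the distance sum is an assumption.

```python
import numpy as np

def cumulative_inter_agent_distance(positions):
    """Sum of Euclidean distances over all unordered agent pairs."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Keep the strict upper triangle so each pair is counted once.
    return float(np.triu(dist, k=1).sum())

def obstacle_avoidance_reward(positions, avoided_obstacles):
    """Distance sum on success, penalty of -1 otherwise."""
    if avoided_obstacles:
        return cumulative_inter_agent_distance(positions)
    return -1.0

def diffusion_reward(positions, clear_of_walls, expansion_met):
    """Distance sum when walls are avoided and expansion is met, else -1."""
    if clear_of_walls and expansion_met:
        return cumulative_inter_agent_distance(positions)
    return -1.0

# Three agents forming a 3-4-5 right triangle.
triangle = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
r_ok = obstacle_avoidance_reward(triangle, avoided_obstacles=True)
r_fail = diffusion_reward(triangle, clear_of_walls=False, expansion_met=True)
```

Because the positive reward grows with the spread of the swarm, parameter pairs that expand the formation without collisions score higher than those that keep it compact.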
Our strategy for implementing the $\epsilon$-greedy algorithm is designed to optimize learning efficiency throughout 100 training sessions, with the current training session denoted as $num$. We employ the following approach:
• Exploration in Early Stages: During the initial training phases, when $num$ is small, we prioritize exploration by assigning a relatively high value to $\epsilon$. This exploration-centric approach allows the algorithm to explore the environment thoroughly, facilitating the discovery of potentially optimal solutions.
• Exploitation in Later Stages: As the training progresses and the rewards associated with different combinations of cohesion and repulsion parameters become well estimated, excessive exploration becomes unnecessary. Therefore, we gradually reduce the value of $\epsilon$ as the number of training sessions increases. To achieve this, we use the floor function, denoted as $\lfloor \cdot \rfloor$, which reduces $\epsilon$ in equal steps. By decreasing $\epsilon$, we shift the focus towards exploiting the learned knowledge, enabling more informed decision making.
This strategic adjustment of $\epsilon$ strikes a balance between exploration and exploitation throughout the training process. It allows the algorithm to explore the solution space effectively in the early stages, while gradually transitioning towards exploiting the acquired knowledge in later stages. This approach improves the learning efficiency of the $\epsilon$-greedy algorithm, leading to improved performance and convergence towards optimal solutions:

$$\epsilon = 1 - 0.1 \left\lfloor \frac{num}{10} \right\rfloor \tag{24}$$

where $\lfloor \cdot \rfloor$ denotes the floor function. It is defined by $\lfloor x \rfloor = \max\{m \in \mathbb{Z} \mid m \le x\}$, where $x$ is a real number and $\mathbb{Z}$ denotes the set of integers.
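The schedule in Equation (24) is a step function of the session index. A minimal sketch, assuming sessions are numbered 1 to 100 (the function name `epsilon_schedule` is hypothetical):

```python
def epsilon_schedule(num: int) -> float:
    """Exploration rate for training session num, following Equation (24)."""
    # Integer division implements the floor of num / 10 for non-negative num.
    return 1.0 - 0.1 * (num // 10)

# Exploration rate at a few sessions across training.
samples = {num: round(epsilon_schedule(num), 1) for num in (1, 9, 10, 55, 99, 100)}
```

Under this numbering, sessions 1 to 9 are fully exploratory, each block of ten sessions lowers $\epsilon$ by 0.1, and the final session is fully greedy.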
By using the Q-table values depicted in Figure 3b and by training the Q-learning network, we can determine the optimal action for each state, which corresponds to the optimal parameters of cohesion and repulsion under different obstacle scenarios. These parameter sets (the real training parameters are in Appendix A) are subsequently incorporated into the Boids model for simulation. The performance of the Q-learning algorithm can be highly sensitive to its hyperparameters, such as the learning rate, discount factor, and policy exploration parameters. Tuning these can be non-trivial and may require extensive experimentation.
The simulation results, as shown in Figure 4, provide evidence of the effectiveness of our approach. The agents navigate through obstacles while simultaneously maintaining an optimal inter-agent distance and maximizing their spread within the given spatial constraints. This demonstrates that the agents can handle complex navigation tasks. In simulation experiments, Q-learning typically requires discretizing the state and action spaces. This discretization can lead to a loss of fidelity, as continuous nuances in agent behavior and environment states may not be captured. We also compare our approach with the Deep Q-learning method [33,34]. Deep reinforcement learning can provide adaptive and intelligent decision-making capabilities for optimizing the network performance. However, it may not directly address swarm behavior or coordination among drones, which distinguishes it from the RL-Boids approach.

Real Robot Experiments
The experimental section of this research aims to validate the efficacy of the proposed reinforcement learning-based approach for drone swarms. Therefore, a real-world experiment is conducted with three drones in a controlled environment. The overarching objective of this experiment is to demonstrate the capability of drone swarms to navigate a predefined space while avoiding obstacles and optimizing coverage.
In this experiment, three identical drones are used as the robotic agents. These drones carry various sensors, including wireless communication modules and inertial measurement units (IMUs). These sensors are pivotal in enabling the drones to perceive their surroundings accurately. The drones are of a quadcopter design, which is well suited to stability and maneuverability in various directions. A central computer system monitors the drones' movements and handles heavy computation if the drones' onboard computers are not capable of real-time processing. The RL algorithm is pre-trained in simulations with similar obstacle configurations to bootstrap the learning process. Hyperparameters are adjusted based on preliminary tests to balance exploration and exploitation.
The experimental environment consists of an open indoor space with five obstacles composed of non-reflective material to avoid sensor distortion. These obstacles are placed to emulate real-world scenarios where the drones must navigate around static objects, mirroring challenges encountered in practical settings such as urban environments or complex terrains. The experiment area is rigged with a motion capture system to provide external validation of the drones' movements and the system's internal measurements. The primary mission for the drones is to arrange themselves in a circular formation while avoiding collisions with the obstacles. A schematic diagram of the experimental setup is shown in Figure 5.
To achieve the desired circular formation, the drones use a virtual leader-follower approach. The clustering pinning algorithm calculates the optimal position, which serves as the virtual leader. The drones follow its movements while maintaining the desired distances and angles. This formation control is crucial for successfully executing expansion and contraction maneuvers.
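The geometry of a circular formation around a virtual leader can be sketched as follows. The clustering pinning algorithm itself is not reproduced here. The centroid is used only as an assumed stand-in for the virtual leader position, and the function names are hypothetical.

```python
import math

def virtual_leader(positions):
    """Assumed stand-in for the pinning result: centroid of the drones."""
    n = len(positions)
    return (sum(p[0] for p in positions) / n,
            sum(p[1] for p in positions) / n)

def circle_slots(leader, radius, n_drones, phase=0.0):
    """Evenly spaced target positions on a circle centred on the leader."""
    slots = []
    for k in range(n_drones):
        angle = phase + 2 * math.pi * k / n_drones
        slots.append((leader[0] + radius * math.cos(angle),
                      leader[1] + radius * math.sin(angle)))
    return slots

drones = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
leader = virtual_leader(drones)
targets = circle_slots(leader, radius=1.0, n_drones=len(drones))
```

Under these assumptions, expansion and contraction reduce to increasing or decreasing `radius`, while moving the virtual leader translates every target slot by the same offset.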
The core of the experiment is the integration of reinforcement learning (RL) techniques with the body model of the drones. RL algorithms enable the drones to learn and adapt their behaviors based on interactions with the environment. They learn to choose the right direction of motion to avoid obstacles and maintain the circular formation.
Simultaneously, the drones aim to maximize the coverage of the available space. They expand and contract their formation dynamically, adjusting their positions relative to the virtual leader and obstacles. This dynamic adaptation allows them to explore and cover the entire space efficiently.
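One simple way to quantify coverage is the fraction of grid cells lying within a fixed sensing radius of at least one drone. This metric, the sensing radius, and the grid resolution are assumptions made for illustration. The paper does not specify how coverage is measured.

```python
import numpy as np

def coverage_fraction(positions, sense_radius, width, height, cell=0.1):
    """Fraction of cell centres within sense_radius of any drone.

    The area is the rectangle [0, width] x [0, height], sampled at the
    centres of square cells of side `cell` (an assumed resolution).
    """
    xs = np.arange(cell / 2, width, cell)
    ys = np.arange(cell / 2, height, cell)
    gx, gy = np.meshgrid(xs, ys)
    covered = np.zeros(gx.shape, dtype=bool)
    for px, py in positions:
        covered |= (gx - px) ** 2 + (gy - py) ** 2 <= sense_radius ** 2
    return float(covered.mean())

# A contracted formation overlaps heavily; an expanded one covers more area.
tight = [(2.0, 2.0), (2.1, 2.0), (2.0, 2.1)]
spread = [(1.0, 1.0), (3.0, 1.0), (2.0, 3.0)]
cov_tight = coverage_fraction(tight, 1.0, 4.0, 4.0)
cov_spread = coverage_fraction(spread, 1.0, 4.0, 4.0)
```

A score of this kind could serve as one term of a reward signal that encourages expansion in open space, with a separate penalty near obstacles. This is a possible design and not the reward used in the paper.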
The experimental results demonstrate the effectiveness of the proposed approach. As shown in Figure 6, the drones successfully form a circular swarm, avoid obstacles, and maximize coverage of the experimental space. The RL-based control system exhibits adaptability and responsiveness in real time, ensuring smooth navigation. Throughout the experiment, the drones consistently avoid collisions with the obstacles. The reinforcement learning algorithms enable them to learn from their interactions, improving their obstacle-avoidance capabilities.
The swarm of drones effectively covers the entire experimental space, demonstrating the potential for such intelligent expansion and contraction strategies in various applications, such as surveillance, search and rescue, and environmental monitoring.

Conclusions
This paper proposes a Q-learning network training method to optimize the cohesion and repulsion parameters of the simplified Boids model. The optimization aims to achieve continuous obstacle avoidance and maximum space coverage of drones in the simulation environment. Using this method, we obtain through training the theoretically optimal values of the given parameters in the set scenario. The RL-Boids method proposed in this paper provides a reference for related research in the field of drone swarms. In future research, our group will explore better reinforcement learning methods based on the Boids model and carry out further experimental validation with drones.

Figure 1. Diagram of the cohesion and repulsion control model.

Figure 2. Improved Boids model based on Q-learning network.

Figure 4. Framework of RL-Boids. The upper part describes the workflow of the algorithm, and the lower part shows the simulation results. The dots in (a-i) represent drones. (b,d,f,h) show the virtual leader pulling the drones forward through traction. (a,c,g) show the contraction of the drone swarm. (e,i) show the expansion of the drone swarm.

Figure 5. Schematic diagram of the indoor real-drone experiment.

Figure 6. Simulation and indoor real-world experiments for expansion and contraction scalability. (a-d) each combine a real-world experiment and a simulation. In the top half, the drones are circled in red; the bottom half shows the simulation of the corresponding state. (a,d) show the contraction of the drone swarm. (b,c) show the expansion of the drone swarm. The experimental video link is as follows: https://www.bilibili.com/video/BV1YN411H7K6 (accessed on 20 September 2023).