Wing Kinematics-Based Flight Control Strategy in Insect-Inspired Flight Systems: Deep Reinforcement Learning Gives Solutions and Inspires Controller Design in Flapping MAVs

Flying insects exhibit outstanding stability and control via continuous wing flapping even under severe disturbances in various conditions of wind gust and turbulence. While conventional linear proportional derivative (PD)-based controllers are widely employed in insect-inspired flight systems, they usually fail to deal with large perturbation conditions in terms of a 6-DoF nonlinear control strategy. Here we propose a novel wing kinematics-based controller, optimized via deep reinforcement learning (DRL), to stabilize bumblebee hovering under large perturbations. A high-fidelity OpenAI Gym environment is established by coupling a CFD data-driven aerodynamic model and a 6-DoF flight dynamic model. The control policy with an action space of 4 is optimized using the off-policy Soft Actor-Critic (SAC) algorithm with automatic entropy adjustment, and is verified to be feasible and robust in achieving fast stabilization of bumblebee hovering flight under full 6-DoF large disturbances. The 6-DoF wing kinematics-based DRL control strategy may provide an efficient autonomous controller design for bioinspired flapping-wing micro air vehicles.


Introduction
Flying insects exhibit outstanding stability and maneuverability under a wide array of disturbances such as wind gusts and turbulence. Although the small insect body is susceptible even to gentle disturbance, flapping-wing insects are able to recover from large deviations through continuous adjustments of wing kinematics within several wing-beat strokes [1][2][3]. The insect flight control system is a highly integrated, closed-loop system [4], in which the nonlinear dynamic system couples the motion equations for body dynamics and the Navier-Stokes equations for unsteady aerodynamics [5]. The nonlinear control strategy required for insect flight stabilization under large perturbations in full degrees of freedom remains an open problem for controller design.
Based on the assumption of rigid flapping-wing aerodynamics, instability in hovering flight has been reported for most flying insects [6][7][8]. With the mechanosensory and vision systems, the translational (forward/backward, lateral, and vertical) and rotational (roll, pitch, and yaw) deviations of an insect's body under disturbances can be detected, and further actively corrected via wing kinematics modulations with low latency. Previous studies based on linear control theory demonstrate the feasibility of the proportional derivative (PD) strategy for insect flight control. The linear PD controller is suggested to be an efficient tool for 1-degree-of-freedom (DoF) control [2,3,9,10] in various insect flights, as well as 3-DoF control for longitudinal motions [11] and body attitudes under small perturbations of 189.5°/s [12] in bumblebee hovering flight. It is also reported to be effective in full 6-DoF hovering control of bumblebee flight under both small (0.03 m/s) and large perturbations [13], in which the adjustment of the proportional and derivative control gains can be obtained based on a CFD data-driven aerodynamic model (CDAM) and a simplified flight dynamic model. However, several limitations still exist for the traditional control strategies. Firstly, the longitudinal and lateral equations tend to be decoupled and resolved to achieve longitudinal and sideways control separately [14,15], which neglects the coupling among the six degrees of freedom under large disturbances. Moreover, the linear assumption with a cycle-averaged model may not hold for some large perturbations, as nonlinearity exists in the correlations between the wing kinematics modulation and the production of aerodynamic forces and torques [14,16]. More importantly, the precise control parameters were determined through eigenvalue and eigenvector analyses [12,14], and even optimized using the Laplace transformation and root locus approach [11,13]. This requires time-consuming experiments for optimal parameter achievement, as well as prescribed implementation into the flight system before each task. Considering these limitations, a more feasible option for a bioinspired intelligent controller designed for large disturbance conditions, based on an autonomous deep reinforcement learning algorithm, needs to be further explored.
Flying animals tend to develop their control skills via a trial-and-error evolutionary process, which is consistent with the reinforcement learning (RL) [17] process of working out which behavior, through interaction with the environment, will maximize the rewards. Due to the nonlinear motions and continuous action-state spaces of biomimetic aerial vehicles, the deep reinforcement learning (DRL) controller has proven to give solutions for severe disturbance conditions and complex maneuvering tasks. Bøhn et al. [18] achieved attitude control of a fixed-wing UAV using the on-policy proximal policy optimization (PPO) deep reinforcement learning method. Fei et al. [19] presented a deep reinforcement learning control strategy trained using the off-policy deep deterministic policy gradient (DDPG) and achieved goal-directed maneuvering for flapping-wing MAVs. Other challenging fields, from games to robotics, have employed a variety of state-of-the-art RL algorithms [20][21][22][23][24][25]. Haarnoja et al. [26] developed the SAC algorithm embedded with an automatic gradient-based temperature tuning method, which achieves better performance without hyperparameter tuning across various tasks compared with other on-policy and off-policy algorithms. Combining the wing kinematics-based flight control strategy with the deep reinforcement learning approach may allow control of this highly coupled and nonlinear flight system without prior decoupling. Through multiple explorations, the DRL controller is likely to show advantages in the fast achievement of a control policy for 6-DoF flight stabilization even under large perturbations, with no requirements on a prescribed database or parameter determination.
Here we propose a novel wing kinematics-based controller optimized using deep reinforcement learning (DRL) for bumblebee hovering stabilization under large perturbations. We establish a high-fidelity OpenAI Gym [27] environment by coupling a CFD data-driven aerodynamic model and a 6-DoF flight dynamic model. The control policy with an action space of 4 is optimized using the off-policy Soft Actor-Critic (SAC) algorithm with automatic entropy adjustment. Benchmark tests are conducted to investigate the feasibility of the wing kinematics-based DRL control strategy for achieving fast stabilization under full 6-DoF large disturbances in bumblebee hovering. Further analysis of the control performance demonstrates the superiority of the deep reinforcement learning strategy compared to traditional linear strategies, which provides an efficient autonomous controller design for bioinspired flapping-wing micro air vehicles.

Morphological and Kinematic Bumblebee Models
A wing-body model of the bumblebee (Bombus terrestris) is depicted in Figure 1a, whose body mass m_b is 391 mg, body length L is 21 mm, wing mass m_w is 0.76 mg, wing length R is 15.2 mm, and mean chord length c_m is 4.1 mm. The kinematic model of a hovering bumblebee is built based on the experimental observations of Kolomenskiy et al. [28] and is defined by three angles, each expressed as the first three terms of a Fourier series with respect to the stroke plane (Figure 1b): the positional angle ϕ, the elevation angle θ, and the feathering angle α. The positional angle is the projection of the sweep angle of the rotation axis within the stroke plane, the feathering angle is the geometric angle of attack around the rotation axis, and the elevation angle is the deviation angle between the stroke plane and the rotation axis. The wing-beat frequency f for bumblebee hovering flight is 136 Hz, and the initial stroke amplitude Φ is 139.36°. The stroke plane angle β is 0°, with the initial body angle χ equaling 45° for the hovering flight of bumblebees. For the rigid moving body, the attitude is described by three body angles, roll ρ, pitch χ, and yaw ψ, in which the roll angle ρ is the rotational angle about the body axis x_b, the pitch angle χ is defined as the body inclination angle with respect to the horizontal plane, and the yaw angle represents the rotational angle about the body axis z_b.
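As a sketch, each of the three wing angles can be evaluated from its truncated Fourier series as below; the coefficients used in the example are illustrative placeholders, not the fitted values from [28]:

```python
import numpy as np

def wing_angle(t, f, a0, a_coeffs, b_coeffs):
    """Evaluate one wing Euler angle from a truncated Fourier series.

    The positional (phi), elevation (theta), and feathering (alpha) angles
    are each expressed with the first three Fourier terms with respect to
    the stroke plane. t: time (s); f: wing-beat frequency (Hz);
    a0, a_coeffs, b_coeffs: mean term and cosine/sine coefficients (deg).
    """
    angle = a0
    for n, (a_n, b_n) in enumerate(zip(a_coeffs, b_coeffs), start=1):
        angle += a_n * np.cos(2 * np.pi * n * f * t) \
               + b_n * np.sin(2 * np.pi * n * f * t)
    return angle

# Illustrative (not measured) coefficients for the positional angle phi:
# a pure first-harmonic sweep with amplitude Phi/2 about a zero mean,
# at the bumblebee wing-beat frequency f = 136 Hz.
f = 136.0
Phi = 139.36  # stroke amplitude (deg)
phi_peak = wing_angle(0.0, f, 0.0, [Phi / 2, 0.0, 0.0], [0.0, 0.0, 0.0])
```

At t = 0 the cosine term is at its maximum, so `phi_peak` equals half the stroke amplitude for this single-harmonic placeholder.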


Figure 1. Morphological and kinematic models of the bumblebee [28], where the positional angle ϕ (red), elevation angle θ (blue), and feathering angle α (green) are expressed in a Fourier series.

Aerodynamic and Flight Dynamic Models for Bumblebee Hovering Flight
We construct a control environment in the framework of OpenAI Gym [27] to realistically reproduce the hovering flight of a bumblebee and provide fast responses during the learning process. A CFD data-driven aerodynamic model (CDAM) by Cai et al. [11] is employed for fast prediction of the aerodynamic forces and torques, combined with a flight dynamic model based on Cai and Liu [13] which can mimic motions under large perturbations. The CDAM consists of a CFD-informed quasisteady model based on the blade element method for the flapping wings and a simplified quasisteady approximation-based aerodynamic model for the moving body [11], which is a better alternative to time-consuming CFD simulations. The flight dynamic model of a bumblebee applicable to large deviations is built by deriving the full dynamic equations extended from Gebert et al. [29] and Sun et al. [30]. The flight dynamic model is able to mimic the bumblebee wing-body interactions, where the wing kinematics serve as inputs and the insect's motion can be solved in a fast and precise manner. In the dynamic equations of the moving body, m_b, m_w are the body and wing masses; I_bd is the 3 × 3 matrix of the body moment of inertia (I_b,xx = 2.2 × 10⁻⁹ kg·m², I_b,yy = 7.5 × 10⁻⁹ kg·m², I_b,zz = 7.7 × 10⁻⁹ kg·m²); I_w is the wing moment of inertia; R_hR, R_hL denote the vectors from the body center of mass to the wing bases; R_wgR, R_wgL denote the vectors from the wing bases to the wing centers of mass; and E_wR2b, E_wL2b, E_b2wR, and E_b2wL are the coordinate transformation matrices between the wing-fixed frames and the body-fixed frame. Detailed expressions of the other coefficients and of b_2R, b_2L are listed in Cai and Liu [13]. In the flapping-wing dynamic equations, M_b2R, M_b2L denote the torques between the thorax of the body and the right or left wing; detailed expressions of the coefficients C_vR, C_vL, C_oR, C_oL, C_wR, C_wL and c_R, c_L are listed in Cai and Liu [13]. We further apply two equations adding the wing kinematics-based control inputs, where E_dEulerR2sp, E_dEulerL2sp are the coordinate transformation matrices that transfer the time derivatives of the wing Euler angles to the stroke plane frame, and E_spR2b, E_spL2b are the coordinate transformation matrices converting a vector from the stroke plane frame to the body-fixed frame.
By integrating Equations (1)-(4), the bumblebee motion can be solved using the three wing kinematics inputs ϕ, θ, α. Detailed expressions of all the coefficients in dynamic Equations (1)-(4) for the body and two wings can be found in Cai and Liu [13].
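Closing the loop in simulation amounts to integrating these coupled ordinary differential equations once per time step, with the commanded wing angles as inputs. A minimal sketch of such a time stepper follows; the dynamics function here is a toy stand-in, since the full coefficient expressions of Equations (1)-(4) are given in Cai and Liu [13]:

```python
import numpy as np

def rk4_step(dyn, state, t, dt, u):
    """One fourth-order Runge-Kutta step of the flight dynamic equations.

    `dyn(t, state, u)` stands in for the full Equations (1)-(4); the
    coefficient expressions are those of Cai and Liu [13] and are not
    reproduced here. `u` holds the wing kinematics inputs (phi, theta, alpha).
    """
    k1 = dyn(t, state, u)
    k2 = dyn(t + dt / 2, state + dt / 2 * k1, u)
    k3 = dyn(t + dt / 2, state + dt / 2 * k2, u)
    k4 = dyn(t + dt, state + dt * k3, u)
    return state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Placeholder dynamics: exponential decay toward hover, purely illustrative.
def toy_dynamics(t, s, u):
    return -0.5 * s

s = np.array([1.0, 0.0])
for i in range(100):
    s = rk4_step(toy_dynamics, s, 0.01 * i, 0.01, None)
```

For the linear toy dynamics, 100 steps of size 0.01 should recover exp(−0.5) in the first state component to high accuracy, which is a convenient self-check of the integrator.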

Wing Kinematics-Based Controller Design
Cai and Liu [13] proposed a 6-DoF proportional derivative (PD) control strategy that directly tunes four wing kinematics parameters for bumblebee flight stabilization, leaving the x and y positions controlled indirectly by modifying the pitch and roll angles. Based on this successful trial, our controller design also selects four typical wing kinematics parameters to serve as the action space for deep reinforcement learning; the aerodynamic forces and torques induced by the wing kinematics variations are depicted in Figure 2: symmetric stroke amplitude variation ∆Φ will cause pitch torque T_y and vertical forces F_z; symmetric mean positional angle variation ∆ϕ may generate pitch torque T_y; and asymmetric stroke amplitude variation ∆Φ_RL and asymmetric mean feathering angle variation ∆α_RL between the right and left wings can induce yaw T_z and roll T_x torques.
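As an illustration of how a 4-D action could map onto left- and right-wing kinematics, here is a hypothetical helper; the function name and the even half-split of the asymmetric terms between the two wings are our assumptions, not the paper's exact parameterization:

```python
import numpy as np

def apply_action(base, action):
    """Map the 4-D action onto per-wing kinematic modulations.

    Follows the scheme of Figure 2: symmetric stroke-amplitude and
    mean-positional-angle changes act identically on both wings, while
    the asymmetric (right-left) stroke-amplitude and mean-feathering
    terms are split between the wings with opposite signs.
    base: dict of trimmed kinematics (deg);
    action: [dPhi, dphi_mean, dPhi_RL, dalpha_RL] (deg).
    """
    d_amp, d_pos, d_amp_rl, d_feath_rl = action
    return {
        "amp_R": base["amp"] + d_amp + d_amp_rl / 2,
        "amp_L": base["amp"] + d_amp - d_amp_rl / 2,
        "pos_mean_R": base["pos_mean"] + d_pos,
        "pos_mean_L": base["pos_mean"] + d_pos,
        "feath_mean_R": base["feath_mean"] + d_feath_rl / 2,
        "feath_mean_L": base["feath_mean"] - d_feath_rl / 2,
    }

trim = {"amp": 139.36, "pos_mean": 0.0, "feath_mean": 0.0}
k = apply_action(trim, [5.0, 2.0, 4.0, -6.0])
```

With a nonzero asymmetric term, the right and left wings receive opposite modulations, which is what generates the roll and yaw torques described above.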
Here, we propose a deep reinforcement learning (DRL) policy for insect-inspired flight control systems with the intention of achieving bumblebee hovering stabilization under large perturbations. The bumblebee behaviors are modeled as a Markov decision process (MDP) in continuous control. We build a state space with a dimension of 12 to observe the angular position, angular velocity, position, and velocity of the insect,

s_t = [ρ, χ, ψ, ω_x, ω_y, ω_z, x, y, z, v_x, v_y, v_z],

and an action space with a dimension of 4 to provide continuous manipulation of the wing kinematics of the bumblebee,

a_t = [∆Φ, ∆ϕ, ∆Φ_RL, ∆α_RL].

Figure 3 illustrates the schematic diagram of the wing kinematics-based bumblebee flight control system, where deep reinforcement learning gives solutions for controller design. The state transition generating s_{t+1} is achieved through our bumblebee environment based on the closed-loop flight dynamic model with a feedback controller. Since our flight control system requires continuous manipulation and an updated strategy at the beginning of each wing-beat stroke, we choose the popular off-policy actor-critic algorithm based on the maximum entropy RL framework, Soft Actor-Critic (SAC), to train the policy [26]. There are three key components in the SAC algorithm: separate policy and value function-based actor-critic networks, a high-efficiency data-reusing off-policy formulation, and stability- and exploration-encouraging entropy maximization. The state value function is written as

V(s_t) = E_{a_t∼π}[Q(s_t, a_t) − α log π(a_t|s_t)].

Thus, the Q value function based on the soft Bellman equation [25,26] is given by

Q(s_t, a_t) = r(s_t, a_t) + γ E_{s_{t+1}}[V(s_{t+1})],

where r is the one-step reward, E denotes the mathematical expectation, γ is the discount factor, and π is the adopted policy. Here, α controls how important the entropy term is, and is known as the temperature parameter. The SAC updates the policy to minimize the Kullback-Leibler (KL) divergence [25,26],

π_new = arg min_{π′∈Π} D_KL( π′(·|s_t) ‖ exp(Q^{π_old}(s_t, ·)/α) / Z^{π_old}(s_t) ),

where Π denotes the family of Gaussian distributions and Z represents the partition function for distribution normalization. The parameters θ of the soft Q-function are trained by minimizing [25,26]

J_Q(θ) = E_{(s_t, a_t)∼D}[ (1/2)(Q_θ(s_t, a_t) − (r + γ V_θ̄(s_{t+1})))² ],

where D is the replay buffer storing the transitions [s_t, a_t, r, s_{t+1}]. A soft update is performed in the target value network,

θ̄ ← τθ + (1 − τ)θ̄,

where τ denotes the step factor and θ̄ is an exponentially moving average of the weights. The policy network with parameter φ is updated by minimizing [25,26]

J_π(φ) = E_{s_t∼D}[ E_{a_t∼π_φ}[ α log π_φ(a_t|s_t) − Q_θ(s_t, a_t) ] ].

Since a suboptimal temperature may cause poor performance in maximum entropy RL [25], a constrained formulation for automatically tuning the temperature hyperparameter has been employed in SAC, removing the requirement for hyperparameter tuning in every task. The optimal temperature parameter α at every step can be learned by minimizing the objective function [25,26]

J(α) = E_{a_t∼π_t}[ −α log π_t(a_t|s_t) − α H_0 ],

where H_0 is the desired minimum expected entropy. The Soft Actor-Critic (SAC) with automatic entropy adjustment has been evaluated through a variety of benchmark and real-world robotics tasks [26], achieving outstanding asymptotic performance and sample efficiency compared with other off-policy and on-policy algorithms [20][21][22][23][24].
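The update rules above can be exercised on scalar examples. A minimal numpy sketch of the soft Bellman backup, the state value function, the soft target-network update, and the temperature gradient step (function names are ours, not taken from any specific library):

```python
import numpy as np

def soft_q_target(r, v_next, gamma=0.99):
    """Soft Bellman backup: Q(s,a) = r + gamma * E[V(s')]."""
    return r + gamma * v_next

def state_value(q, logp, alpha):
    """V(s) = E[Q(s,a) - alpha * log pi(a|s)] under the current policy."""
    return q - alpha * logp

def polyak_update(target_w, w, tau=0.005):
    """Soft update of the target value network: theta_bar <- tau*theta + (1-tau)*theta_bar."""
    return tau * w + (1.0 - tau) * target_w

def temperature_step(alpha, logp, target_entropy, lr=1e-3):
    """One gradient step on J(alpha) = E[-alpha*(log pi + H0)].

    dJ/dalpha = -(log pi + H0): alpha decreases when the policy entropy
    already exceeds the desired minimum H0, and rises otherwise.
    """
    grad = -(logp + target_entropy)
    return max(alpha - lr * grad, 0.0)
```

For example, with a current policy entropy of 5 nats (logp = −5) and target entropy H0 = −4 (the common −dim(A) heuristic for a 4-D action space), the temperature is nudged downward because exploration is already ample.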

Deep Reinforcement Learning Policy
The goal of the bumblebee flight control system is to restore the angular position and position to the initial equilibrium state after large angular velocity or velocity perturbations within several strokes of control. The reward is designed as a negative cost function composed of a stability cost and a control cost. The stability cost is defined as the errors between the current states and the target states, where e_p denotes the position errors ∆x, ∆y, and ∆z, and e_v denotes the velocity errors ∆v_x, ∆v_y, and ∆v_z, together with the attitude errors e_R and the angular velocity errors e_ω. The actions a_t are also included in the reward design as the control cost, to ensure stable wing kinematics and an equilibrium state in the trimmed hovering flight of the bumblebee. Note that all the quantities of time, length, velocity, mass, force, and torque in our simulation environment have been processed and expressed in dimensionless form, which leaves the bound values of the six reward terms with quite different orders of magnitude, ranging over O(10⁰)-O(10⁴). To ensure a relatively equivalent contribution from each reward component and to minimize the attitude, position, velocity, and control errors at the same time, we design the scaling parameters as

λ_p : λ_v : λ_R : λ_ω : λ_a : λ_ȧ = 10⁰ : 10⁴ : 10⁰ : 10⁴ : 10⁰ : 10⁰,

to balance and scale the differences in the orders of magnitude of these nondimensional values. Through a variety of training verifications, we found that further precise adjustment of each parameter does not largely enhance the training performance, which demonstrates that the current scaling-based parameters in the reward design are rational for learning achievement. The reinforcement learning of the SAC algorithm has advantages in the fast achievement of a control policy via exploration, with no requirements on a prescribed database or the precise determination of control parameters.
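A sketch of this reward under the assumption of quadratic error terms (the section does not print the exact functional form, so the squared-norm costs below are illustrative; only the six components and their scaling ratios are taken from the text):

```python
import numpy as np

# Scaling ratios lambda_p : lambda_v : lambda_R : lambda_omega :
# lambda_a : lambda_adot = 1 : 1e4 : 1 : 1e4 : 1 : 1, as in the text.
SCALES = {"p": 1e0, "v": 1e4, "R": 1e0, "w": 1e4, "a": 1e0, "da": 1e0}

def reward(e_p, e_v, e_R, e_w, a, da):
    """Negative cost: scaled stability errors plus control effort.

    e_p, e_v, e_R, e_w: nondimensional position, velocity, attitude, and
    angular velocity errors; a: current action; da: action change between
    successive strokes (assumed form of the second action term).
    """
    cost = (SCALES["p"] * np.sum(e_p**2) + SCALES["v"] * np.sum(e_v**2)
            + SCALES["R"] * np.sum(e_R**2) + SCALES["w"] * np.sum(e_w**2)
            + SCALES["a"] * np.sum(a**2) + SCALES["da"] * np.sum(da**2))
    return -cost

z = np.zeros(3)
r_eq = reward(z, z, z, z, np.zeros(4), np.zeros(4))  # equilibrium, no effort
r_v = reward(z, np.array([0.01, 0.0, 0.0]), z, z, np.zeros(4), np.zeros(4))
```

The 10⁴ weight on the velocity terms lifts a small nondimensional velocity error (0.01 here) to an O(1) cost, which is exactly the balancing role the scaling parameters play.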
Considering the realistic morphology and kinematics of insects, we set limitations on the action space, such as a maximum increase in stroke amplitude of 20% and a maximum deviation in the mean positional and feathering angles of 20°, to avoid overlapping of the two wings. We also modify the hyperparameters based on Haarnoja et al. [26] and utilize several tricks, such as a reward scale incorporated with SAC, to improve the training robustness. The training process, illustrated by a learning curve with the obtained reward at the end of each exploration episode, is shown in Figure 4, where the reward is maximized by driving the error between the current state and the equilibrium state toward zero. The training process shown in Figure 4 is quite similar to most successful DRL cases [25,26], in which the learning curve appears random and increases slowly at the beginning, while it rises fast and even becomes stable during the last several episodes. Since the SAC algorithm enhances the action selection randomness and meanwhile encourages more exploration during the training process [31], the actor generates random actions based on the current policy during the initial episodes, and the feedback of the environment is stored in the experience replay buffer for updating the network at each flapping stroke. After sufficient exploration over dozens of episodes, the updated policy provides actions that achieve better performance until the reward is optimized (the error tolerance is defined as |r − 0| < 0.5, corresponding to 5% of the initial reward value). The accumulated negative reward converges to its highest value, close to zero, after randomly giving deviations at the beginning of each episode and exploring actions for 5000 steps (50 flapping strokes per episode). The number of episodes, one of the initial hyperparameters, was previously set to 100, which was demonstrated to be sufficient for achieving the training performance.
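The action-space limits can be enforced by simple clipping before the commands reach the wing kinematics; the symmetric lower bounds below are our assumption, as the text only states the maxima:

```python
import numpy as np

# Action bounds from the morphological constraints: at most a 20% increase
# in stroke amplitude and at most 20 deg deviation in the mean positional
# and feathering angles, to avoid the two wings overlapping.
PHI_0 = 139.36  # trimmed stroke amplitude (deg)
ACTION_LOW = np.array([-0.2 * PHI_0, -20.0, -0.2 * PHI_0, -20.0])
ACTION_HIGH = np.array([0.2 * PHI_0, 20.0, 0.2 * PHI_0, 20.0])

def clip_action(a):
    """Project a raw policy output onto the admissible wing-kinematics range."""
    return np.clip(a, ACTION_LOW, ACTION_HIGH)

a = clip_action(np.array([50.0, -35.0, 10.0, 5.0]))
```

A raw output of 50° in stroke amplitude is truncated to the 20% bound (about 27.9°), while in-range components pass through unchanged.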



Stabilization Control under Large Perturbations
The trimmed state of a hovering bumblebee is illustrated in Figure 5, which reaches a stable periodic state with the initial trimmed wing kinematics and maintains equilibrium without perturbation for 10 strokes. A slight body oscillation is induced by the symmetric reciprocating motion of the two flapping wings (Figure 1), involving pitch motion, forward/backward motion, and vertical motion. The goal of flight control is to restore the attitude and position of the bumblebee after disturbances to [ρ₀, χ₀, ψ₀]ᵀ = [0, 45, 0]ᵀ (°) and [x₀, y₀, z₀]ᵀ = [0, 0, 0]ᵀ (mm). In our control results, all the pitch angles are illustrated as χ − 45°. Experiments on bumblebee flight control under large perturbations are conducted by applying large angular velocity perturbations along the body axes (x_b, y_b, z_b) and large velocity perturbations in the directions x_g, y_g, and z_g, which mimics the impact of wind-gust disturbance on the insect's body [2,32]. Even gentle air currents can cause large disruptions to the intended flight path [1], according to the perturbation experiments of bumblebees [13,32] and fruit flies [1,2]. We employ the trained deep reinforcement learning policy as the control strategy after adding the angular velocity disturbances 3% ω_ref (≈20 rad/s) and the velocity disturbances 3% U_ref (≈0.3 m/s) [13,32] to the trimmed hovering state of a bumblebee. Here, the reference angular
velocity and the reference velocity are defined as the wingtip angular velocity and wingtip velocity of the bumblebee in hovering flight, such that ω_ref = 2Φf and U_ref = 2ΦfR, where R denotes the wing length and Φ and f are the stroke amplitude and flapping frequency. The flight system in equilibrium with the initial trimmed wing kinematics is perturbed by the angular velocity and velocity disturbances at the first flapping stroke, persisting for one stroke cycle. After a time delay of 1T latency, the actions (active wing kinematics manipulation) generated by the DRL policy are added into the flight system. Figures 6 and 7 depict the control results in terms of the three body attitudes (roll, pitch, and yaw angles) and three body positions (X, Y, and Z) under the three horizontal, lateral, and vertical velocity perturbations, as well as the three roll, pitch, and yaw angular velocity perturbations, respectively. Although all the large perturbations in different directions result in deviations in rotational angles and body positions, the deep reinforcement learning (DRL) controller based on the action space of four wing kinematics can largely achieve the 6-DoF stabilization of bumblebee hovering flight even in this underactuated condition.
The rotational control based on the DRL policy can be achieved within around 20 wing-beat strokes, which is slightly slower than experimental observations of various insect flights [1][2][3][9,10]. More restoring time, approximately 40-50 strokes, is needed to obtain the translational control after large perturbations, which may be less essential compared with attitude stabilization [33]. The gust rejection and flight profile after the disturbance can be visualized through the dynamic sequence of the bumblebee motion, for instance, the detailed control process with the variation of body rotation and movement after the vertical velocity perturbation in the z_g direction in Figure 8. The hovering bumblebee in equilibrium (0 s) encounters a vertical velocity disturbance at the initial stroke cycle (~0.007 s), resulting in a rapid movement in the vertical direction. After a one-stroke time delay, it takes active wing kinematics manipulations for several flapping cycles, during which the bumblebee first pitches down 40° and restores its body attitude quickly within 0.161 s. The pitch response of the bumblebee induces backward and forward motion of the body, which takes more stroke cycles to return to the initial position. Although the translational control requires relatively more restoring time due to the indirect adjustments from an action space of 4 under full 6-DoF disturbances, the underactuated DRL controller has great potential to simplify the actuator-based fabrication of flapping-wing MAVs.
The quantification analysis of the control results of the deep reinforcement learning (DRL) strategy and a comparison to the traditional PD control strategy are further provided.
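The perturbation magnitudes quoted above follow directly from the reference scales ω_ref = 2Φf and U_ref = 2ΦfR; a quick numerical check with the model parameters:

```python
import numpy as np

# Reference scales for the perturbation magnitudes, using the morphological
# and kinematic parameters given earlier in the paper.
Phi = np.deg2rad(139.36)  # stroke amplitude (rad)
f = 136.0                 # wing-beat frequency (Hz)
R = 15.2e-3               # wing length (m)

omega_ref = 2 * Phi * f        # wingtip angular velocity (rad/s)
U_ref = omega_ref * R          # wingtip velocity (m/s)
omega_pert = 0.03 * omega_ref  # ~20 rad/s angular velocity disturbance
U_pert = 0.03 * U_ref          # ~0.3 m/s velocity disturbance
```

The computed 3% disturbances come out near 20 rad/s and 0.3 m/s, matching the values stated in the text.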
Here, two indices are introduced for the evaluation of control performance: the maximum attitude or position displacement d_max from the equilibrium state and the correction time t_c expressed in wing-beat cycles [2]. Through calculating the rotational and translational differences ('errors') between 0 and 50 wing-beat cycles [2], 80% of the response curves of the attitude and position induced by 6-DoF disturbances can be effectively restored toward the stable state within 10% of the maximum displacement, which indicates the control capability of the deep reinforcement learning strategy. Tables 1 and 2 show the detailed values of d_max and t_c based on the time evolutions of body attitudes and positions under horizontal, lateral, and vertical velocity perturbations using the current DRL controller and the traditional PD controller [13]. The maximum displacements of the roll, pitch, and yaw attitudes are comparable in the DRL and PD controls, in which the mean values turn out to be 28° ± 17° for the DRL controller and 26° ± 16° for the PD controller. However, lower displacements exist in the position control of X, Y, and Z with the DRL controller, whose mean d_max shows a reduction of 40% compared with the PD control results. Moreover, although the DRL controller requires slightly more correction time t_c for translational deviations, it presents a significant advance in rotational stabilization, with a time saving of 50% (19.5 ± 3.5 wing beats) compared with the PD controller (37.3 ± 9.7 wing beats). Better control performance in terms of displacement reduction and restoring time demonstrates the superiority of deep reinforcement learning compared to traditional linear strategies.
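The two indices can be computed from any simulated response curve; a sketch with the 10%-of-maximum settling criterion, applied here to a synthetic decaying error signal (the helper name and the "stays within tolerance" reading of t_c are our interpretation):

```python
import numpy as np

def control_metrics(t_beats, response, tol=0.10):
    """Evaluate d_max and t_c from a response ('error') curve.

    d_max: maximum displacement from equilibrium (response is zero at
    equilibrium). t_c: first wing-beat after which the response stays
    within `tol` (10%) of d_max; returns None if it never settles.
    """
    d_max = np.max(np.abs(response))
    settled = np.abs(response) <= tol * d_max
    for i in range(len(t_beats)):
        if settled[i:].all():
            return d_max, t_beats[i]
    return d_max, None

t = np.arange(50.0)                # 50 wing-beat cycles, as in the paper
resp = 30.0 * np.exp(-t / 5.0)     # synthetic decaying attitude error (deg)
d_max, t_c = control_metrics(t, resp)
```

For this synthetic exponential, the error first stays below 10% of its 30° peak at wing-beat 12, which is the kind of value Tables 1 and 2 report for real response curves.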

Physical Mechanisms of Control Strategy
The control strategy, with action inputs expressed as the wing kinematics manipulations of the left and right wings, is shown in Figures 9 and 10.Since the bumblebee normally activates its muscles once in one stroke cycle [34], the control policy applies actions to the wing kinematics at the beginning of each stroke cycle.A smooth step function is further employed to ensure the wing kinematics transition between successive strokes, with a transition time of 0.1T [13].The trained policy issues commands with symmetric and asymmetric variations in the positional and feathering angles of the two wings, which are highly correlated with the generation of aerodynamic forces and torques (Figure 2), resulting in the physical response of the flight system.For instance, the control strategies for velocity disturbances in the x g , z g directions and angular velocity disturbances along the body axis y b show significant symmetric variations in the amplitude and mean value of the positional angle.The pitch torques T y , largely generated by a symmetric mean positional angle, as well as the vertical and horizontal forces F z , F x , mainly from the symmetric stroke amplitude, dominate the marked pitch-up/down deviations and meanwhile induce the forward/backward and vertical motions.Similarly, significant asymmetric variations in the stroke amplitude and mean feathering angle of the left and right wings dominate the control strategies for velocity disturbance in the y g direction and angular velocity disturbances along the body axes x b , z b .The remarkable rotational responses in the roll and yaw directions with lateral movements appear due to the synchronous or opposite roll and yaw torques T x , T z induced by the asymmetry in the stroke amplitude and mean feathering angle, as well as the lateral forces F y produced by the asymmetry in the mean values of the left- and right-wing feathering angles.
The flight system is highly coupled, as the body's natural modes of motion couple with the periodic aerodynamic and inertial forces associated with the flapping wings [4].Strong coupling between roll and yaw motions can be noticed in the time evolutions of Figure 7a, and the lateral velocity perturbation in the y g direction may also induce significant rotational deviations in the roll angle (Figure 6b).This coupling can be explained by the aerodynamic performance of the leading-edge vortex (LEV): the side-translational velocity may cause a difference in the relative velocities of the left and right wings as well as in the axial velocities of the LEVs [35].This leads to an asymmetry in the aerodynamic lift production of the two wings, which further generates a roll moment for body rotation.Moreover, due to the asymmetric modulations in the stroke amplitude and mean feathering angle of the left and right wings, significant synchronous or opposite roll and yaw torques T x , T z as well as moderate lateral forces F y are produced, forcing the coupled sideways motions of the insect's body.Meanwhile, a significant pitch deviation and vertical motion in the z g direction can be caused by a horizontal velocity perturbation in the x g direction (Figure 6a).These coupling features can be explained by the variation in aerodynamic drag in both the down- and up-strokes due to the varied relative velocity, which causes a cycle-averaged horizontal force around the center of mass, producing a pitch moment for body rotation [35].Additionally, the pitch torques T y generated by the symmetric variation in the mean positional angle, as well as the vertical and horizontal forces F z , F x produced via symmetric stroke-amplitude manipulation, further dominate the coupled longitudinal responses in terms of the forward/backward and vertical motions as well as the pitch-up/down deviations.
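The per-stroke action application with a 0.1T smooth-step transition described above can be sketched as follows. This is a minimal illustration: the cubic smoothstep shape is our assumption (the paper cites [13] for its exact transition function), and the function names are ours:

```python
def smooth_step(x):
    """C1-continuous cubic smoothstep: 0 -> 1 as x goes 0 -> 1.
    Assumed blend shape; the paper's exact function follows [13]."""
    x = min(max(x, 0.0), 1.0)
    return x * x * (3.0 - 2.0 * x)

def blended_parameter(p_prev, p_next, t_in_cycle, T=1.0, t_trans=0.1):
    """Blend a wing-kinematics parameter (e.g. stroke amplitude in degrees)
    from its previous-cycle value to the newly commanded one over the
    first t_trans * T of the stroke cycle, so successive strokes join
    smoothly. The 0.1T transition window follows the text."""
    w = smooth_step(t_in_cycle / (t_trans * T))
    return (1.0 - w) * p_prev + w * p_next
```

At t = 0 the previous value is returned, and for t ≥ 0.1T the newly commanded value holds for the remainder of the stroke cycle, so the policy's once-per-cycle action never introduces a kinematic discontinuity.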
Previous studies on linear control strategies employed eigenvalue and eigenvector analyses to decouple and resolve the longitudinal and lateral equations [6,7,14,15].However, the sideways motions also affect the longitudinal motions, as large angular perturbations along the roll and yaw axes may induce marked deviations in the pitch angle as well as forward/backward and vertical motions, according to the responses shown in Figures 6 and 7.The linear assumption may not be feasible because nonlinearity still exists in the correlations between the wing kinematics modulation and the production of aerodynamic forces and torques.Thus, the DRL controller enables flight control in highly coupled and nonlinear systems without prior decoupling.More importantly, the determination of precise control parameters via the Laplace transformation and root-locus approach [11,13] in traditional linear strategies is not necessary for the DRL control strategy, which has proved to be of great potential for fast policy achievement without precise treatments for control parameter implementation.Therefore, the 6-DoF, four-action wing kinematics-based DRL control strategy can further simplify actuator-based fabrication and inspire autonomous controller design for insect-inspired flapping-wing MAVs.

Conclusions
In this study, we have developed an integrated simulation framework with a bioinspired flight intelligence controller optimized by deep reinforcement learning (DRL), tasked with achieving bumblebee hovering stabilization under large perturbations.A high-fidelity OpenAI Gym environment is established by coupling a CFD data-driven aerodynamic model and a 6-DoF flight dynamic model, tailored to provide fast aerodynamic prediction and to mimic different flight conditions.We propose a unique wing kinematics-based flight control strategy optimized using the Soft Actor-Critic (SAC) algorithm, which proves successful under an underactuated condition with an action space of 4 for stabilization against full 6-DoF disturbances.Fast control after large perturbations is obtained in body attitude stabilization of the yaw, pitch, and roll angles, while more wing-beat cycles are required for body position stabilization of the horizontal, lateral, and vertical motions.Better control performance in terms of displacement reduction and restoring time demonstrates the superiority of deep reinforcement learning over traditional linear strategies.The DRL controller enables flight control in highly coupled and nonlinear systems without prior decoupling, and has great potential for fast control-policy achievement without precise treatments for control parameter implementation.This 6-DoF wing kinematics-based DRL control strategy may provide an efficient autonomous controller design for bioinspired flapping-wing micro air vehicles.

Figure 1 .
Figure 1. Morphological and kinematic models of bumblebee (Bombus terrestris): (a) Schematic of kinematic parameters defined in global (x g , y g , z g ) and body-fixed (x b , y b , z b ) coordinate systems.The roll angle ρ, pitch angle χ, and yaw angle ψ of the insect's body are determined along the body axes x b , y b , and z b , respectively; (b) Wing kinematics of bumblebees in hovering flight are based on the experimental observations of Kolomenskiy et al. [28], where the positional angle ϕ (red), elevation angle θ (blue), and feathering angle α (green) are expressed in a Fourier series.
b F and b M calculated via CDAM denote the aerodynamic forces and torques on the body and two wings. b v cg represents the velocity of the body's center of mass, b ω bd denotes the angular velocity of the body, and ω R0b , ω L0b represent the angular velocities of the right and left wings. The coefficients A 2oR and A 2oL can be expressed as

Figure 3 .
Figure 3 illustrates the schematic diagram of the wing kinematics-based bumblebee flight control system, where deep reinforcement learning gives solutions for controller design.The state transition for generating the next state can be achieved through our bumblebee environment based on the closed-loop flight dynamic model with a feedback controller.

Figure 2 .
Figure 2. Aerodynamic forces and torques induced through wing kinematics variations: symmetric stroke amplitude variation ∆∅; symmetric mean positional angle variation ∆ϕ; asymmetric stroke amplitude variation between right and left wings ∆∅ RL ; and asymmetric mean feathering angle variation between right and left wings ∆α RL .Dotted region: initial wing motion for trimmed hovering flight; shaded region with solid line: manipulated wing kinematics.

e R denotes the attitude errors of ∆ψ, ∆χ, and ∆ρ, and e ω denotes the angular velocity errors; the action cost a t and the action changing rate enter the reward as penalty terms.

Figure 4 .
Figure 4. Training process illustrated by the learning curve with obtained reward at the end of each exploration episode.

Figure 5 .
Figure 5. Trimmed state of a hovering bumblebee in equilibrium without perturbation.

Figure 6 .
Figure 6.Attitude and position control results under velocity perturbations: (a) Horizontal perturbation in direction of x g ; (b) Lateral perturbation in direction of y g ; (c) Vertical perturbation in direction of z g .

Figure 7 .
Figure 7. Attitude and position control results under angular velocity perturbations: (a) Roll perturbation along body axis of x b ; (b) Pitch perturbation along body axis of y b ; (c) Yaw perturbation along body axis of z b .

Figure 8 .
Figure 8. Flight sequence of a hovering bumblebee after vertical velocity perturbation.

Figure 9 .
Figure 9. Wing kinematics manipulations of left and right wings under velocity perturbations: (a) Horizontal perturbation in direction of x g ; (b) Lateral perturbation in direction of y g ; (c) Vertical perturbation in direction of z g .

Figure 10 .
Figure 10.Wing kinematics manipulations of left and right wings under angular velocity perturbations: (a) Roll perturbation along body axis of x b ; (b) Pitch perturbation along body axis of y b ; (c) Yaw perturbation along body axis of z b .

Table 1 .
The maximum attitude or position displacements d max from the equilibrium state under horizontal, lateral, and vertical velocity perturbations using a proportional derivative (PD) controller and deep reinforcement learning (DRL) controller.

Table 2 .
The correction time t c expressed in wing-beat cycles under horizontal, lateral, and vertical velocity perturbations using a proportional derivative (PD) controller and deep reinforcement learning (DRL) controller.