Asymmetric Airfoil Morphing via Deep Reinforcement Learning

Morphing aircraft are capable of modifying their geometry configurations according to different flight conditions to improve their performance, such as by increasing the lift-to-drag ratio or reducing their fuel consumption. In this article, we focus on the airfoil morphing of wings and propose a novel morphing control method for an asymmetric deformable airfoil based on deep reinforcement learning approaches. Firstly, we develop an asymmetric airfoil shaped using piece-wise Bézier curves and modeled by shape memory alloys. Resistive heating is adopted to actuate the shape memory alloys and realize the airfoil morphing. With regard to the hysteresis characteristics exhibited in the phase transformation of shape memory alloys, we construct a second-order Markov decision process for the morphing procedure to formulate a reinforcement learning environment with hysteresis properties explicitly considered. Subsequently, we learn the morphing policy based on deep reinforcement learning techniques where the accurate information of the system model is unavailable. Lastly, we conduct simulations to demonstrate the benefits brought by our learning implementations and validate the morphing performance of the proposed method. The simulation results show that the proposed method provides an average 29.8% performance improvement over traditional methods.


Introduction
While unmanned aerial vehicles (UAVs) have played a crucial role in various civil and military missions, studies demonstrate that birds usually possess higher flight maneuverability and agility than comparatively-sized aircraft in complex and varying environments [1,2]. One of the critical advantages of birds is that they morph their wings and tails intricately to perform efficient behaviors including perching, hovering and maintaining stability under different flight conditions [3]. Such aerodynamic adaptability has aroused flourishing interest in the design and control of avian-inspired morphing UAVs [4][5][6]. In this work, we focus on aircraft capable of morphing their wings by modifying the geometric configuration of the airfoil shape, which refers to more specifically camber morphing [7]. Bird wings are usually cambered to generate sufficient lift force at a low angle of attack. The camber does not remain constant through their flight, and many observations show that birds actively control the camber of their proximal wing via remiges and modify the distal wing airfoil shape via their primary feathers [8,9]. These investigations provide insight into the study of airfoil-morphing aircraft. The benefits brought by camber morphing for aircraft include increasing lift, reducing drag and airframe noise mitigation [10]. Applications of camber morphing mechanisms include a lead-edge morphing combined smart droop nose design, which achieves high-lift performance with significantly reduced complexity and mass [11], and a flexible morphing trailing edge design with deformable ribs, which is used to enhance the Fowler flaps and act as a substitution for ailerons for civil transport aircraft [12]. approaches without accurate knowledge of system models, which adjusts the airfoils during flight procedures to track the optimal shapes in different flight conditions. The rest of this paper is organized as follows. In Section 2, the morphing method is developed in three steps. In Section 2.1, the asymmetric airfoil is shaped using piece-wise Bézier curves. In Section 2.2, the morphing airfoil is modeled by SMAs, and a dynamic model is derived. In Section 2.3, a deep reinforcement learning-based morphing policy is developed. In Section 3, simulations are conducted to validate the proposed methods. In Section 4, this work is summarized.

Asymmetric Airfoil Shape Modeling
Well-known methods of curve synthesis for airfoil design and optimization include splines (e.g., B-spline and Bézier curves) [42,43], free-form deformation (FFD) [44], and class-shape transformations (CST) [45]. In this work, Bézier curves are selected for their straightforward design procedure and simple calculations. The shape of the asymmetric airfoil is parameterized via N control points, which are connected by Bézier curves [18,20]. To generate an untangled shape, the control points are distributed in an annulus with a predefined inner radius R 1 and outer radius R 2 . Moreover, the annulus is partitioned into N sectors equally, in each of which a control point is placed. These points are sorted with respect to the azimuth and denoted as p i ∈ R 2 for i = 1, . . . , N in Cartesian coordinates. Each pair of adjacent points is augmented by two other points and then connected via a cubic Bézier curve. For each control point, p i , an auxiliary angle, is calculated to determine the tangent to the curve at this point, which is given by where θ i,i+1 is the angle between point p i and p i+1 , and α i ∈ [0, 1] is an averaging parameter to modify the local smoothness of the curve. Then, two augmented points for the curve between p i and p i+1 are calculated by where e i = cos(θ * i ), sin(θ * i ) , and the scale parameter η i controls the local curvature. The curve connecting p i and p i+1 is given by An example of a valid shape is illustrated in Figure 1. The morphing wings in this work take NACA-2424 [46,47] as the baseline airfoil.

Dynamic System of Airfoil Morphing
Denote the polar coordinate of each point p i as {r i , θ i } N i=1 . Then the shape of the airfoil is fully determined by the radius r i , the angle θ i and the auxiliary angle parameters α i , η i . In the flight procedure of the aircraft, we aim to morph the airfoil to maximize desired aerodynamic performances, such as the lift-to-drag ratio C l /C d , according to various conditions, including flight position, velocity or attack angle. With the combination of CFD solvers and optimization methods, the preferred airfoil shape at a given flight condition can be determined previously. Such optimized shapes serve as the reference or target airfoil shapes for the morphing task. Then, the problem of how to control the airfoil to achieve optimized shapes during the flight procedure where the flight condition varies is investigated. Since there has been a variety of investigations on the position control of DC motors [48], we assume that the polar angle of each control point tracks the reference trajectory well via motors. Additionally, we assume that the auxiliary angle parameters can be adjusted rapidly and accurately. Therefore, in this work, we focus on the optimal morphing in the aspect of modifying the radii of control points.
Smart materials, especially shape memory alloys (SMA), are adopted to realize the airfoil morphing, where the radii of control points are modified via adjusting the length of SMA wires. Firstly, the dynamic model is constructed between the wire temperature and radii. An SMA wire changes its length through the crystal phase transformation between martensite and austenite according to the temperature. The transitions to martensite and austenite have different start and end temperatures, which leads to the hysteresis properties of the strain with respect to the temperature. Instead of common methods such as Preisach model and Krasnosel'skii-Pokrovskii model [49], the SMA hysteresis is characterized using hyperbolic tangent functions for their efficiency in computation and accuracy in curve fitting [41]. The strain is replaced by a radius factor 1]. For heating and cooling starting with temperatures outside the transformation region, namely that the initial temperature is not between the end temperatures of the phase transformations, the hysteresis properties are modeled by the major hysteresis loops as where T denotes the temperature, and h 0 , c tl , c tr , c b , w, c s parameterize the shape of the curves. Values of such parameters are chosen to fit the experimental data of SMA [40,41]. The radius factor γ varies according to the f major l curve when the temperature decreases and f major r when the temperature increases. Furthermore, switching the temperature direction during the transformation procedure causes a reverse transformation starting from the current temperature and strain, which is not on the major loop of the reverse transformation. Such transforms are modeled using minor hysteresis loops, which are modeled by hyperbolic tangent curves with similar shapes as the major loops. The function of a rising minor loop is given as where h is selected to ensure the intersection of the consecutive curves at the current point and is given by and h prev is the height parameter of the previous curve. Functions for the lowering curves are analogous as and An illustration of the transformation procedure is given in Figure 2. After constructing the temperature-strain model, resist heating is used to actuate the SMA wires [50]. Given the applied voltage v i , the temperature T follows the heat transfer model [39] where m w is the mass per unit length of the SMA wire, c w is the specific heat, R w is the electrical resistance per unit length, h w is the heat exchange coefficient, A w is the wire circumferential area, and T f is the airflow temperature. Combining the temperature-strain relationship and (10), it is shown that the dynamic system between the radius and the input voltage are highly nonlinear due to the hysteresis characteristics. An illustration on the dynamics of SMA wires driven by a sinusoidal voltage input is given in Figure 3. In this work, we tackle the morphing problem for the airfoil constructed by the SMA wires whose dynamics are given in (4)- (10). Note that the temperature-strain and voltagetemperature relationships modeled above are not directly accessible to our controller, but serve as the environment from which paths of the states can be sampled. We resort to deep reinforcement learning methods to design the morphing policy.

Reinforcement Learning based Morphing Control
Reinforcement learning (RL) methods are capable of learning a control policy from interactions between the given agent and environment, with no requirement on the knowledge of system models [51]. The learning procedures are based on Markov decision processes (MDPs), which are given by 4-tuples {S, A, R, P}, where S and A are the state space and action space containing all the states and actions, respectively, R is the reward function giving r k = R(s k , a k , s k+1 ) as rewards, and P is the transition function giving P(s k+1 |s k , a k ) as state transition probabilities. In this section, the airfoil morphing problem is solved in the RL framework, where we aim to find the optimal policy maximizing the expected total rewards.
Before choosing the states and actions for the morphing problem, the morphing system is investigated further to construct an MDP from it. Firstly, the voltage-temperature dynamics (10) of SMA wires is discretized via Euler methods as where ∆t > 0 is the discretizing time step, and With regard to the temperature-strain dynamics, since the minor loops converge with the major loops outside the SMA's transformation region, we assume that the major loops determine the initial states of the wires, and minor loops dominate the dynamics during the morphing procedure. We denote the function (8) as f l (T, h) and (6) as f r (T, h) for simplicity, and introduce the signum function Then, sgn(T k − T k−1 ) is used to discriminate the status of raising or lowering the temperature at time k. According to (11) and the fact that ∆t > 0, we describe the temperature direction at time k by Note that the strain is dependent on the height parameter of the current loop. Recall from (7) and (9) that the value of the height parameter changes when the direction of temperature switches. Then, the time-varying parameter is determined by where detects the reversal of temperature direction. Subsequently, the radius factor given the temperature and height parameter is calculated by choosing the rising or lowering loop according to the current temperature direction as Summarizing the relationships (11), (15) and (17), it seems appropriate to select T and h as states and v as actions to construct a RL environment for airfoil morphing. The radius factor can be treated as an observation since it is dependent on only the current voltage, temperature and height parameter. Since the length and temperature of SMA wires can be measured directly via sensors equipped on the airfoils, these values are assumed to be available. However, the height parameter is not a realistic physical characteristic but just a coefficient fitting the hyperbolic tangent curves to the actual SMA properties, which makes the measurement on h not available. Actually, we aim to adjust the position of control points to achieve a reference airfoil shape, such that we want to learn a policy on modifying the lengths (i.e., the radius factors). We observe from the loop functions (6) and (8) that, given the temperature T, the function of γ with respect to h is bijective. Therefore, denoting the relationship (17) as , we can obtain an inverse function as Combining (15) and (18), the dynamics of radius factors are described as Note that χ i,k−1:k is dependent on the temperature and voltage at time k − 2. This makes the transition function of γ a second-order difference equation. Proposition 1. Given a second-order MDP with states s ∈ R n s and actions a ∈ R n a satisfying p(s k |s k−1 , a k−1 , s k−2 , a k−2 , . . . , s 0 , a 0 ) = p(s k |s k−1 , a k−1 , s k−2 , a k−2 ) k ≥ 2 (20) select s k = s k , s k−1 and a k = a k , a k−1 . Then, an MDP can be constructed by states s and actions a (with an additionally defined state transition function and reward function). Moreover, if the second-order MDP with s and a has a deterministic state transition function then the MDP constructed by s and a satisfies the transition function s k = f s I n s , 0 s k−1 , I n a , 0 a k−1 , 0, I n s s k−1 , 0, I n a a k−1 , 0, I n s s k−1 This proposition can be derived directly via the properties of MDPs. According to Equations (11), (19) and Proposition 1, it is reasonable to choose the radius factors γ i,k−1 , γ i,k and temperatures T i,k−1 , T i,k as states with input voltages v i,k−1 , v i,k as actions. We restrict the states by γ ∈ [0, 1] and the actions by v ∈ [0, V max ].
The reference radius factors of the optimized airfoil shape under certain condition c k are denoted as γ ref i,k (c k ). When determining the reward, we expect the airfoil to morph to reference shapes accurately and rapidly. Therefore, a sparse reward function comprised of two components at time k is designed by where the position reward conveying the requirement of morphing accuracy is given by (24) and the voltage reward aiming to increase the morphing speed is given by where r p > r v > 0 and e thr > 0 are tunable hyperparameters. When the position error is small, the voltage reward is eliminated to mitigate the oscillation. The choice of adjacent voltages as actions permits the calculation of the voltage reward in the MDP framework. Furthermore, the total return to be maximized is given by where τ = (s 0 , a 0 , c 0 , s 1 , a 1 , c 1 . . . ) denotes the sequence of states, actions and conditions. After establishing the MDP, we proceed to tackle the morphing task based on deep reinforcement learning techniques. Our learning method is designed based on the soft actor-critic (SAC) algorithm [52], which is an off-policy reinforcement learning method compatible with continuous state and action spaces. In the actor-critic framework, the agent learns to interact with the environment and obtain maximum rewards via training two types of neural networks iteratively. The first one is named a critic network and accepts current states and actions as input to approximate the action-value function, which serves as an evaluation of the current policy. The second one is denoted as an actor network and generates actions according to the system states and optional external inputs. After the training is converged, the policy is determined by the actor network and conducted for online executions, which in this work is the morphing task. In SAC, a stochastic policy is learned with additional entropy regularization in the rewards, which improves the ability of exploration and achieves faster convergence for a variety of control problems. According to the MDP of the morphing procedure, we choose the state s k to be s k = s 1,k−1 , s 1,k , . . . , s N,k−1 , s N,k and the action as a k = a 1,k−1 , a 1,k , . . . , a N,k−1 , a N,k , where s i,k = [γ i,k , T i,k ] and a i,k = v i,k . Note that since the reference airfoil shapes guide the morphing, the flight condition should be incorporated for the generation and evaluation of the actions. We denote the distribution of the stochastic policy as π(a|s, c).
With regard to the construction of the action-value function, instead of directly applying (23), we augment the reward with policy entropy as where is the entropy representing the randomness of the policy, and β > 0 is the trade-off coefficient. According to (26), we define a finite-horizon undiscounted return to be maximized. Nevertheless, a discount factor is applied when evaluating the value functions to focus on recent rewards, since the future reference shapes are not accessible at the current time step. Then, the action-value function with the regularized reward function is introduced as Afterwards, the Bellman equation for the action-value function is given as Q π (s k , a k , c k , k) = r k + E s k+1 ∼P,a k+1 ∼π [ξ(Q π (s k+1 , a k+1 , c k+1 , k + 1) − β log π(a k+1 |s k+1 , c k+1 ))] (30) and following the schedule of SAC [52], we approximate the expectation by Q π (s k , a k , c k , k) ≈ r k + ξ(Q π (s k+1 ,ã k+1 , c k+1 , k + 1) − β log π(ã k+1 |s k+1 , c k+1 )) (31) where {s k , a k , c k , s k+1 , c k+1 } are sampled from replay buffers, and the next actionã k+1 is sampled from the current policy π(·|s k+1 , c k+1 ). For the learning of the action-value functions, the double-Q trick is applied to avoid overestimation [53]. Two critic networks for Q functions are implemented as Q φ 1 (s, a, c, k) and Q φ 2 (s, a, c, k), where φ 1 and φ 2 are parameters. Additionally, for the stabilization of the training procedure, the target networks Q φ 1 and Q φ 2 , which are copies of Q φ 1 and Q φ 2 , are used and updated by polyak averaging after each time we update the main critic networks as where ρ ∈ (0, 1) is the update hyperparameter. Summarizing all these settings, the loss for critic networks is given by the mean squared Bellman error function as where B is the sampled batch with elements b = {s k , a k , c k , r k , k, s k+1 , c k+1 , k + 1} of the replay buffer, and where y(s k+1 , a k+1 , c k+1 , k + 1) = ξ Q φ (s k+1 ,ã k+1 , c k+1 , k + 1) − β log π(ã k+1 |s k+1 , c k+1 ) (34) is calculated using target networks. Stochastic gradient descent is applied to update φ 1 and φ 2 with respect to the loss functions L(φ 1 ) and L(φ 2 ). Subsequently, we aim to find the policy that maximizes the expected action-value function with respect to the actions. Denote the parameters of actor network as θ. Since the action at time step k is composed of the input voltages at k − 1 and k, the actor network should be designed as π θ (s k , c k , a k−1 ), where an identity layer is applied to propagate the previous voltages. However, the distribution of new input voltages is still only dependent on s k and c k . Therefore, we use π θ (a k |s k , c k ) to denote the density of action a k . Then, the value function to be optimized is given as The reparameterization trick is adopted here for the sake of the efficient computation of gradients [54]. We introduce a standard normal distributed variable ζ ∼ p(ζ) = N (0, I n a ) and calculate the input voltage according to a deterministic squashing function as where µ θ (s, c) and σ θ (s, c) are parameterized neural networks and • denotes element-wise multiplication. Then, the expectation over a k ∼ π can be converted to the expectation over the normal variable whose distribution is irrelevant to the states and net parameters. Additionally, the squashed Gaussian policy constrains the input voltages in [0, V max ]. According to (36), the action given s k , c k and ζ k is written as a k = a θ (s k , c k , ζ k ). Note that this is an invertible map between a k and ζ k . Therefore, we can compute the log-probabilities in closed form according to the change of the variable formula [55] as where a −1 θ is the inverse function of a θ given s k and c k , and J denotes the Jacobian matrix. With Q π θ approximated by the minimum of the two critic networks, the loss function for the actor network is obtained as Then, we conduct the training by updating the actor and critic networks iteratively. The agent interacts with various randomly generated time-varying reference shape sequences to acquire training data, which faciliates the exploration and enables the policy to handle different morphing scenarios. When the training converges, we can use the actor network to calculate the required voltages and morph the airfoil to the reference shapes. Finally, the overall flowchart of the proposed morphing mechanism is given in Figure 4.

Results
In this section, a simulation is conducted to validate the proposed morphing method. The simulation is arranged in two stages. Firstly, we implement our method with random generated reference shapes and perform ablation studies to examine the superiorities brought by different parts of our algorithm. Subsequently, we apply the proposed method to track optimized airfoil shapes in different flight conditions and show the morphing procedures. The values of parameters in our simulation are given in Table 1 [40,41].

Tracking Random Shapes
In this stage, the superiorities of our method are illustrated in a variety of perspectives, including the state/action selection, reward configuration and entropy regularization. Without loss of generality, piece-wise constant trajectories of the radius factor for one control point are generated to represent the reference shapes. Then, our method is compared with three different settings of RL algorithms. The proposed method is denoted as RLM-SAC.

•
Second-order state/action versus first-order state/action In Section 2, second-order MDP is adopted to model the hysteresis characteristics of the morphing system. Therefore, we chose the states and actions as combinations of that in current step and previous step. We compared the performance with RL algorithms where the policy is generated according to only current states, and the value function was also evaluated with only current states and actions as inputs that are applied in existing investigations on controling SMA wires. We refer to this as RLM-FO. • Sparse reward versus squared error reward We designed a sparse reward taking value in {0, 1}, which is different from traditional RL-based morphing research. We compared that with the square error rewards, which is given by which is named RLM-SER. • SAC versus DQN The entropy regularization improves the capability of exploration in our algorithm. A modified deep Q learning method was implemented as a comparison, where only the entropy loss was removed, and both the double-Q setting and reparameterization trick remained. We denote this as RLM-DQN.
All RL realizations were trained through 150 epochs, in each of which 5 episodes with 40 s of time and 200 time steps were executed. The critic network was constructed using multilayer perceptrons (MLP) of 3 hidden layers and 128 units per layer. The actor network adopted similar structures, where additional fully connected layers were attached to produce the mean and standard variations of the policy. The training was started with actions uniformly sampled from the valid action space bounded by [0, V max ] for 5000 steps to explore the state space sufficiently. Then, the networks were updated every step with a batch size of 200. A fixed learning rate was set as 0.002, and other hyperparameters used in RL training are shown in Table 2. The actions were generated from the stochastic policy in training phase but produced in a deterministic way according to the mean value of the actor network in the testing phase. The results are shown in Figures 5-8. In Figures 5 and 6, we illustrate the rewards acquired in the training and test trajectories, respectively, during the training procedure. A Savitzky-Golay filter [56] was adopted to smooth the data such that the values and trends of the rewards are illutstrated more clearly. Note that the reward of RLM-SER was not included because of a different reward setting. After the training was finished, the algorithms were executed on 100 random generated test reference trajectories. We present the root-mean-squared error (RMSE) of the radius factor through the flight time in Figure 7. Some of the trajectories and the corresponding results produced by the four RL realizations are depicted in Figure 8 for an intuitive comparison.
From the results, we can see the benefits of each important component in our algorithm. Firstly, RL using square error reward totally fails to produce effective actions in our environment, which is shown from both the RMSE and example trajectories. RLM-FO acquires inferior performance compared with RLM-SAC, especially in the temperature-switching procedure, which can be validated by the middle sections of the example trajectories. This is a result of the fact that the hysteresis cannot be characterized well by first-order states and actions. Lastly, RLM-DQN obtains better performance than RLM-FO and RLM-SRE, and it achieves a similar reward to RLM-SAC at the end of training. However, from the illustrations of rewards in Figures 5 and 6, it is shown that RLM-DQN converges much slower than RLM-SAC. This is due to the improvement of exploration capability provided by entropy regularization.

Morphing Procedure Simulation
In this stage, the trained actor network is applied to morph an airfoil controlled by four points in a given flight procedure with varying flight conditions. Since the focus of this work is morphing control, in each condition, the optimal shape is assumed to be solved in anticipation by shape optimization techniques and determined by the radii and angles of control points [14,18,20]. The average coefficient and scale coefficient are fixed as α = [0.12, 0.4, 0.4, 0.12] and η = [0.5, 0.5, 0.5, 0.5]. The trajectory of flight conditions and the corresponding parameters of optimal shapes are shown in Figure 9. Since the position control technique of motors is relatively sufficiently developed, we assumed that the angles of the points are controlled by DC motors with accurately known linear dynamic models, which can drive the points to desired angles rapidly with subtle errors. Then we generate input voltages to heat the SMA wires and adjust the radius factors. The voltages, temperatures and radius factors of all points are summarized in Figure 10. It is shown that with constrained voltages, the wires can track the reference lengths well. Furthermore, we illustrate the morphing procedures in Figure 11. We can see intuitively that with the proposed RLM-SAC method, the airfoil is capable of morphing into the optimized shape within about 3 s after encountering a new flight environment, which validates both the morphing accuracy and morphing speed.
Quantitative comparisons on the length factor differences and shape differences are given in Table 3. The average length factor differences are calculated according to the reference length factor and actual length factor as The difference between the actual and reference shapes are evaluated using distances between the control points. An average shape difference over all time is calculated as where ||p i − p ref i || 2 is the L 2 distance between the two Cartesian coordinates p i,k and p ref i,k . Additionally, the steady shape error is evaluated by where k t denotes the end time step of each flight condition. It is shown that our method acquires the best performance on all metrics. The proposed RLM-SAC method provides an average 29.8% performance improvement over the second-best RLM-DQN method. The length differences and average shape differences, which are averaged over all time steps, demonstrate that our method can morph the airfoil into desired shapes more rapidly, while the steady shape difference validates the morphing accuracy. Lastly, in this work, we give the reference airfoil shape directly and focus on the morphing performance of the proposed method. It will be interesting and meaningful to combine the shape optimization task with the morphing control problem in the future. Figure 11. Illustration of the morphing procedure. Each row represents a morphing stage.

Conclusions
In this work, a novel deep reinforcement learning-based morphing control method is proposed for an asymmetric morphing airfoil. The airfoil is designed via Bézier curves and is capable of morphing from a baseline shape to an asymmetric shape. The morphing mechanism is modeled via SMA wires, which adjust shape parameters, especially the radii of the control points. To actuate the SMA wires, resistive heating is performed, but the hysteresis characteristics between the SMA strain and temperature make the dynamic system nonlinear and non-Markovian, which brings difficulties to the design of the control algorithm and the RL framework. Therefore, hyperbolic tangent curves are adopted to model the strain-temperature relationship and derive a second-order MDP describing the system, which is then transformed into a valid MDP and provides guidance for the selection of states and actions. Based on the constructed MDP, we modify the SAC algorithm and develop an RL scheme where input voltages are generated to morph the airfoil instantaneously according to reference-optimized shapes. Lastly, ablation studies on random generated reference trajectories are conducted to demonstrate the benefits brought by different components of our RL implementations, and we perform simulations of morphing procedures to validate that our method is able to morph the airfoil into the optimized shapes rapidly and accurately. Future works include incorporating the aerodynamic performance optimization directly into the morphing control and exploiting learning-based morphing policies for more complicated bio-inspired morphing aircraft.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: