## 1. Introduction

Various Approximate Dynamic Programming (ADP) methods have been employed to solve the optimal control problems of single- and multi-agent systems [1,2,3,4,5,6]. They are divided into different classes according to the way the temporal difference equations and the associated optimal strategies are evaluated. The ADP approaches that consider gradient-based forms converge quickly, but they require complete knowledge of the dynamical model of the system under consideration [7]. The flexible wing control problem requires model-free approaches, since the aerodynamics of flexible wing aircraft are highly nonlinear and vary continuously [8,9,10,11,12,13,14,15,16]. This type of aircraft has large uncertainties embedded in its aerodynamic model. Herein, an online adaptive learning approach, based on a gradient structure, is employed to solve the challenging control problem of flexible wing aircraft. This approach does not need any aerodynamic information about the aircraft; it is based on a model-free control strategy approximation.

Several ADP approaches have been adopted to overcome the difficulties associated with dynamic programming solutions, which involve the curse of dimensionality in the state and action spaces [2,3,4,5,17,18]. They are employed in applications such as machine learning, autonomous systems, multi-agent systems, consensus and synchronization, and decision making problems [19,20,21]. Typical optimal control methods solve the underlying Hamilton–Jacobi–Bellman (HJB) equation of the dynamical system by applying the optimality principles [22,23]. An optimal control problem is usually formulated as an optimization problem with a cost function that identifies the optimization objectives and a mathematical process to find the respective optimal strategies [6,7,18,22,23,24,25,26,27,28]. To implement the optimal control solutions stemming from the ADP approaches, numerous solution frameworks are considered based on combinations of Reinforcement Learning (RL) and adaptive critics [1,5,18,25,27]. Reinforcement Learning approaches use various forms of temporal difference equations to solve the optimization problems associated with dynamical systems [1,18]. This implies finding ways to penalize or reward the attempted control strategies in order to optimize a certain objective function. This is accomplished in a dynamic learning environment, where the agent applies its acquired knowledge to update its experience about the merit of the attempted policies. RL methods implement the temporal difference solutions using two coupled steps: the first approximates the value of a given strategy, while the second approximates the optimal strategy itself. The sequence of these coupled steps can be implemented with either a value or a policy iteration method [18]. RL has also been proposed to solve problems with multi-agent structures and objectives [29] as well as cooperative control problems using dynamic graphical games [21,26,30]. Action Dependent Dual Heuristic Dynamic Programming (ADDHP) depends on the system’s dynamic model [7,26,28]. Herein, the relation between the Hamiltonian and Bellman equation is used to solve for the governing costate expressions, and hence a policy iteration process is proposed to find an optimal solution. Dual Heuristic Dynamic Programming (DHP) approaches for graphical games are developed in [21,26,30]. However, these approaches require in-advance knowledge of the system’s dynamics and, in some multi-agent cases, they rely on complicated costate structures to include the neighbors’ influences.
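The coupled evaluation/improvement steps described above can be sketched as a minimal policy iteration loop for a small discrete Markov decision process. The transition and reward arrays below are purely hypothetical illustrations, not taken from any of the cited works:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (for illustration only).
P = np.zeros((2, 3, 3))                 # P[a, s, s'] transition probabilities
P[0] = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]
P[1] = [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]]
R = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])  # R[s, a] rewards
gamma = 0.9

policy = np.zeros(3, dtype=int)         # initial strategy
for _ in range(50):
    # Step 1: policy evaluation -- solve V = R_pi + gamma * P_pi @ V
    P_pi = P[policy, np.arange(3)]      # (3, 3) transitions under the policy
    R_pi = R[np.arange(3), policy]      # (3,) rewards under the policy
    V = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)
    # Step 2: policy improvement -- act greedily w.r.t. the evaluated value
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):  # strategy is stable: stop
        break
    policy = new_policy
```

A value iteration variant would instead interleave a single Bellman backup of `V` with the greedy improvement, rather than solving the evaluation equation exactly at each step.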

Adaptive critics are typically implemented within reinforcement learning solutions using neural network approximations [18,27]. The actor approximates the optimal strategy, while the value of the assessed strategy is approximated by the critic [18]. Real-time optimal control solutions using adaptive critics are introduced in [3]. Adaptive critics provide prominent solution frameworks for adaptive dynamic programming problems [31]. They are employed to produce expert paradigms that can undergo learning processes while solving the underlying optimization challenges. Moreover, they have been invoked to solve a wide spectrum of optimal control problems in the continuous and discrete-time domains, where actor-critic schemes are employed within an Integral Reinforcement Learning context [32,33]. An action-dependent solving value function is proposed to play zero-sum games in [34], where one critic and two actors are adapted forward in time to solve the game. An online distributed actor-critic scheme is suggested to implement a Dual Heuristic Dynamic Programming solution for dynamic graphical games in [7,24] without overlooking the neighbors’ effects, which is a major concern in the classical DHP approaches. The solution provided by each agent is implemented by a single actor-critic approximator pair. Another actor-critic development is applied to implement a partially model-free adaptive control solution for a deterministic nonlinear system in [35]. A reduced solving value function approach employed an actor-critic scheme to solve graphical games, where only partial knowledge of the system dynamics is necessary [26]. An actor-critic solution framework is adopted for an online policy iteration process with a weighted-derivative performance index in [33]. A model-free optimal solution for graphical games is implemented using only one critic structure for each agent in [25]. The recent state-of-the-art adaptive critic implementations for numerous reinforcement learning solutions to feedback control problems are surveyed in [36]. These involve the regulation and tracking problems for single- as well as multi-agent systems [36].
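To make the actor-critic idea concrete, the following is a minimal schematic for a scalar linear system with a quadratic cost: the critic learns a quadratic value approximation by temporal-difference updates, and the actor takes gradient steps on its feedback gain. The system coefficients, feature choice, and learning rates are all hypothetical; this is not the specific scheme of any of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar linear system x_{k+1} = a*x + b*u (illustration only).
a, b = 0.9, 0.5
q_cost, r_cost = 1.0, 0.1        # quadratic stage cost q*x^2 + r*u^2

w_c = 0.0                        # critic weight: V(x) ~= w_c * x^2
k_gain = 0.0                     # actor weight:  u = -k_gain * x
alpha_c, alpha_a = 0.05, 0.01    # critic / actor learning rates

for episode in range(200):
    x = rng.uniform(-1.0, 1.0)   # re-excite the system each episode
    for _ in range(20):
        u = -k_gain * x
        cost = q_cost * x**2 + r_cost * u**2
        x_next = a * x + b * u
        # Critic: temporal-difference update toward cost + V(x_next)
        td_error = cost + w_c * x_next**2 - w_c * x**2
        w_c += alpha_c * td_error * x**2
        # Actor: descend d(stage cost + V(x_next))/du; since du/dk = -x,
        # the descent direction in k is +dJ_du * x
        dJ_du = 2 * r_cost * u + 2 * w_c * x_next * b
        k_gain += alpha_a * dJ_du * x
        x = x_next

# After training, k_gain should stabilize the loop, i.e., |a - b*k_gain| < 1.
```

The two approximators play exactly the roles described above: the critic weight `w_c` scores the current strategy, and the actor gain `k_gain` is improved against that score, all forward in time and without re-deriving the model offline.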

Flexible wing aircraft are usually modeled as two-mass systems (fuselage and wing). The two masses are coupled via different kinematic and dynamic constraints [8,13,14,15,37], including the kinematic constraint at the connection point of the hang strap [38,39]. The keel tube acts as an axis of symmetry for this type of aircraft. The basic theoretical and experimental developments for the aerodynamic modeling of flexible wing systems are introduced in [8,13,14,15,40,41]. Several wind tunnel experiments on the hang glider are presented in [14]. An approximate modeling approach for the flexible wing’s aerodynamics led to small-perturbation equations of motion in the lateral and longitudinal directions in [42]. The modeling process for the hang glider in [11,12] assumed a rigid wing, where the derivatives due to the aerodynamics were added at the last stage. A comprehensive decoupled aerodynamic model for the hang glider is presented in [43]. A nine-degree-of-freedom aerodynamic model that employs a set of nonlinear state equations is developed in [38,39]. The control of the flexible wing aircraft follows a weight-shift mechanism, where the lateral and longitudinal maneuvers (i.e., the roll/pitch control) are achieved by changing the relative centers of gravity of the wing and the fuselage systems [9,10,13,14,37,44]. The geometry of the flexible wing’s control arm influences the maximum allowed control moments [9], and lowering the center of gravity magnifies the static pitch stability [9]. Frequency response-based approaches are adopted to study the stability of flexible wing systems in [11,12]. The longitudinal stability of a fixed wing system can be used to understand that of the flexible wing vehicle, provided that some conditions are satisfied [37]. The lateral stability margins are shown to be larger than those of conventional fixed wing aircraft.

The contribution of this work is four-fold:

An online adaptive learning control approach is proposed to solve the challenging weight-shift control problem of flexible wing aircraft. The approach uses model-free control structures and gradient-based solving value functions. This serves as a model-free solution framework for the classical Action Dependent Dual Heuristic Dynamic Programming problems.

The work handles many concerns associated with implementing value and policy iteration solutions for ADDHP problems, which either necessitate partial knowledge about the system dynamics or involve difficulties in the evaluations of the associated solving value functions.

The relation between a modified form of Bellman equation and the Hamiltonian expression is developed to transfer the gradient-based solution framework from the Bellman optimality domain to an alternative domain that uses Hamilton–Jacobi–Bellman expressions. This duality allows for a straightforward solution setup for the considered ADDHP problem. This is supported by a Riccati development that is equivalent to solving the underlying Bellman optimality equation.

The proposed solution, which is based on the combined-costate structure, is implemented using a novel policy iteration approach. This is followed by an actor-critic implementation that is free of computationally expensive matrix inverse calculations.

The paper is organized as follows:

Section 2 briefly explains the weight shift control mechanism of a flexible wing aircraft.

Section 3 highlights the model-based solutions within the framework of optimal control theory along with the existing challenges.

Section 4 discusses the duality between the Hamiltonian function and Bellman equation leading to the Hamilton–Jacobi–Bellman formulation, which is used to generalize the Action Dependent Dual Heuristic Dynamic Programming solution with a policy iteration process.

Section 5 introduces the model-free gradient-based solution and the underlying Riccati development.

Section 6 demonstrates the adaptive critics implementations for the proposed model-free gradient-based solution.

Section 7 tests the validity of the introduced online adaptive learning control approach by applying it to two case studies. Finally, Section 8 presents some concluding remarks.

## 2. Control Mechanism of a Flexible Wing Aircraft

This section briefly introduces the idea of weight-shift control along with a basic aerodynamic model of a flexible wing system. Herein, a flexible wing aircraft is modeled as a two-mass system (fuselage/pilot and wing) coupled through nonlinear kinematic constraints at the hang strap, as shown in Figure 1. The flexible wing is connected to the fuselage through a control bar. The aerodynamic forces are controlled via a weight-shift mechanism, where the fuselage’s center of gravity “floats” with respect to that of the wing [8,9,10,11,12,37,44]. Such a system is governed by complex aerodynamic forces, which make it difficult to model with satisfactory accuracy. Consequently, model-based control approaches may not be appropriate for the autopilot control of such systems.

In this framework, the longitudinal and lateral motions are controlled through the force components applied on the control bar of the hang glider [38,39]. This development adopts a nine-degree-of-freedom model that accounts for the kinematic interactions and the constraints between the fuselage and the wing at the hang point, as shown in Figure 1. The longitudinal and lateral dynamics are referred to the wing’s frame, where the forces (nonlinear state equations) at the hang point are substituted for by transformations in the wing’s frame [39].

The decoupled longitudinal and lateral aerodynamic models satisfy the following assumptions [39]:

The hang strap works as a kinematic constraint between the decoupled wing/fuselage systems.

The fuselage system is assumed to be a rigid body connected to the wing system via a control triangle and a hang strap.

The force components applied on the control bar are the input control signals.

External forces, such as the aerodynamics and gravity, the associated moments, and the internal forces, are evaluated for both fuselage and wing systems.

The fuselage’s pitch–roll–yaw attitudes and pitch–roll–yaw attitude rates are referred to the wing’s frame of motion through kinematic transformations.

The complete aerodynamic model of the aircraft is reduced by substituting for the internal forces at the hang strap using the action/reaction laws.

The pilot’s frames of motion (i.e., longitudinal and lateral states) are referred to the respective wing’s frames of motion.

The dynamics of the flexible wing aircraft are decoupled into longitudinal and lateral state-space systems, such that [9,37,39,44]

${\dot{\delta}}^{Lo}={A}^{Lo}{\delta}^{Lo}+{B}^{Lo}{u}^{Lo}, \phantom{\rule{2em}{0ex}} {\dot{\delta}}^{La}={A}^{La}{\delta}^{La}+{B}^{La}{u}^{La},$

where ${\delta}^{Lo}={\left[\begin{array}{cccccc}{\nu}_{aw}& {\nu}_{nw}& {\dot{\theta}}_{w}& {\dot{\theta}}_{fw}& {\theta}_{fw}& {\theta}_{w}\end{array}\right]}^{T}$ is the longitudinal state vector, ${\delta}^{La}={\left[\begin{array}{cccccccc}{\nu}_{lw}& {\dot{\varphi}}_{w}& {\dot{\psi}}_{w}& {\dot{\varphi}}_{fw}& {\dot{\psi}}_{fw}& {\varphi}_{fw}& {\psi}_{fw}& {\varphi}_{w}\end{array}\right]}^{T}$ is the lateral state vector, ${T}_{cq}=\frac{1}{2}\left({T}_{Rq}+{T}_{Lq}\right)$ is the collective force in direction $q$, ${T}_{dq}=\frac{1}{2}\left({T}_{Rq}-{T}_{Lq}\right)$ is the differential force in direction $q$, ${u}^{Lo}={\left[\begin{array}{cc}{T}_{cx}& {T}_{cz}\end{array}\right]}^{T}$ represents the longitudinal control signals, and ${u}^{La}={\left[\begin{array}{ccc}{T}_{cy}& {T}_{dx}& {T}_{dz}\end{array}\right]}^{T}$ denotes the lateral control signals.
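The collective/differential decomposition above maps directly to code. The following helper (the function and variable names are illustrative, not taken from the cited model) splits right/left control-bar force vectors into the longitudinal and lateral control signals:

```python
import numpy as np

def bar_forces_to_controls(T_R, T_L):
    """Map right/left control-bar force vectors (x, y, z components) to
    collective and differential components per axis:
        T_cq = (T_Rq + T_Lq) / 2,   T_dq = (T_Rq - T_Lq) / 2,
    then assemble u^Lo = [T_cx, T_cz]^T and u^La = [T_cy, T_dx, T_dz]^T.
    Names here are illustrative conventions, not from the cited model."""
    T_R, T_L = np.asarray(T_R, float), np.asarray(T_L, float)
    T_c = 0.5 * (T_R + T_L)          # collective force per axis
    T_d = 0.5 * (T_R - T_L)          # differential force per axis
    u_Lo = np.array([T_c[0], T_c[2]])
    u_La = np.array([T_c[1], T_d[0], T_d[2]])
    return u_Lo, u_La

# Example: a symmetric pull (pure pitch input) yields no lateral command.
u_Lo, u_La = bar_forces_to_controls([2.0, 0.0, -1.0], [2.0, 0.0, -1.0])
# u_Lo = [2.0, -1.0], u_La = [0.0, 0.0, 0.0]
```

An asymmetric (differential) input, e.g. opposite x-forces on the two sides, produces a purely lateral command, which matches the roll/yaw role of the differential components.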

The modeling results of the flexible wing aircraft are based on the experimental and theoretical studies of [9], where the control mechanism employs force components on the control bar [39].