In the reinforcement-learning framework, an agent and an environment are the two main elements, defined to interact with each other. This interaction is carried out through agent actions, which have an immediate effect on the environment (the consequence of which is returned to the agent as its state), and through environment rewards, which guide the evolution of the agent's behavior. Thus, in the reinforcement-learning paradigm, the agent must learn a policy $\pi \left({s}_{t}\right)$ that maximizes its expected accumulated reward ${R}_{t}$ through the execution of an action ${a}_{t}\in \mathbb{A}$ for each state ${s}_{t}\in \mathbb{S}$ at any time step t. The expected return, given a state and an action at time t, is

$${Q}^{\pi}({s}_{t},{a}_{t}) = \mathbb{E}\left[{R}_{t} \mid {s}_{t},{a}_{t}\right]$$

where ${Q}^{\pi}({s}_{t},{a}_{t})$ represents the action–value function. A broadly used reinforcement-learning approach is to learn this action–value function through an optimization process whose main objective is to estimate the optimal Q-function (which, for discrete action spaces, yields an optimal agent policy). Other algorithms aim at directly estimating the agent's optimal policy ${\pi}^{\star}\left({s}_{t}\right)$ by estimating the Q-function (or, equivalently, the V-function [50]) in parallel and computing the policy gradient [51]. These methods, which follow the actor–critic paradigm and are compatible with continuous state and action spaces, are known as policy-gradient methods; examples include Deep Deterministic Policy Gradients (DDPG) [51] and Trust-Region Policy Optimization (TRPO) [52]. Other recent algorithms, forming a family of natural policy-gradient algorithms named Proximal Policy Optimization (PPO) [53], optimize a surrogate objective function instead. This family of algorithms reduces complexity and increases both sample efficiency and performance. This increase in performance was a key consideration for the application under study, given the high-dimensional nature of its state and action spaces (refer to Section 4.3.1). Since PPO directly optimizes an objective function for the policy, each policy update must be constrained in order to avoid divergence and to ensure convergence. The strategies for constraining the policy update are diverse and a matter of ongoing research; the most common are based on a Kullback–Leibler divergence (KL-divergence) penalty and on direct clipping (see Equation (4)).

$${\mathcal{L}}_{{\theta}_{k}}^{\mathit{CLIP}}(\theta) = {\mathbb{E}}_{t}\left[\min \left({r}_{t}\left(\theta \right)\,{\widehat{A}}_{t}^{{\pi}_{k}},\ \mathrm{clip}\left({r}_{t}\left(\theta \right),\, 1-\epsilon,\, 1+\epsilon \right){\widehat{A}}_{t}^{{\pi}_{k}}\right)\right] \qquad (4)$$

where ${\mathcal{L}}_{{\theta}_{k}}^{\mathit{CLIP}}$ represents the objective function for the policy-weight update, ${r}_{t}\left(\theta \right)$ is the new-to-old policy ratio, ${\widehat{A}}_{t}^{{\pi}_{k}}$ is the advantage function for the policy update at time t and $\epsilon$ is the clipping constraint. In this work, owing to its computational simplicity and performance, PPO with a clipping penalty was selected to train the agent to accomplish the image-based multirotor-following maneuver with continuous state and action spaces.
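As a minimal illustration of the clipped surrogate objective, the sketch below computes it with NumPy over a batch of samples. The names (`ratio`, `advantage`, `clip_eps`) are ours, and a full PPO implementation would additionally include value-loss and entropy terms omitted here.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective L^CLIP (to be maximized).

    ratio:     pi_theta(a|s) / pi_theta_old(a|s), per sample
    advantage: advantage estimate A_hat under the old policy
    clip_eps:  clipping constraint epsilon
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the minimum removes the incentive to move the ratio
    # outside [1 - eps, 1 + eps], constraining the policy update.
    return np.mean(np.minimum(unclipped, clipped))

# A ratio of 1.5 with positive advantage is clipped down to 1.2:
objective = ppo_clip_objective(np.array([1.5, 0.9]), np.array([1.0, -0.5]))
```

The pessimistic `min` is what makes clipping a constraint rather than a bonus: an update is only rewarded up to the clipping boundary, never beyond it.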

#### 4.3.1. Problem Formulation

In our problem formulation, the reinforcement-learning agent perceives a continuous state ${s}_{t}\in \mathbb{S}$ at time t, as represented in Equation (5).

$${s}_{t} = \left({e}_{cx},\ {e}_{cy},\ {e}_{area},\ {\dot{e}}_{cx},\ {\dot{e}}_{cy},\ {\dot{e}}_{area},\ {\mathsf{\Theta}}_{x},\ {\mathsf{\Theta}}_{y}\right) \qquad (5)$$

where ${e}_{cx}$, ${e}_{cy}$ and ${e}_{area}$ represent the normalized errors of the current RoI with respect to the target RoI center position (in the x and y axes) and area; ${\dot{e}}_{cx}$, ${\dot{e}}_{cy}$ and ${\dot{e}}_{area}$ represent the normalized differences of these errors with respect to the previous time step; and ${\mathsf{\Theta}}_{x}$ and ${\mathsf{\Theta}}_{y}$ are the normalized angular states of the camera gimbal at the current time step. The state is represented in the Camera (C) frame of reference (see Figure 3) and $\mathbb{S}\in$ [−1, 1]. A virtual camera gimbal was added to the multirotor simulation in order to match the specifications of the real platform.
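A sketch of how such an 8-dimensional state could be assembled from RoI measurements is shown below. The helper signature and the gimbal normalization constant are our assumptions for illustration, not values taken from the text; all components are clipped to [−1, 1] as the formulation requires.

```python
import numpy as np

def build_state(roi, target_roi, prev_errors, gimbal_angles, max_gimbal=np.pi / 2):
    """Assemble the 8-dimensional state s_t, bounded to [-1, 1].

    roi, target_roi: (cx, cy, area) in normalized image coordinates
    prev_errors:     (e_cx, e_cy, e_area) from the previous time step
    gimbal_angles:   (Theta_x, Theta_y) in radians (assumed range +-max_gimbal)
    """
    e = np.array(roi) - np.array(target_roi)      # current normalized errors
    e_dot = e - np.array(prev_errors)             # error differences vs. previous step
    gimbal = np.array(gimbal_angles) / max_gimbal # normalized gimbal angular states
    return np.clip(np.concatenate([e, e_dot, gimbal]), -1.0, 1.0)

state = build_state(roi=(0.6, 0.5, 0.10),
                    target_roi=(0.5, 0.5, 0.15),
                    prev_errors=(0.05, 0.0, -0.05),
                    gimbal_angles=(0.1, -0.2))
```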

The reinforcement-learning agent performs a continuous action ${a}_{t}\in \mathbb{A}$ at time t, as represented in Equation (6).

$${a}_{t} = \left(\theta,\ \varphi,\ \dot{z},\ {\dot{\mathsf{\Theta}}}_{x},\ {\dot{\mathsf{\Theta}}}_{y}\right) \qquad (6)$$

where $\theta $ and $\varphi $ represent the multirotor absolute pitch and roll angles, respectively; $\dot{z}$ represents the multirotor altitude velocity; and ${\dot{\mathsf{\Theta}}}_{x}$ and ${\dot{\mathsf{\Theta}}}_{y}$ represent the camera gimbal angle increments in the x and y axes, respectively. The action space is represented in the Stabilized Multirotor (SM) frame of reference for $\theta $, $\varphi $ and $\dot{z}$, and in the C frame of reference for ${\dot{\mathsf{\Theta}}}_{x}$ and ${\dot{\mathsf{\Theta}}}_{y}$ ($\mathbb{A}\in $ [−1, 1]). The actions are forwarded directly to the multirotor FC as inputs through SDK commands, without postprocessing or filtering.

An important component of the reinforcement-learning formulation is the reward function r, given the high sensitivity of current techniques to reward-function design. Although some techniques can cope with low-frequency and sparse reward functions [54], most require a more behavior-guided design. In this scenario, a well-suited reward function can decrease training times but, conversely, a weak design can introduce human bias into the final policy or even completely prevent the agent from learning a stable policy. In the presented formulation, the reward function was designed in a scheduled fashion: the main application goal, keeping the target multirotor in the image plane, is rewarded most highly, while safe and smooth movements are encouraged at the same time. The resulting reward function r is

where $cx$, $cy$ and $area$ represent the center (in the x and y axes) and the area of the current RoI, respectively; ${g}_{1}$, ${g}_{2}$, ${g}_{3}$ and ${g}_{4}$ are experimentally defined constants (100, 65, 50 and 30, respectively); ${r}_{1}$ and ${r}_{2}$ prevent the agent from exploring states outside a certain volume with respect to the target NC-M, based on image coordinates; and ${r}_{3}$ informs the agent about its instantaneous progress and helps speed up learning [55].

Furthermore, the shaping component explicitly distinguishes the importance of minimizing the relative image position between the current and target RoI centers, the absolute position of the camera gimbal, the ratio between the current and target RoI areas and the error velocities (each variable is weighted by a different coefficient, ${g}_{1}$, ${g}_{2}$, ${g}_{3}$ and ${g}_{4}$, respectively). In this way, the agent is first encouraged to coarsely learn to minimize the position of the current RoI with respect to the target RoI, which retains the target NC-M within the camera Field of View (FOV). Subsequently, the agent is encouraged to keep the camera gimbal in a centered position (angles close to zero) and, finally, to keep a certain distance with respect to the target multirotor (the RoI area is directly related to the distance) and to decrease image velocities. In particular, we found that the incorporation of the ${\mathsf{\Theta}}_{x}$ and ${\mathsf{\Theta}}_{y}$ components, weighted by the ${g}_{2}$ coefficient, is determinant for learning convergence. This component of the reward function encourages the agent to keep the camera gimbal centered during the execution of the maneuver, diminishing the uncertainty of the solution space and making the desired behavior more explicit; that is, centering the camera gimbal provides the RL-M with more reaction time in case of sudden movements by the NC-M. Finally, the reward was not normalized. Although ${\mathrm{p}}_{center}\left[t\right]$, ${\mathrm{p}}_{gimbal}\left[t\right]$, ${\mathrm{p}}_{area}\left[t\right]$ and ${\mathrm{p}}_{velocity}\left[t\right]$ far exceeded unity during training, the final value of reward ${r}_{3}$, which informs the agent about its instantaneous progress, did not: in this application and reward-function design, ${r}_{3}$ remains on the order of one. Taking this and the algorithm involved into consideration, reward normalization was not required.
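The weighted-shaping pattern described above can be sketched as follows. The gains are the experimentally defined constants from the text, but the quadratic form of each penalty term is a placeholder assumption of ours; the paper's exact ${\mathrm{p}}_{center}$, ${\mathrm{p}}_{gimbal}$, ${\mathrm{p}}_{area}$ and ${\mathrm{p}}_{velocity}$ are not reproduced here.

```python
import numpy as np

G1, G2, G3, G4 = 100.0, 65.0, 50.0, 30.0  # experimentally defined gains (from the text)

def shaping_term(e_center, gimbal, e_area, e_vel):
    """Weighted shaping penalty following the described priority ordering.

    The quadratic penalties are an illustrative assumption; the ordering of
    the gains (G1 > G2 > G3 > G4) encodes the scheduled priorities: RoI
    centering first, then gimbal centering, then distance, then smoothness.
    """
    p_center = np.sum(np.square(e_center))  # RoI center error (highest priority)
    p_gimbal = np.sum(np.square(gimbal))    # gimbal decentering
    p_area = e_area ** 2                    # RoI area (distance) error
    p_velocity = np.sum(np.square(e_vel))   # image error velocities
    return -(G1 * p_center + G2 * p_gimbal + G3 * p_area + G4 * p_velocity)

# Perfectly centered, centered gimbal, correct distance, no motion -> zero penalty
best = shaping_term(np.zeros(2), np.zeros(2), 0.0, np.zeros(2))
```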

#### 4.3.2. System and Network Architecture

A versatile system architecture was designed and implemented, with the standardization of the virtual and real contexts as the main consideration. In this scenario, most of the component interfaces are shared between simulated and real-flight experiments. A global overview of the system architecture is depicted in Figure 3. On the agent side, the PPO component, which uses the actor–critic network as its motion-policy representation model, was wrapped with a ROS [56] interface. This design not only increased the similarity between the virtual and real contexts but also reduced the friction of interacting with robotic components, since ROS is the most common middleware in the robotics ecosystem. Conversely, the implemented environment interface is in charge of parsing the raw information (target NC-M RoIs and angular states of the camera gimbal) in order to adapt it to the reinforcement-learning formulation (refer to Section 4.3.1). It is also in charge of interacting through SDK commands with the hardware interface, which can be real or simulated thanks to its standard implementation. The RoI of the target NC-M was generated from the projection of the multirotor ground-truth 3D points and the intrinsic camera parameters in the simulated environments, and from the multirotor detector, described in Section 4.1, in real-flight experiments. Hence, in order to send state ${s}_{t}$ to the PPO agent, the environment interface performs all the required processing based on the current and target RoIs.

The actor–critic neural network is a feed-forward neural network with two hidden layers of 256 units each. The activation function of each unit in the hidden and output layers is a hyperbolic tangent (tanh). The input- and output-layer dimensions of the actor–critic network are based on the state, action and value dimensions (8 and 5 + 1 units, respectively; see Figure 3). The activation function of the output layer is bounded to the range [−1, 1], except for the value output, which is provided by a linear unit so as to produce an estimate of the V-function, used to compute the advantage function [53]. It should be noted that other numbers of hidden layers and/or hidden units were tested. Nevertheless, the stated neural-network structure is the minimum size, in terms of hidden layers and units per layer, that allows for learning stability under the conditions described in this work. For instance, network models with two hidden layers of 128 units did not provide proper results.
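The described architecture can be sketched as a plain NumPy forward pass. Weights are randomly initialized here purely for illustration (training via PPO is omitted): two tanh hidden layers of 256 units, a tanh-bounded 5-unit policy head and a linear 1-unit value head.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    # Small random initialization; actual training is out of scope here.
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

W1, b1 = layer(8, 256)    # state (8 units) -> hidden layer 1
W2, b2 = layer(256, 256)  # hidden layer 1 -> hidden layer 2
Wp, bp = layer(256, 5)    # policy head: 5 action units, tanh-bounded
Wv, bv = layer(256, 1)    # value head: linear V-function estimate

def actor_critic(state):
    h1 = np.tanh(state @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    action = np.tanh(h2 @ Wp + bp)  # bounded to [-1, 1], matching A
    value = (h2 @ Wv + bv)[0]       # unbounded scalar V-estimate
    return action, value

action, value = actor_critic(0.5 * np.ones(8))
```

The shared trunk with separate policy and value heads is one common actor–critic layout; the source only specifies the layer sizes and activations, so the head-sharing choice here is an assumption.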