Towards Realizing Intelligent Coordinated Controllers for Multi-USV Systems Using Abstract Training Environments

Unmanned autonomous vehicles for various civilian and military applications have become a particularly interesting research area. Despite their many potential applications, a related technological challenge is realizing realistic coordinated autonomous control and decision making in complex and multi-agent environments. Machine learning approaches have been largely employed in simplified simulations to acquire intelligent control systems in multi-agent settings. However, the complexity of the physical environment, unrealistic assumptions, and lack of abstract physical environments derail the process of transition from simulation to real systems. This work presents a modular framework for automated data acquisition, training, and the evaluation of multiple unmanned surface vehicle controllers that facilitates prior knowledge integration and human-guided learning in a closed loop. To realize this, we first present a digital maritime environment of multiple unmanned surface vehicles that abstracts the real-world dynamics in our application domain. Then, a behavior-driven artificial immune-inspired fuzzy classifier systems approach that is capable of optimizing agents' behaviors and action selection in a multi-agent environment is presented. Evaluation scenarios of different combat missions are presented to demonstrate the performance of the system. Simulation results show that the resulting controllers can achieve an average winning rate between 52% and 98% in all test cases, indicating the effectiveness of the proposed approach and its feasibility in realizing adaptive controllers for efficient cooperative decision making in multiple unmanned systems. We believe that this system can facilitate the simulation, data acquisition, training, and evaluation of practical cooperative unmanned vehicle controllers in a closed loop.


Introduction
Intelligent decision-making and the coordinated control of multiple unmanned systems such as unmanned surface vehicles (USVs), unmanned aerial vehicles (UAVs), unmanned ground vehicles (UGVs), and unmanned underwater vehicles (UUVs) have become an intense research area because of their high performance, efficiency and extensive application potential. With the increasing development and application of unmanned systems, intelligent decision-making and the coordinated behavior of these systems are receiving more attention [1]. For unmanned systems to reach a maximum level of autonomy, they must be able to make decisions in unpredictable situations, taking into consideration their performance constraints and the other vehicles acting in the environment. However, an a priori specification of optimal and robust coordination is difficult due to the complexity and dynamics of their operating environments and missions. The situations in the operating environment of these systems can change rapidly. Hence, it is imperative for these systems to adapt their decision-making strategy to accommodate changes in the environment. One way to achieve robust coordinated behaviors and dynamism for these systems on their various missions is training and learning in virtual environments dovetailed with machine learning algorithms. By integrating machine learning in virtual environments with human-guided learning and guided interventions, multi-agent systems can be trained on mission-specific scenarios that closely resemble the physical environment of the unmanned systems. Moreover, data that would be valuable for training and evaluating these systems with state-of-the-art methods such as deep reinforcement learning are scarce, and experimental data are difficult to obtain in a real environment because training and testing with the real systems can be costly and sometimes dangerous.
In order to obtain large chunks of operational data and optimized decision making, abstract (virtual) environments are powerful tools that can facilitate this process by allowing machine learning algorithms to manipulate parameters, store data and visualize results. They can provide cost-effective and risk-free training and testing mechanisms while advancing research and development with humans in the loop. By using abstract environments with machine learning approaches, agents can be built as physical controllers and trained in settings and scenarios that would have been either too costly or too difficult to replicate in the field. Specifically, this study envisions a multi-USV training system for realizing autonomous coordinated control in a variety of maritime applications such as escort missions, search and destroy, maritime patrol missions, etc., as the rapid growth of maritime activities extends multi-USV research for applications in civilian [2][3][4][5] and military services [6][7][8]. USVs are marine vessels capable of performing various marine operations with no crew on-board in a variety of complex and dynamic ocean environments. In comparison with other unmanned systems, the USV offers a significant number of advantages such as resources localization [9], the use of traditional communication capabilities [10], as well as payload and energy capacities [11].
In the literature, several attempts have been made towards the development of multi-USV systems for various maritime missions. Advances in areas such as the use of statistical or machine learning techniques to derive knowledge from data, as well as qualitative logic-based approaches [12], can facilitate the realization of intelligent or robust behavior in unmanned systems, especially in multi-agent missions. Indeed, recent design approaches for facilitating the execution of autonomous unmanned systems missions employ simulations for modeling, and qualitative logic-based approaches as well as machine learning techniques such as reinforcement learning and bio-inspired approaches for control systems design. For instance, intelligent decision-making for multiple unmanned vehicles using genetic fuzzy trees is presented by the authors in [13]. The proposed system, as demonstrated in combat scenarios, was capable of obtaining strategies that are robust, aggressive and responsive against opponents. A localization framework for underwater robotic swarms that dynamically fuses multiple position estimates of an autonomous underwater vehicle using a fuzzy decision support system is presented in [14]. The authors in [15] used deep Q-neural networks to obtain combat strategies in an attack-defense pursuit-warfare of multiple unmanned systems in a simplified environment. A heuristic planning approach for guarding a valuable asset by a team of USVs operating in a continuous state-action space is presented in [16]. By evolving planning decision trees, the authors succeeded in automatically generating decision trees expressing a blocking policy for the USVs. In contrast, our approach allows the actual organization and optimization of behaviors peculiar to the mission. These behaviors are provided by the designer and fine-tuned using immune network dynamics and clonal selection. On the other hand, refs. [16,17] employ deep reinforcement learning for path planning and the formation of USVs.
In relation to one of the scenarios being considered in this work, ref. [18] employs deep reinforcement learning in order to train a team of escorts to maintain payload safety while navigating alongside the payload. We extend this case to the complex maritime environment with increased complexity in the input and output space. While minimizing assumptions and prior knowledge may also result in more robust controllers, the complexity of the learning process is greatly reduced when prior knowledge is utilized in the complex multi-agent problems peculiar to multi-USV missions. The authors in [19] present a multi-agent based intelligent training system for USVs in which genetic fuzzy trees [13] are used to realize controllers for the multi-USV system. The fuzzy tree is a tree-based algorithm with branches, where each branch handles a sub-task of the control problem and employs genetic algorithms for optimization. In contrast, in this work, behaviors and sub-tasks are independently defined while immune dynamics are employed for behavior activation and action selection.
On the other hand, ref. [20] presents an overview and comparative study of free simulation software for mobile robots and concluded that simulators supporting USV operations as compared to other platforms are lacking. This claim was validated in a more recent work in [21] where the authors evaluated several realistic simulators and presented a simulation environment integrated with robotic middleware which models the forces that act on a USV in a disaster scenario. The authors in [22,23] present a simulation strategy and experimental design for developing and testing controllers for UAVs and USVs coordination with the aim of significantly reducing development and delivery times by providing an off-the-shelf simulation environment and a step-by-step implementation guideline. A virtual RobotX simulation capable of approximating the behavior of USVs operating in complex ocean environments is presented in [24]. Moreover, the authors in [25] developed a platform to model and visualize the behavior of marine vehicles in three-dimensional space for surface and subsurface applications.
To contribute to the available literature and provide simulation support for the emerging domain of autonomous combat USVs, this work presents a modular framework for the automated training, simulation and evaluation of multiple unmanned surface vehicle controllers that facilitates prior knowledge integration and human-guided learning through designer-provided behaviors in a closed loop. To realize this, we first present a digital maritime environment of multiple unmanned surface vehicles that abstracts the real-world dynamics in our application domain. Then, a behavior-driven artificial immune-inspired fuzzy classifier systems approach that is capable of optimizing agents' behaviors and action selection in a multi-agent environment is presented. More specifically, we modeled the training system as an artificial immune system with agents as organs and behavior models as immune cell containers of fuzzy learning classifier systems whose classifiers were modeled as B-Cells of the artificial immune system.
A learning classifier system (LCS) is a machine learning approach that evolves a group of if-then rules by employing evolutionary machine learning to solve practical learning problems, and it is general enough for a wide range of tasks [26][27][28][29][30]. In a fuzzy learning classifier system, which is an extension of the LCS, classifiers are modeled as fuzzy rules and are applied to realize the tactical behavior [31] of robotic systems. The LCS [32,33] concept in general has inspired a multitude of implementations adapted to the different problem domains to which it has been applied. In addition, various control problems have benefited from fuzzy if-then rules [31,34], which have the advantage of easy comprehension compared to 'black-box' methods such as deep reinforcement learning (DRL), dynamic programming and policy functions. Unlike research in DRL methods, which focuses on the training of networks in various problem settings, a fuzzy LCS framework in general trains classifiers (rules).
The artificial immune system, in turn, is a typical multi-agent and decentralized information processing system, capable of learning and remembering, which was inspired by the working mechanisms exhibited by the biological immune system [35,36]. The dynamics exhibited by the biological immune system have inspired various theories and models which represent the different aspects proposed under the artificial immune system, such as the immune network [37], clonal selection [38], and negative selection [39], and several applications have been demonstrated based on these theories [40].
The immune network theory is a critical theory of the artificial immune system which exhibits characteristics such as learning and memorizing in the immune system. The immune network theory proposed by Jerne [37] suggests that the immune system is capable of achieving immunological memory through the presence of a mutually reinforcing network of B-Cells, produced by the interaction mechanism between the network cells. The interaction of cells happens regardless of the presence of harmful foreign agents. Jerne's theory stipulates that the antibody of an immune cell's epitope is recognized by a set of different antibodies (paratopes) with various levels of precision. The idiotope of one antibody can be recognized by the paratope of another antibody with or without the presence of an antigen that possesses an epitope (analogous to an idiotope). This recognition and interaction results in a network that is dynamic and leads to stimulation and suppression. The recognized antibody is suppressed while the recognizer antibody is stimulated. In the robotics domain, ref. [36] proposed a computational model of Jerne's idiotypic network theory which has been notable as a means of inducing adaptive behavior mediation and has demonstrated some encouraging results. In these idiotypic networks, competence modules (antibodies) are linked not only to environmental stimuli (antigens) but also to each other, which leads to the formation of a dynamic chain of suppression and stimulation that affects their concentration levels globally.
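To make the stimulation and suppression dynamics concrete, the following is a minimal sketch of one update step in a simplified idiotypic network. It is only an illustration in the spirit of such models, not the formulation of [36] or of this work: the function name, the recognition matrix `recog`, and the coefficients `alpha`, `beta`, `decay` and `dt` are all hypothetical.

```python
def idiotypic_step(conc, recog, antigen_match,
                   alpha=1.0, beta=1.0, decay=0.1, dt=0.1):
    """One Euler step of a simplified idiotypic-network update.

    conc[i]          : current concentration of antibody i, in [0, 1]
    recog[i][j]      : affinity with which antibody i recognizes antibody j
                       (assumed values; the recognizer i is stimulated,
                       the recognized j is suppressed)
    antigen_match[i] : how well antibody i matches the current antigen
    """
    n = len(conc)
    new_conc = []
    for i in range(n):
        # Stimulation: antibody i recognizes other antibodies.
        stimulation = sum(recog[i][j] * conc[j] for j in range(n)) / n
        # Suppression: antibody i is recognized by other antibodies.
        suppression = sum(recog[j][i] * conc[j] for j in range(n)) / n
        drive = alpha * (stimulation - suppression) + beta * antigen_match[i] - decay
        updated = conc[i] + dt * drive * conc[i]
        # Keep concentrations in [0, 1] (a simplifying assumption).
        new_conc.append(min(1.0, max(0.0, updated)))
    return new_conc


# Antibody 0 recognizes antibody 1, so 0 should grow and 1 should shrink.
recog = [[0.0, 1.0],
         [0.0, 0.0]]
result = idiotypic_step([0.5, 0.5], recog, [0.0, 0.0], decay=0.0)
```

In a behavior-selection setting, the antibody with the highest concentration after the update would typically be the behavior chosen for execution.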
Negative selection, in contrast, abstracts an aspect of the immunological mechanism of organisms that deals with self/non-self classification. This process of negative selection of B-Cells in the biological immune system involves the destruction of B-Cells that react against the 'self' and the promotion of B-Cells that attack only foreign agents. This is the underlying principle of the negative selection algorithms and their modifications [39].
Clonal selection, on the other hand, gives the immune system the ability to adapt B-Cells to new types of antigens. This adaptation is proportionate to the degree of matching between B-Cells and antigens. Hence, a stronger match causes a B-Cell to be cloned more times than a weaker match. Cloned B-Cells undergo mutation from the originals at a rate inversely proportional to the match strength. This mechanism is the inspiration behind the artificial clonal algorithms [41] and their applications in different tasks and domains [42].
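The cloning and mutation rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of [41]: cells are real-valued vectors, `affinity` is a user-supplied function returning values in [0, 1], the clone count is proportional to affinity, and the mutation scale simply decreases with affinity as `max_mutation * (1 - affinity)`. All names and default values are hypothetical.

```python
import random


def clonal_selection_step(population, affinity, max_clones=10,
                          max_mutation=0.5, rng=None):
    """One generation of a simple clonal selection scheme.

    population : list of cells, each a list of floats
    affinity   : function mapping a cell to a match strength in [0, 1]
    """
    rng = rng or random.Random(0)
    offspring = []
    for cell in population:
        a = affinity(cell)
        # Stronger matches are cloned more often.
        n_clones = max(1, round(max_clones * a))
        # Stronger matches are mutated less (assumed linear decrease).
        sigma = max_mutation * (1.0 - a)
        for _ in range(n_clones):
            offspring.append([gene + rng.gauss(0.0, sigma) for gene in cell])
    # Keep the best cells among parents and clones (elitist selection).
    pool = population + offspring
    pool.sort(key=affinity, reverse=True)
    return pool[:len(population)]


# Usage: cells adapt towards a hypothetical antigen at (1, -1).
antigen = [1.0, -1.0]


def match(cell):
    dist = sum((g - t) ** 2 for g, t in zip(cell, antigen)) ** 0.5
    return 1.0 / (1.0 + dist)


cells = [[0.0, 0.0], [2.0, 2.0], [0.5, -0.5]]
best_before = max(match(c) for c in cells)
new_cells = clonal_selection_step(cells, match, rng=random.Random(1))
```

Because parents compete with their clones, the best affinity in the population never decreases from one generation to the next in this sketch.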
Multi-agent coordinated control shares characteristics with the biological immune system (BIS), in that there is a need for coordination and the adaptive control of agents' behaviors in a dynamic environment. In our approach, we combined immune-based methods with fuzzy classifier systems to find an appropriate amount of suppression and stimulation of behaviors in the architecture, in addition to learning the internal mechanism of each behavior so that behaviors are adaptive to the agent's environment. The contributions of this work are as follows:
• A generic framework for the design of autonomous unmanned systems training systems that supports operational data collection in a closed loop;
• A realistic abstract digital maritime environment for interactive multi-USV systems that can be used for multi-agent reinforcement learning;
• A behavior-driven immunized fuzzy classifier system approach for multi-USV coordinated intelligent control and decision-making;
• A demonstration of the feasibility of our approach in realizing improved decision-making in multi-USV missions.
The rest of this work is organized as follows. Section 2 presents the background and motivation of this work. Section 3 presents the approach and system architecture of the training system. Section 4 presents the details of the digital maritime environment design and modeling. Section 5 details the training and learning approach and the experimental evaluation scenarios and results are presented in Section 6. Section 7 concludes this work.

Background and Motivation
The increasing number of maritime assets and infrastructures, ocean exploration and the military's need to operate in littoral and asymmetric warfare situations are all factors influencing USV research and development. This work is part of a larger project aimed at designing artificial general intelligence control systems for coordinating unmanned systems in multiple and complex missions. In particular, this paper focuses on tasks that require multiple USVs to coordinate in guarding, selecting and intercepting hostile threats, performing intelligent combat maneuvers, conducting surveillance and taking countermeasures in the presence of several environmental disturbances. This is necessitated by the widening range of possible applications and events in the maritime domain. For instance, in January 2017, for the first time an unmanned surface vehicle was operated from a distance in a real warfare environment, with its full operational capability being demonstrated when a vessel of the Saudi navy was damaged in an attack carried out by the Houthis using an unmanned suicide vehicle [7]. Prior to this, in 2000, a similar event occurred when the guided-missile destroyer USS Cole was the target of a terrorist attack while refueling [8]. This attack was executed by a small fiberglass boat carrying C4 explosives. These two events could have been avoided by taking advantage of autonomous USVs. The main question is how a USV employed for such a mission becomes informed of the appropriate behaviors or actions to take in the different situations it will encounter. This process is called training.
Leveraging the potential and practicality of learning classifier systems and artificial immune algorithms, a hybrid approach for multi-USV control learning is discussed and its applicability to realizing the coordinated control and decision making of a multi-USV system is presented. Ultimately, a modular multi-USV training framework for the automated training, simulation and evaluation of multi-USV systems in a closed loop was realized. In particular, the training of USVs is realized by designing behavior-driven fuzzy classifier systems whose working mechanism in this work was inspired by the biological immune system. Using prior behaviors encoded by domain experts, appropriate primitive actions can be learned to realize robust internal behavior mechanisms. This way, we can accelerate the learning while reducing the computational requirements. In order to make the system's results a reference for real-system decision making, a physics-based meta-model of the USVs and the environment is needed to realize a digital maritime environment that abstracts the real process and constraints while conforming to the physical rules governing multi-USV operations.

Multi-USVs Training System Design
In this section, we present the design and approach used to realize the training system. We first introduce the schematic framework of the system and then describe its components.

System Architecture
As shown in Figure 1, this approach presents a layered architecture. The first level from the top is the real or simulated environment of the controlled platforms. The second level, referred to as the USV (platform) abstraction layer, represents the core of the system as it provides the framework for the resource and data management of the vehicles in the environment. The third level is the immunized strategy and decision-making layer, which provides the necessary mechanisms for behavior definition and learning during training. The last level of the architecture is the optimization (behavior learning) layer, which provides learning algorithms for the implementation of the appropriate behavior. The third and fourth layers together provide the training and learning mechanism. This is done by creating, grouping and mapping the behaviors that provide the learning objective of an agent to the individual optimization algorithms. This design approach enables the implementation of different behavior controllers as well as several learning algorithms, depending on the scenario being modeled. It also enables centralized learning and distributed control implementation. Due to the high computational requirements of both machine learning algorithms and simulations, the modular design adopted allows for distributed and parallel processing, since each part of the system is designed as a separate process or subsystem. Different components interface and exchange data with other processes and subsystems through a generic interface of sockets and/or a message passing interface. These design approaches ensure the effective evolution of the data services and control requirements of the different unmanned systems. The proposed framework offers the necessary interfaces for collecting data through a data engine and a decision-making mechanism that interfaces with agents of the unmanned system for receiving observations and sending commands.
During training, the data of relevant variables are fetched and fed to the intelligent agent that interfaces with the learning algorithm for individual behaviors. In order to obtain a large amount of operational data and optimized decision making through data-oriented learning methods, the training system can be used to build, through the data engine, a database of observed cases, the actions taken in response to those cases and the results of those actions. The data engine stores the data as a resource tree which can be transformed into common formats such as Excel spreadsheets. The agent layer provides the abstraction and interface for simulated or real USVs. Thus, to achieve a seamless transition between the physical systems and their abstract simulation, this layer provides a data pipeline for the various sensors and actuators of the controlled platforms through the intermediate resource manager. Using this data pipeline, the important sensor data are extracted through platform-dependent virtual sensors.
In the proposed system, the values extracted from the real or simulated sensor readings from the environment are computed by a predefined set of virtual sensor functions. The world model of the intelligent agent provides environment, detection, relational and velocity virtual sensor functions to process the sensor data in the context of high-level functionalities. The detection virtual sensors return the relative positions and distances of other objects, together with a classification function that determines and classifies objects. The relational virtual sensor returns information about how an agent is situated relative to other objects. The velocity virtual sensor returns the velocities of the objects that remain within the detection range of the model. The environment virtual sensor returns normalized values of the observed ocean wind, currents and waves. Table 1 presents the data models extracted from the environment for agent world modeling.
Table 1. Information extracted from the agent environment as contained in a data model.

Data Item | Description | Examples
Internal state | These data consist of information that is internal to the USV | Rudder angle, radar range, engine power, weapon type, number of ammunition, etc.
External state | This consists of USV external information in relation to the environment | Position coordinates, orientation, speed, heading, etc.
Observation | These data include the external state information of objects detected by USV sensors | Distance, relative heading, relative position, dimension, etc.
Weather | This data model holds the environment data contents of wind, currents and waves data | Wind speed, wind direction, wave height, current speed, etc.
Simulation info | The information and data pertaining to the abstract environment of the USV | Simulation mode, simulation time steps, and configuration data

Figure 2 shows an abstract data model representing the data concerning the state and observation of controlled USVs and environment. The content data model represents the structure that is used to hold the actual value of the data instance. The properties of the content data model are the type and value fields. This data model consists of the state, observation, and weather information data contents. The observation includes information on detected objects computed from the virtual sensors. The state includes the internal state such as the rudder angle, radar range, etc. and the external state such as the vehicle speed, position and heading. The platform (simulation) information consists of the task-specific data, decision times and other relevant information. The weather information model contains information about the wind, water currents and waves.
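As an illustration of how such a data model might be organized, the following sketch mirrors the five categories of Table 1 and the type and value fields of the content data model. The class names, field names and the `to_rows` helper are hypothetical and are not taken from the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Content:
    """Holds the actual value of one data instance (type and value fields)."""
    type: str
    value: Any


@dataclass
class UsvDataModel:
    """Hypothetical container mirroring the categories of Table 1."""
    internal_state: Dict[str, Content] = field(default_factory=dict)
    external_state: Dict[str, Content] = field(default_factory=dict)
    observation: Dict[str, Content] = field(default_factory=dict)
    weather: Dict[str, Content] = field(default_factory=dict)
    simulation_info: Dict[str, Content] = field(default_factory=dict)

    def to_rows(self) -> List[Tuple[str, str, str, Any]]:
        """Flatten the model into (category, name, type, value) rows,
        for example for export to a spreadsheet."""
        rows = []
        for category, items in vars(self).items():
            for name, content in items.items():
                rows.append((category, name, content.type, content.value))
        return rows


model = UsvDataModel()
model.internal_state["rudder_angle"] = Content("float", 0.12)
model.weather["wind_speed"] = Content("float", 6.5)
rows = model.to_rows()
```

Flattening to rows is one simple way to move from a tree-like resource structure to a tabular format.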

Multi-USV Interactive Environment Design
The usefulness of the trained controllers is directly related to the accuracy of the models that are used. As such, significant efforts are required to ensure that the models used for vehicle dynamics, sensors, and the environment are realistic. Despite the availability of third-party open source resources such as Webots [43], Gazebo [44] and UWSim [45] that can provide the USV physical engine, environment, and sensor modeling, these resources are limited in some respects, such as combat vehicle modeling, and are in some cases specific to one or two types of vehicles. For instance, Gazebo focuses on UGVs and UUVs, with few works introducing new features designed for USVs. In UWSim, the wind simulation appears not to affect the vehicle movement and its focus is clearly on UUVs. Hence, in this section, we present the details of the digital maritime multi-USV environment design of the training system, which provides straightforward, realistic behavior modeling of combat-capable USVs.

USV Physical Engine
The USV abstract model presented accounts for wind and wave-induced currents to make it applicable to a wider range of sea conditions. Figure 3 shows USV motions in six degrees of freedom (6-DOF). Considering the USV movement in 6-DOF, the basic USV model used is based on Fossen's 6-DOF model for marine vehicles [46]. This model expresses the resulting movement of the USV as the combined effect of five main forces as shown in Equation (1), where $\tau^{f}_{hs}$ is the hydrostatic force, $\tau^{f}_{hyd}$ is the hydrodynamic force, $\tau$ is the control and propulsion force, and $\tau_{wind}$ and $\tau_{waves}$ are the wind and wave forces, respectively. The kinematic and kinetic model with wind and wave disturbances is defined in Equation (2). The complete kinematic and kinetic model including the perturbations due to marine currents can be found in [47]. The USV state model integrates the state of a physical dynamical model with the parameters of the corresponding constructed geometric model of the USVs, given external forces and torques. Currently, the physical properties are only available to describe the ranges and activation of the available sensors of a USV entity. In order to compute the drag and lift forces of the USVs, the apparent velocity, that is, the difference between the velocity of the model and that of the ocean current, is used together with the lift and drag coefficients, as described in [48].

Modeling Buoyancy
To represent the buoyancy effects on the vehicles in the environment, the Archimedes principle as shown in Equation (3) is used to compute the buoyant force:

$$F_b = \rho g v \qquad (3)$$

where g is the gravitational acceleration constant, ρ is the known density of the water and v is the volume of the water displaced. Because each USV entity can be approximated with basic geometries, the volume of the displaced water is computed at every time step of the simulation using the submerged region of the basic geometries of the vehicle. The buoyancy effects on the USVs take into account the height of the water due to waves. This is particularly important for combat USVs since the orientation of the USV can affect the required elevation of a targeting weapon. The buoyancy is therefore assessed from the buoyant force and the gravitational force (computed from the mass of the vehicle). Moreover, to improve the buoyancy effect on the USV, each USV is represented by six links joined together, and the gravity and buoyancy forces are applied to each link's center.
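The per-link buoyancy computation can be sketched as follows. This is a minimal illustration, assuming each link is an axis-aligned box and that a function `water_height_at(x, y)` returns the local wave height; the dictionary keys, the function names and the default water density are hypothetical and do not describe the actual simulator.

```python
RHO_WATER = 1025.0  # assumed sea water density in kg/m^3
G = 9.81            # gravitational acceleration in m/s^2


def submerged_volume(link, water_height):
    """Volume of a box-shaped link that lies below the water surface."""
    bottom = link["z"] - link["height"] / 2.0
    depth = min(max(water_height - bottom, 0.0), link["height"])
    return link["length"] * link["width"] * depth


def buoyancy_forces(links, water_height_at, rho=RHO_WATER, g=G):
    """Upward buoyant force (in newtons) at each link's center,
    using Archimedes' principle F_b = rho * g * v."""
    return [
        rho * g * submerged_volume(link, water_height_at(link["x"], link["y"]))
        for link in links
    ]


# Usage: a single 1 m cube centered on a flat water surface (z = 0),
# so exactly half of it is submerged.
cube = {"x": 0.0, "y": 0.0, "z": 0.0,
        "length": 1.0, "width": 1.0, "height": 1.0}
half = buoyancy_forces([cube], lambda x, y: 0.0, rho=1000.0)
full = buoyancy_forces([cube], lambda x, y: 10.0, rho=1000.0)
dry = buoyancy_forces([cube], lambda x, y: -5.0, rho=1000.0)
```

Evaluating the water height separately at each link is what lets waves tilt the hull, since links at the bow and stern can displace different volumes.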

Actuator and Sensor Modelling
Marine radar is an important environmental perception sensor for USVs. Considering the problems of noise, jamming, and target loss in marine radar images, as well as the high speed of the USV and the requirement of realistic representation, radar and LiDAR beams are simulated using a ray casting and ray-geometry collision detection approach, which we implemented through the ODE physics engine. The cast rays can be affected or blocked by objects, which makes the detection performance realistic. By comparing the time at which a ray left the radar/scanner with the time at which each return is received, the range measurement can be computed. The range distance can be calculated as

$$d = \frac{c\,t}{2} \qquad (4)$$

where c is the speed of light and t is the difference in time between transmitting and receiving a pulse. To simulate the LiDAR, in each simulation step, the range of α degrees is sampled N times to generate different rays using Equation (5). Every time a ray hits an object, the physics engine reports the collision and returns the range distance of the collided object. The detected points are then computed using this information and the current position of the model.
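The two computations described above, the time-of-flight range and the sampling of ray directions, can be sketched as follows. The evenly spaced sampling over the field of view is an assumption made for illustration and may differ from the sampling rule of Equation (5); the function names are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def range_from_time_of_flight(t):
    """Range to a target from the round-trip time t of a pulse (seconds).
    The pulse travels to the target and back, hence the division by 2."""
    return SPEED_OF_LIGHT * t / 2.0


def ray_angles(fov_deg, n_rays):
    """Angles (radians, relative to the sensor heading) of n_rays rays
    spread evenly over a field of view of fov_deg degrees (assumed rule)."""
    if n_rays == 1:
        return [0.0]
    half = math.radians(fov_deg) / 2.0
    step = 2.0 * half / (n_rays - 1)
    return [-half + i * step for i in range(n_rays)]


def detected_point(sensor_x, sensor_y, heading, angle, distance):
    """World position of a hit, given the sensor pose and the ray range."""
    theta = heading + angle
    return (sensor_x + distance * math.cos(theta),
            sensor_y + distance * math.sin(theta))


distance = range_from_time_of_flight(2e-6)
angles = ray_angles(90.0, 3)
point = detected_point(1.0, 2.0, 0.0, 0.0, 10.0)
```

In a physics engine, the range would normally come directly from the ray-geometry collision query; the time-of-flight form is shown only to connect it to the range equation.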
An inertial measurement unit (IMU) and a global positioning system (GPS) receiver for positioning and attitude information were modeled using a Gaussian model to account for sensor noise.
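A Gaussian sensor noise model of this kind can be sketched in a few lines. The function name, the optional constant bias and the noise levels used in the example are hypothetical and are not the values used in the simulator.

```python
import random


def noisy_measurement(true_value, sigma, rng, bias=0.0):
    """Return a sensor reading: the true value plus an optional constant
    bias and zero-mean Gaussian noise with standard deviation sigma."""
    return true_value + bias + rng.gauss(0.0, sigma)


# Usage: simulate GPS easting readings around a true position of 100 m
# with an assumed standard deviation of 1 m.
rng = random.Random(42)
readings = [noisy_measurement(100.0, 1.0, rng) for _ in range(10000)]
mean_reading = sum(readings) / len(readings)
```

Passing the random generator explicitly keeps simulation runs reproducible, which is useful when comparing controllers on identical sensor noise.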
Moreover, to be able to represent combat USVs, a gun component was modeled in our simulated environment. The control parameters of the weapon are the elevation and angle in addition to the fire actuator. In order to produce a more realistic weapon performance, the impact velocity of a fired round at the target location can be estimated by taking into consideration the aerodynamic drag on the round. For example, the aerodynamic drag force in one dimension on the fired round can be estimated using Equation (6):

$$F_d = \frac{1}{2} \rho_a \upsilon^2 C_{dr} A \qquad (6)$$

where υ is the speed of the ammunition, A is the defined area of the ammunition, $\rho_a$ is the mean density of the air and $C_{dr}$ is the aerodynamic drag coefficient. Based on Equation (6), the impact velocity of the ammunition can be estimated, from which we can estimate the damage to the target hit. The damage of the ammunition is simplified by Equation (7), where d is the distance between the point where a bullet was fired and the hit target, υ is the impact velocity, and κ and η are the wind and scalar constants. Hence, the ammunition is also simulated as constrained by the physical properties of the weapon type and is affected by the environmental conditions.
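The following sketch shows how an impact velocity could be estimated from quadratic drag in one dimension. It assumes that drag is the only force along the line of fire (gravity is ignored) and that the projectile mass is known; under these assumptions, m υ dυ/dx = -k υ² with k = ½ ρ_a C_dr A, which integrates to υ(d) = υ₀ exp(-k d / m). The function names, the mass parameter and the default coefficients are hypothetical.

```python
import math


def drag_force(speed, area, rho_air=1.225, c_dr=0.3):
    """Aerodynamic drag force F_d = 0.5 * rho_a * v^2 * C_dr * A."""
    return 0.5 * rho_air * speed * speed * c_dr * area


def impact_velocity(v0, distance, mass, area, rho_air=1.225, c_dr=0.3):
    """Speed remaining after travelling `distance` metres, assuming drag
    is the only force acting along the line of fire (no gravity)."""
    k = 0.5 * rho_air * c_dr * area
    return v0 * math.exp(-k * distance / mass)


# Usage with assumed values: a 50 g round with a 1 cm^2 frontal area.
force = drag_force(10.0, 0.01, rho_air=1.2, c_dr=0.5)
v_near = impact_velocity(800.0, 100.0, mass=0.05, area=1e-4)
v_far = impact_velocity(800.0, 500.0, mass=0.05, area=1e-4)
```

The damage model of Equation (7) is not sketched here, since its form depends on constants that are specific to the simulator.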

Wave, Wind and Current Modeling
To provide realistic waves, we adopted the Gerstner swell wave model, which is commonly used in computer graphics [49,50] and simulates trochoids. In order to take into account the combination of many different waves for different training instances, we generated wave trains for a specific wave spectrum inspired by [51] using randomly generated wave parameters within constraints. Using a different set of parameters, we continuously updated the waves and all trains. This representation was used to affect the physical motion of the USVs at the ocean surface.
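A Gerstner wave train can be sketched as a sum of components, each displacing a surface point horizontally and vertically. This illustration assumes deep-water dispersion (ω² = g k) and samples wave parameters uniformly within hypothetical bounds; it does not reproduce the wave spectrum of [51], and all names and ranges are assumptions.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2


def gerstner_displacement(x0, y0, t, waves):
    """Position (x, y, z) at time t of the surface point that rests at
    (x0, y0, 0), for a sum of Gerstner wave components."""
    x, y, z = x0, y0, 0.0
    for w in waves:
        k = 2.0 * math.pi / w["wavelength"]      # wavenumber
        omega = math.sqrt(G * k)                 # deep-water dispersion
        dx, dy = math.cos(w["direction"]), math.sin(w["direction"])
        phase = k * (dx * x0 + dy * y0) - omega * t + w.get("phase", 0.0)
        x -= dx * w["amplitude"] * math.sin(phase)
        y -= dy * w["amplitude"] * math.sin(phase)
        z += w["amplitude"] * math.cos(phase)
    return x, y, z


def random_wave_train(n, rng, amp=(0.1, 0.5), length=(5.0, 40.0)):
    """Draw n wave components with parameters inside assumed bounds."""
    return [{"amplitude": rng.uniform(*amp),
             "wavelength": rng.uniform(*length),
             "direction": rng.uniform(0.0, 2.0 * math.pi),
             "phase": rng.uniform(0.0, 2.0 * math.pi)} for _ in range(n)]


single = [{"amplitude": 1.0, "wavelength": 10.0, "direction": 0.0}]
crest = gerstner_displacement(0.0, 0.0, 0.0, single)
quarter = gerstner_displacement(2.5, 0.0, 0.0, single)
train = random_wave_train(4, random.Random(0))
```

The horizontal displacement is what distinguishes Gerstner waves from a plain sum of sines: points move on circles, producing the sharper crests and flatter troughs of a trochoid.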
To represent the wind disturbance for objects at the sea surface, we used the modeling in [24] to generate the wind direction and wind speed. The forcing influence of the wind direction and the wind speed in the reference plane on the part of the model above the water surface was modeled in terms of $u_{rw}$ and $v_{rw}$, the x and y components of the simulated wind velocity in the vessel body frame. The resulting wind force can then be computed using:

$$X_{wind} = c_x\, u_{rw} |u_{rw}|, \qquad Y_{wind} = c_y\, v_{rw} |v_{rw}|, \qquad N_{wind} = -2.0\, c_n\, u_{rw} v_{rw} \qquad (14)$$

where $c_x$, $c_y$, $c_n$ are the dimensional wind coefficients.
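The wind forcing terms can be sketched as follows. The sway force and yaw moment follow the expressions given above; the surge term is assumed to mirror the sway term, and the projection of the wind into the body frame (wind speed and direction minus the vessel's own surge and sway velocities) is likewise an assumption made for illustration. All function and parameter names are hypothetical.

```python
import math


def relative_wind(wind_speed, wind_dir, heading, surge, sway):
    """Wind velocity components in the vessel body frame, relative to the
    moving vessel (assumed projection; angles in radians)."""
    u_rw = wind_speed * math.cos(wind_dir - heading) - surge
    v_rw = wind_speed * math.sin(wind_dir - heading) - sway
    return u_rw, v_rw


def wind_force(u_rw, v_rw, c_x, c_y, c_n):
    """Surge force X, sway force Y and yaw moment N due to wind."""
    x_wind = c_x * u_rw * abs(u_rw)   # assumed to mirror the sway term
    y_wind = c_y * v_rw * abs(v_rw)
    n_wind = -2.0 * c_n * u_rw * v_rw
    return x_wind, y_wind, n_wind


# Usage with unit coefficients, which are placeholders rather than
# values for a real hull.
u_rw, v_rw = relative_wind(5.0, 0.0, 0.0, surge=1.0, sway=0.0)
forces = wind_force(2.0, -1.0, 1.0, 1.0, 1.0)
```

The `u * abs(u)` form keeps the sign of the force aligned with the wind component while making its magnitude quadratic in wind speed.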
To model the effects of water current, a first-order Gauss-Markov process is used to homogeneously generate the parameters of currents at different nodes of the environment. The forcing terms are then obtained using the current vector $v_c^E$, similarly to the wind forcing terms. Figure 4 shows the trajectory of the USVs with no disturbance (Figure 4a) and when wind and currents are activated (Figure 4b). In addition, the performance of a USV radar is shown at the bottom left corner of Figure 4c. In this figure, we only activated the radar of the non-blue boat to demonstrate the detection range and performance. As can be seen in the radar, only two USVs appear on the target radar, and the closest USV on the right appears more clearly than the one on the left.
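A first-order Gauss-Markov process for the current speed can be sketched with a simple Euler step of $\dot{V}_c + \mu V_c = w$, where $w$ is Gaussian white noise. The decay constant, noise level, time step and saturation limits below are illustrative assumptions; the 0-8 m/s bound merely mirrors the range quoted in the experimental setup.

```python
import random


def gauss_markov_step(value, mu, sigma, dt, rng, lower=0.0, upper=8.0):
    """Advance the current speed by one Euler step and saturate it."""
    noise = rng.gauss(0.0, sigma)
    value += dt * (-mu * value + noise)
    return min(max(value, lower), upper)


def simulate_current(steps, start=2.0, mu=0.05, sigma=0.3, dt=0.1, seed=0):
    """Return a list of current speeds over `steps` time steps."""
    rng = random.Random(seed)
    speeds, value = [], start
    for _ in range(steps):
        value = gauss_markov_step(value, mu, sigma, dt, rng)
        speeds.append(value)
    return speeds
```

Running one such process per environment node, or interpolating between nodes, would give a spatially varying current field.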

Multi-USV Training Algorithm Design
In the previous sections, we described how the digital environment of the training system is abstracted and simulated by describing how the physical environment and entities are modeled. In this section, we present our approach for learning controllers of USVs to realize multi-USV cooperative decision making.
The approach adopted in this work was inspired by the working mechanism of the biological immune system, which can be regarded as a typical distributed multi-agent system. In this approach, agents are designed as physical controllers of vehicles that own a set of behavior models (T-Cell models) encapsulating fuzzy classifier units (FCUs), which output their control decisions (antibodies). Each FCU possesses a detector which matches the attributes (antigens) of the agent environment obtained through its sensors and internal measurement units. The objective is to learn the primitive behaviors of the agents while coordinating the different behaviors of the agent. The key idea in our approach is to generate, activate and assign the behaviors of B-Cells to agents executing a mission so that the entire multi-agent team can learn an optimal control strategy faster. In this case, the diverse behaviors that constitute the multi-agent mission serve as motivations for the immune agents. That is, the focus of execution within each behavior model is guided by the underlying objective of that particular behavior, with which a local reinforcement program can be defined.
In what follows, we present the detailed algorithmic approach towards realizing autonomous multi-USV control decision making in combat tasks.

Agent Knowledge Modelling and Representation
The agents in the training system are equipped with sensor (radar) and weapon systems for detecting and firing at threats, as shown in Figure 5. An agent control system only requires the important sensor information abstracted from the raw sensor data. The extracted information is then formatted and sent to the decision-making layer. The detection radar of each agent has a circular detection region with a radius $r$ which determines its range. This radar is partitioned into virtual sensors of the same range by assigning a reach to each virtual sensor. These virtual sensors return the relative positions and the normalized distances of other objects. The relational virtual sensor returns information about how an agent is situated relative to other objects, while the velocity sensor returns the velocity of objects within the detection range of the agent.
In our approach, the agent is analogous to an organ in the body and owns a set of T-Cells, each of which encapsulates a behavior or decision model that generates antibodies in response to stimulation from the environment. T-Cells perform fuzzy matching between classifiers and environment attributes. A fuzzy classifier is modeled as a B-Cell which consists of several parts: an ID, a condition (attributes of T-Cell receptors), a consequent (specific antibody or antibodies) and a connection (idiotope) part. A set of B-Cells constitutes a strategy. T-Cell models activate B-Cells, which together result in an immune network of B-Cells. In the biological immune system (BIS), T-Cell receptors can only recognize antigens that are bound to certain receptor molecules, and undergo a process called rearrangement, causing the recombination of a gene that expresses T-Cell receptors. The process of rearrangement allows for a lot of binding diversity. This process is likened to the generation of the rule base of the FCU inputs. During training, T-Cells activate B-Cells, which are optimized to obtain a well-performing strategy. Figure 6 shows an abstract immune network of the T-Cell interaction of an agent. Each T-Cell is associated with a decision-making model within the overall mission and is identified by a unique identifier, receptors, specific antibodies and a local antibody network which forms the knowledge base of a behavior.
The inputs and outputs are defined as fuzzy sets with different degrees of membership and represented as binary strings, similar to the representation used in [52]. The detector receives the real values of the input variables (antigens) and transforms each of these values into a group of fuzzy sets. In this study, we adopted triangular membership functions for defining a range of values for each variable of a behavior model. Figure 7a,b presents example membership functions for the heading and range inputs, together with their encoded string positions or indexes, which are determined by the designer.
For instance, a real value of 100 for the heading can be represented as the binary string 11000. The possible outputs of a T-Cell represent the constituents of antibodies that can be activated by it, while the inputs define the set of pathogens that a B-Cell reacts to. Here, we loosely model the T-Cell as a container of B-Cells whose responsibility is to activate the B-Cells generated from the recombination of genes (attributes of agent tasks) that express T-Cell receptors.
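The detector's fuzzification and binary encoding can be sketched as below. The five heading sets and their breakpoints are placeholders chosen for illustration; the actual membership functions are those of Figure 7 and are set by the designer. A bit is set to 1 when the input has non-zero membership in the corresponding fuzzy set, which is one plausible reading of the encoding.

```python
def triangular(x, left, peak, right):
    """Degree of membership of x in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical heading sets as (left, peak, right) in degrees.
HEADING_SETS = [(-90, 0, 90), (0, 90, 180), (90, 180, 270),
                (180, 270, 360), (270, 360, 450)]


def fuzzify(x, sets):
    """Membership degree of x in every fuzzy set."""
    return [triangular(x, *s) for s in sets]


def encode(x, sets):
    """Binary string with one bit per fuzzy set (1 = non-zero membership)."""
    return "".join("1" if m > 0.0 else "0" for m in fuzzify(x, sets))
```

With these placeholder breakpoints, a heading of 100 activates the second and third sets; the string 11000 quoted in the text follows from the membership functions of Figure 7, which are not reproduced here.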

Decision-Making and Evaluation Mechanism
Behavior units interact with the environment through the agent, its detectors and effectors, whilst the actions of classifiers (B-Cells) are evaluated through a reinforcement program that captures the motivation or objective of the behavior. Classifiers are strengthened or weakened based on the performance of the behavior unit modeled by the T-Cells. During execution, behaviors are activated based on the stimulation from the environment and, for each behavior, a match set of classifiers, which determines the stimulation level of the behavior given the agent's observation, is formed in parallel. Each matched classifier is collected when the agent is exposed to the environment state and its parent behavior is activated to form a network of B-Cells. To determine the affinity between the behavior unit and the environment's antigens, we used the average affinity of the classifiers activated by the behavior unit. The final actions sent to the environment are emitted from the classifiers with a higher concentration after interacting with other classifiers. Classifiers that contributed to the overall fitness obtained during the evaluation of a behavior unit receive a relative reward reflecting their contribution to how well the agent performs the behavior, based on the reinforcement program of the behavior. This means that individuals are evaluated in parallel and evolved according to their experience in the environment.
The algorithmic approach for the agent's decision making of a multi-USV system consists of the following steps, which are executed at each sample time by each agent:

1. At each decision step, the agent activates behaviors according to the environmental stimulation;
2. For every activated behavior, we generate a set of strategies or determine the match set of strategies (B-Cells). When multiple tasks are detected within the behavior, a match set is formed for each task. For example, when a track behavior is activated after a USV detects multiple targets, the behavior model generates a strategy for each target using the attributes of each task (target);
3. Then, the agent establishes connections between B-Cells based on the selected task. Connections between B-Cells are established based on the tasks and behavior under which they are generated or activated;
4. To coordinate with nearby agents, the same is done by the agent with respect to nearby agents and the detected targets. Agents can also communicate with nearby agents within a communication range to obtain the strategy concentration for shared behaviors' activation to select the best strategy with respect to a task and submit to a behavior learner;
5. Apply immune network dynamics to update the concentration of each B-Cell;
6. The final actions of the environment are emitted from the classifiers (B-Cells) with a higher concentration after interacting with other classifiers.
To evaluate an agent's performance of a behavior, classifiers within behavior units are evaluated based on the local objective of the behavior and the global reward using Equation (15), where $R_g$ is the team reward returned by the environment as computed by the global reinforcement program, and $R_l$ is the local reward of a particular behavior, in other words, the estimate of how well an agent performs a particular behavior, which is evaluated by the local behavior objective function or human-provided reinforcement during training. The variables $\alpha_1$ and $\alpha_2$ are discount factors used to discount the local and global objective functions. The local reinforcement program (evaluation functions) allows for reward shaping and the independent learning of behaviors by agents, and facilitates the sharing of learned policies between team members, while the global reinforcement program implements the global objective function that evaluates the team performance on a task or the overall mission performance by the agents. The global reinforcement program in the evaluated cases is designed to capture the overall mission objective using the cumulative rewards received from the environment after an episode ends. In this case, a score was assigned as the strength of each behavior's B-Cells that were triggered during an episode run, as obtained from the reinforcement programs. Based on the immune system mechanisms, stimulation and co-stimulation occur among B-Cells. In our case, B-Cell $i$ is said to stimulate B-Cell $j$ if the strength or fitness of $j$ is higher than that of $i$. This implies that the control action proposed by $j$ leads to a better performance than that of $i$. At the lower level, when a B-Cell is activated during a behavior execution, this will lead to the stimulation of B-Cells under the same conditions but with a different output in the global immune network when the output of the B-Cell leads to a better performance, and vice versa.
A relative affinity between a B-Cell and environment pathogens can be measured using a modified version of Equation (15) that takes into account the number of times the B-Cell was activated, as shown in Equation (16). With the knowledge of a B-Cell's affinity and how to obtain the affinity between B-Cells, the concentration (fitness) $C_a$ of the $a$th B-Cell can be realized using Equation (17) [53], where:
• $N$ is the number of B-Cells that have an inhibitory or stimulating effect on the B-Cell;
• $m_a$ is the affinity between B-Cell $a$ and the current stimuli (antigens);
• $m_{ja}$ is the mutual stimulus coefficient of antibody $j$ on B-Cell $a$.
The concentration level of a B-Cell in this case affects the chances that the action proposed by the classifier is selected as the optimal action for the conditions of the classifier. This means that the objective of the optimization algorithm is to learn the optimal action mapping of each classifier, together with a compact set of classifiers necessary to realize an optimal and adaptive execution of the independent behaviors.
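The evaluation and concentration update can be sketched as follows. Both formulas are assumptions: the reward is taken as a weighted sum of the local and global rewards, and the concentration update is a discrete step of a Farmer-style idiotypic network with a sigmoid squashing, which is a common form in the immune network literature and may differ from Equations (15) and (17). All names and default constants are hypothetical.

```python
import math


def combined_reward(r_local, r_global, alpha1=0.5, alpha2=0.5):
    """Assumed weighted sum of the local and global rewards."""
    return alpha1 * r_local + alpha2 * r_global


def update_concentrations(conc, affinity, stim, rate=0.1, decay=0.05):
    """One network step.

    conc[a]     current concentration of B-Cell a
    affinity[a] affinity m_a between B-Cell a and the current antigens
    stim[j][a]  mutual stimulus coefficient m_ja of B-Cell j on B-Cell a
    """
    n = len(conc)
    updated = []
    for a in range(n):
        stimulation = sum(stim[j][a] * conc[j] for j in range(n) if j != a)
        suppression = sum(stim[a][k] * conc[k] for k in range(n) if k != a)
        net = (stimulation - suppression) / max(n - 1, 1) + affinity[a] - decay
        raw = conc[a] + rate * net * conc[a]
        updated.append(1.0 / (1.0 + math.exp(0.5 - raw)))  # keep in (0, 1)
    return updated


def select_action(conc):
    """Index of the B-Cell with the highest concentration."""
    return max(range(len(conc)), key=conc.__getitem__)
```

In this sketch the B-Cell whose antibody is finally emitted is simply the one with the highest concentration after the update, matching the last decision-making step described earlier.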

Evolution Mechanism: Clonal and Negative Selection
In classifier systems, genetics-based learning approaches use genetic operators to realize solutions for different problems and exist in two forms, i.e., the Michigan-style [54] and Pittsburgh-style [55] approaches. In the Michigan-style approach, an individual is a single rule, while in the Pittsburgh-style approach, one individual is a set of rules. In this work, the Pittsburgh-style approach is used to represent the antibody set as the possible solution or optimal action mapping for the individual classifiers of a behavior. The clonal selection and negative selection mechanisms enable the evolution and learning of optimal internal behavior execution, while the immune network dynamics allow coordination between and within behaviors for the effective realization of controllers. In what follows, we present the process for evolving classifiers.

1. At the beginning of training, T-Cell receptors must undergo the rearrangement process for the recombination of genes that express T-Cell receptors to form the knowledge base of all encoded behaviors;
2. Initialize a non-self database to empty or using prior knowledge, where the designer encodes inconsistent antibody (control action) mappings as the antibody set for individual behaviors (T-Cells);
3. Randomly initialize a population of N antibody sets for each T-Cell by assigning an antibody from the valid antibodies of the respective T-Cell to each B-Cell to form the initial controllers;
4. Compare the currently generated antibodies with those in the non-self database to remove or modify inconsistent antibody sets;
5. Next, the simulator is run with the current generation of antibodies for each behavior N times to test each set in the population. In each run, we apply the decision-making mechanism described above to select the classifiers whose actions are posted to the environment;
6. At the end of each episode, we apply the evaluation mechanism to evaluate the antibody set of each behavior;
7. At the end of each generation of the population of antibodies, the concentration level of classifiers (B-Cells), based on the performance of individual behaviors' entire antibody set, is used to determine the n best antibody sets;
8. Clone and store the classifiers of the elite antibody sets that were triggered and recompose the global cure database with these classifiers. Submit the population of clones to a hyper-mutation scheme by randomly selecting and changing the antibodies of classifiers to form temporary antibody sets to be evaluated next;
9. Add the sets of antibodies that result in the poor performance of the behavior to the non-self database. After the B-Cells undergo mutation by changing their action (antibody) parts, the resulting antibody set is compared with those in the non-self database and modified if the similarity between them is below a predefined threshold;
10. In addition, elite B-Cells found by one agent are cloned by other agents when that agent succeeds in finding more optimal actions for a behavior execution;
11. Repeat Steps 5 to 10 until a termination condition is met.
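The evolution loop above can be sketched in a heavily simplified form as follows. This is an illustration under stated assumptions, not the authors' implementation: an antibody set is a tuple with one action per classifier, `evaluate` stands in for a simulator episode, negative selection uses exact matching instead of a similarity threshold, and the sharing of elites between agents is omitted. All names and default values are hypothetical.

```python
import random


def evolve(n_classifiers, actions, evaluate, pop_size=40, n_elite=4,
           mutation_rate=0.2, generations=20, seed=0):
    """Clonal selection with a non-self database of poor antibody sets."""
    rng = random.Random(seed)
    non_self = set()  # antibody sets known to perform poorly
    population = [tuple(rng.choice(actions) for _ in range(n_classifiers))
                  for _ in range(pop_size)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        # Evaluate every antibody set and rank from best to worst.
        scored = sorted(((evaluate(s), s) for s in population), reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        elites = [s for _, s in scored[:n_elite]]
        # Negative selection: remember the worst sets, never the elites.
        non_self.update(s for _, s in scored[-n_elite:])
        non_self.difference_update(elites)
        # Clonal expansion with hyper-mutation of the action parts.
        population = list(elites)
        while len(population) < pop_size:
            parent = rng.choice(elites)
            clone = tuple(rng.choice(actions) if rng.random() < mutation_rate
                          else gene for gene in parent)
            if clone in non_self:
                clone = parent  # discard a clone already known to be poor
            population.append(clone)
    return best, best_score
```

A toy fitness such as `lambda s: sum(a == "straight" for a in s)` is enough to exercise the loop; in the training system each evaluation would instead correspond to a simulated episode.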

Experiments
In this section, we demonstrate the performance of the training system and approach in the context of training multi-USV systems in two combat tasks. The cases presented are real-world scenarios designed to test the performance generalization and learning speed of the system and approach. Experiments are set up, the USVs are trained in the virtual environment, and the obtained controllers are evaluated. Moreover, the trained controllers are evaluated in different configurations of the underlying scenarios in all cases. The two training scenarios were designed based on different missions of unmanned surface vehicles: a cooperative target escort task and an island conquering task, both in the realistic environment.

General System Setup
The USV behavior configuration consists of several high-level behaviors, modeled as T-Cells, shown in Table 2. Depending on the mission or task to be learned, behaviors peculiar to the task are activated and the execution of these behaviors is tuned and optimized to maximize USV performance on the mission in different situations. Each behavior has receptors that define a minimum of one antigen it responds to, as specified by the designer based on knowledge of the tasks and behavior. Table 3 shows the repository of antigens used during the experiments. Table 4 lists the primitive control actions (antibodies of classifiers) that can be suggested by a classifier (B-Cell). The primitives of steer control and throttle control directly translate into fuzzy sets. For weapon control, the fuzzy sets of the aim angle include aimLeft, aimCenter and aimRight. Radar control involves turning the radar system of the USV on or off. In each experiment, the number of antibody sets in a population is set to 40 and a simulation episode lasts for 3 min. The detection and firing ranges of both teams are set to 300 m and 80 m, respectively, with a maximum speed of 25 m/s during training, while the maximum turning angle of the USV is 30°. The wind and current speeds are set within 0-10 m/s and 0-8 m/s, respectively, with a variable direction across the sea surface. The different configurations used during testing are shown in Table 5.

Table 3. Attributes of the tasks and environment for the different behavior models.
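For readers reproducing the setup, the training parameters quoted above can be gathered into a single configuration object, as in this sketch. The class and field names are hypothetical; only the numeric values come from the text, and the turning-angle unit is assumed to be degrees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    population_size: int = 40          # antibody sets per population
    episode_seconds: float = 180.0     # 3 min per simulation episode
    detection_range_m: float = 300.0
    firing_range_m: float = 80.0
    max_speed_mps: float = 25.0
    max_turn_angle_deg: float = 30.0   # unit assumed to be degrees
    wind_speed_mps: tuple = (0.0, 10.0)
    current_speed_mps: tuple = (0.0, 8.0)

    def validate(self):
        """Raise ValueError on an inconsistent parameter combination."""
        if self.firing_range_m > self.detection_range_m:
            raise ValueError("firing range exceeds detection range")
        for low, high in (self.wind_speed_mps, self.current_speed_mps):
            if low < 0.0 or low > high:
                raise ValueError("invalid speed interval")
```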

Case 1: Multi-USV Target Escort
This scenario is motivated by the practical domain of protecting oil tankers or cargo ships, and by force protection and maritime warfare in general. These assets of tremendous economic value may be the target of terrorist organizations or pirates using small boats packed with remotely or manually controlled explosives or, as in the case of pirates, boarding small boats to carry out attacks. In this case, small and fast USVs with combat capabilities can provide protection in such situations, especially when multiple hostile threats are involved. In the training setup, six USVs (blue team) are required to protect a dynamic target from the hostile boats (red team). Figure 8 shows a typical setup of the scenario during training. The blue team must coordinate their actions so that they create a safety fence around the target while intercepting and neutralizing any incoming threat.
The goal of the team is to ensure the safety of the target and successfully destroy any detected threat (red USVs) by engaging in combat. In this scenario, successful cooperative control enables the blue team to balance their resources between several main tactical behaviors. A dynamic redeployment or formation may maximize the coverage area, and interception and combat tactics may destroy the detected threats. Based on this criterion, the global reinforcement program implements an objective function in which $\beta$ is the damage caused to the opponent team, $\gamma$ is the time taken for the target to be destroyed (normalized between 0 and 1), $\alpha$ is the damage caused by opponents to the team members and $\rho$ is the total resource utilization of the team. On the other hand, the staying power of a USV is the health of the USV, which deteriorates as the amount of damage caused by its opponent's fired weapon increases. In addition to the global objective function, the local reinforcement program implements the local objective functions based on the objective of a defined behavior. The design of the opposing (red) team control policy is crucial for providing a realistic challenge to the blue team. In this regard, the red team control policy is designed to be competitive in the sense that their action selection within executing behaviors is not fixed for similar states, hence no predefined pattern can be easily deduced by the blue force. Table 6 shows example encoded rules used for the red team behavior selection during simulation.

Figure 9. Island conquering simulation setup in 3D settings (left) and the simplified 2D environment (right): (a) simulation setup in 3D environment; (b) simulation setup in 2D environment.

Case 2: Cooperative Islands Conquering
In this scenario, multiple USVs compete to conquer more islands while engaging in combat, as described in [19]. We implement this scenario in our virtual environment while also using the environment in [19] for evaluation. Figure 9 shows the simulation setup for this scenario in the two environments. The environment consists of N islands and two teams of unmanned surface vehicles (boats). Each team has information on the location and number of islands and their states, whether conquered or unconquered by the team. An island is said to be conquered by a team if a member of the team moves to the coordinate of the island and stays there for that time period, and no opponent boat moves to that particular island conquered by the team. If two opponent boats occupy an island at the same time, the island is not awarded to any team for the elapsed time steps. The red force uses fixed rules while the blue force is trained with our approach, which implements or modifies the additional behaviors described in [19]. We compared the performance of these policies and the trained controllers in both environments and settings. In this case, the global objective function is as defined in [19], where $\beta$ is the damage caused to the opponent team, $\gamma$ is the time taken for the target to be destroyed, and $\alpha$ is the damage caused by opponents to the team.
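The island ownership rule described above can be sketched per time step as follows. The occupancy flags and team labels are hypothetical simplifications; how the environment of [19] detects that a boat is at an island's coordinate is not modeled here.

```python
def island_owner(blue_present, red_present):
    """Team credited with the island for one time step, or None."""
    if blue_present and not red_present:
        return "blue"
    if red_present and not blue_present:
        return "red"
    return None  # empty or contested: no team is credited


def tally(occupancy):
    """Count credited time steps per team.

    occupancy: iterable of (blue_present, red_present) pairs, one per step.
    """
    scores = {"blue": 0, "red": 0}
    for blue_present, red_present in occupancy:
        owner = island_owner(blue_present, red_present)
        if owner is not None:
            scores[owner] += 1
    return scores
```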

Results and Discussion
In the first case, the blue team is declared the winner of the confrontation if the target is not destroyed within the episode run. After a number of training episodes, the obtained results show better performance, corresponding to a higher score and a higher average score for our agents as the training progresses. This can be seen in the results of Figure 10. Figure 10a shows the score obtained by the team based on the global objective function, while Figure 10b shows the average score per generation. To evaluate the trained controllers, results for different configurations of the case are presented for a simplified 2D environment and the virtual environment of this work. The results in both settings are compared with a baseline heuristic policy. In the heuristic strategy, the USVs are uniformly distributed around the protected target(s) and move according to preplanned waypoints. When the intruder boats appear, the blue USVs intercept the closest one by simultaneously moving closer to the intruder with a fixed speed and fixed turning angles. The intruder is automatically fired at when it is within the firing range of the blue USVs. Alternatively, the intruder to intercept is selected randomly, either one on one, i.e., only one blue USV intercepts an intruder, or two to one, i.e., two blue USVs randomly intercept a selected intruder, where the USVs employ rules similar to those defined in Table 6. Figure 11a,b presents the evaluation results compared with the baseline heuristics in both the virtual environment and the simplified environment, with a varied team configuration to demonstrate the transferable and scalable nature of the learned controllers. While the 2D environment assumes calm sea conditions in all settings, different sea conditions are specified in the 3D environment to test the performance in both settings.
The results show that in both environment settings, the blue team achieved the highest success rate in setup B6vsR3, when all teams had the same configuration. There is no significant drop in performance in B10vsR7 despite the constraints on the capability of the blue team. This may be a result of the size and speed advantage of the blue team. However, there is a further drop in performance in both environments with equal team sizes and when the red team has a team size advantage. A significant drop in performance was witnessed in the B7vsR10 setup. This can be attributed to the impact of large environmental disturbances caused by the wind and currents, in addition to the team size advantage of the red team. The impact of wind and current can be seen in all settings, as the performance in the 2D environment (with no environmental disturbances modeled) is better in all scenarios. Moreover, in all setups of the escort task, the trained controllers outperformed the heuristic-based fixed strategies by a margin of between 10% and 30% in win rate.

Figure 10. The learning curve of agents based on the proposed approach during training: (a) learning agents' score per episode; (b) average score over 50 episodes of simulation run.

On the other hand, Figure 12 shows the episodic score and average score over 50 episodes (generation) of simulation in the Case 2 scenario. The score per episode and the average score per generation of antibodies increase as training progresses, as shown in Figure 12a,b, respectively. To evaluate the performance of the output controllers after training, different configurations of the scenario are also run on both the virtual environment and the environment developed in [19] without any major changes to the output controllers, and compared with a heuristic baseline approach which comes in two forms in terms of selecting a task to perform.
CCI: Under this strategy, each member of the blue force selects and conquers the closest island. When an opponent appears, it also selects the closest one to attack. The intruder is automatically fired at when it is within the firing range of the blue USVs. Fixed rules for behaviors such as detour, track and retreat can be employed by the USV during combat engagement, similar to those shown in Table 6.
CIU: This is also known as conquer in units, a strategy which works by grouping team members to perform the conquering in units. In this case, the groups can select the islands to conquer randomly. However, when multiple opponents are detected, the units may be dissolved during combat.

Figure 12. The learning curve of agents based on the proposed approach during training: (a) score per episode obtained by agents; (b) average score over 50 episodes of simulation run.
The average winning rate, defined by the number of islands, the losses of the team and the damage to the opponent faction, is shown in Figure 13a,b. A win here means that the blue team scores higher than the red team based on the objective function. In this scenario, interestingly, the trained controllers obtained their maximum performance in the Scene 4 setup despite the more significant environmental disturbances compared with Scenes 1 and 2. Moreover, as can be observed, the highest performance obtained in this scenario was about 4% lower than that obtained in the escort task. However, the controllers appear to have performed very well when evaluated in the 2D settings. This is also evident in that the trained controllers outperformed both fixed strategies in both environments.
To develop realistic coordinated control strategies, multi-USV systems need an adaptive navigation strategy to face unpredictable environmental forces such as waves, wind, and water currents. A starting step toward this goal is to have a digital environment with realistic modeling where designers can assess their control strategies under different degrees of environmental disturbances. While some physics-based simulators exist for other types of unmanned vehicles, very few exist in the literature for simulating multi-USV missions. Moreover, to the best of our knowledge, the few available simulators do not provide a data storage engine that can facilitate scenario recreation during and after simulation. The digital maritime environment developed in this work provides this functionality.
On the other hand, learning in the digital environment can be time consuming as the complexity of the environment increases. However, our proposed learning algorithms were able to learn a stable control policy that achieved the coordination requirement of the multi-USV systems after a few generations and the evolution of classifiers as can be seen in the learning performance graphs. This, we believe, is facilitated by the behavior-driven approach as the objective within each behavior can easily be formulated. By modeling the physical environment and constraints of USVs in a digital environment, the obtained control systems could be implemented in real systems, as the controllers were able to perform credibly in settings they were not trained on. The above results show the robustness and scalability of the controllers realized by our training system and approach.

Conclusions and Future Work
This work presents a training simulation platform that includes a realistic simulation environment. We present a generic framework that can be used for training system designs that are flexible and easily scalable. The abstraction of the physical platforms and environment, together with immune-inspired behavior-driven decision-making and behavior learning, are the key contributions of this work. Our approach to implementation enables different learning approaches and algorithms to be developed as learning or training algorithms and also allows the various components of the physical simulator to scale at the local level without major changes to other components.
A unified approach to USV training presents a modular and simple development process for intelligent controllers. The purpose of all the experiments was to obtain realistic controllers that are adaptable and understandable, to meet the requirements of real USVs and the physical environment. As can be seen in the performance graphs provided, the training system is able to discover policies that improve the performance of the USVs on their tasks. By continually updating the non-self database, agents do not waste time evaluating antibodies that have already been evaluated by others, hence speeding up the training process to realize a stable policy with few simulations. Simulation results show that the resulting controllers can achieve an average winning rate between 52% and 97.6% in all test cases, indicating the effectiveness of the proposed approach and its feasibility in realizing adaptive controllers for efficient cooperative decision-making among multiple unmanned systems. The results further point to the importance of modeling environmental disturbances, since the performance of the controllers increases when they are transferred to calm environments. Even though the controllers' performance decreases with the different team configurations, about 20% of this performance degradation can be attributed to the environmental disturbances at sea.
Finally, using the proposed system, real or simulated sensor data can be obtained and processed based on the model definition. This allows for straightforward switching between real and simulated sensors and actuators. We also introduce, as part of our contribution, more realistic multi-USV learning tasks that can be used for evaluating cooperative control strategies and for advancing multi-agent reinforcement learning research. Subsequently, we intend to improve the mathematical models in parts of the system, to use real equipment in a human-in-the-loop configuration, and to acquire real operational data from field tests to be used by the trainer to improve the tactical behaviors of unmanned systems. In the future, more experiments will be conducted by implementing different learning algorithms on the training system. A field test will be carried out to ascertain the performance of these controllers in the real world. We will also improve the training algorithm and develop more real-world scenarios that can be used as benchmarks for training and designing control systems for multi-USV systems.
Author Contributions: Conceptualization, supervision and funding acquisition, Y.X.; methodology, software, writing-original draft preparation, S.N.; software and validation, writing-review and editing, K.P.; software and validation, W.S.; validation, experiments and editing, R.L. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
There is no conflict of interest.