Article

Continuous Autonomous Ship Learning Framework for Human Policies on Simulation

Department of Multimedia Engineering, Dongguk University-Seoul, 30 Pildong-ro 1-gil, Jung-gu, Seoul 04620, Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(3), 1631; https://doi.org/10.3390/app12031631
Submission received: 18 November 2021 / Revised: 3 January 2022 / Accepted: 1 February 2022 / Published: 4 February 2022
(This article belongs to the Special Issue Robotic Sailing and Support Technologies)

Abstract

Considering autonomous navigation in busy marine traffic environments (including harbors and coasts), the major issues to be solved for autonomous ships are the avoidance of static and dynamic obstacles, surface vehicle control that accounts for the environment, and compliance with human-defined navigation rules. The reinforcement learning (RL) algorithm, which has demonstrated high potential for autonomous cars, has been presented as an alternative to mathematical algorithms and has advanced in studies on autonomous ships. However, an RL agent that learns through interaction with the environment receives relatively little data from the marine environment. Moreover, the open marine environment, with its excessive degrees of freedom, makes it difficult for autonomous ships to learn human-defined navigation rules. This study proposes a sustainable intelligent learning framework for autonomous ships (ILFAS), which addresses these difficulties and enables the ship to learn human-defined navigation rules through neighboring ships. The application of case-based RL allows humans to participate in the RL learning process through the neighboring ships, so that human-defined rules are learned. Cases organized into curricula achieve strong learning results with less data, together with the layered RL of the autonomous ship. The experiment targets autonomous navigation from a harbor whose neighboring coast carries marine traffic. The learning results using ILFAS are compared with those obtained in an environment where random marine traffic occurs. In the experiment, the learning time was reduced to a tenth. Moreover, in the new marine traffic scenario, the success rate of arrival at the destination was higher, with fewer control actions, than with the random method. ILFAS can continuously respond to advances in ship manufacturing technology and changes in the marine environment.

1. Introduction

1.1. Background

Artificial intelligence for autonomous ships (unmanned surface vehicles (USVs) and autonomous surface vehicles (ASVs)) has been extensively investigated in the private sector (including bathymetric measurements, subsea pipeline management, marine geography surveys, and safety management) and the military sector (including patrol, security, intrusion detection, blockage, and defense). Artificial intelligence has proven its benefits in solving control issues related to operating ships in an unstable natural environment, reducing risk, and enhancing efficiency under collaborative or competitive conditions with other ships in the real ocean [1,2,3,4,5,6].
Operating autonomous ships is more difficult than operating self-driving cars because it is difficult to obtain explicit decision-making data from the environment. Self-driving cars learn driving rules from definite environmental data, such as lanes and traffic lights, and the dynamic obstacles around them are considered when deciding which actions to take. For an autonomous ship, however, it is difficult to obtain rule data from the environment, and the rules must therefore be learned from the vessels sailing around it. Consequently, a learning environment that includes ships operated by humans is required, and not only ship control but also human rule learning must be addressed [7,8,9].
In board games such as chess, human-level odds of winning have been achieved under rules set by humans, and real-time video games such as Atari have achieved high scores through human-level control [10,11]. In Starcraft (a real-time strategy game), multiple agents can be controlled at the same time to outperform human players [12], while in Minecraft (a role-playing video game), tiered problems are solved through decision-making, hinting at the possibility of solving problems the way a human would [13]. In this way, a reinforcement learning algorithm can acquire not only the rules of a game but also the controls and strategies needed to solve problems like humans [12,13]. However, in the learning of a self-driving vessel, several factors, such as the natural environment and the harbor entry/departure rules defined for each harbor, must be learned through the surrounding vessels. This makes it difficult to establish an environment in which human-level rules can be learned [14].

1.2. Challenges

In this study, an autonomous vessel moving to its destination was considered, taking into account the topographical features of the marine environment; the vessel follows the navigation rules in a human-like manner. Each harbor has a different marine traffic environment, which is defined by humans, and there are rules regarding the other ships the vessel encounters. These rules are not provided as signs or traffic lights in marine environments or ports. The typical method is to observe the behavior of other ships, learn the rules, set sail with nearby ships, or wait until all the ships have arrived [15]. However, this process is time-consuming, and a simulation that covers all of these environments is expensive to implement. In addition, it is difficult to build an environment that considers every case [16,17,18]. Furthermore, previous learning experience can be forgotten depending on the order of learning (catastrophic forgetting) as new environments are learned [19,20]. The issues to be considered when applying RL algorithms to autonomous ships, relating to the learning environment, the learning method, and the structure of the RL algorithm, are summarized below.
The first issue is that the data available to the RL algorithm are insufficient for learning human-defined rules and making decisions. The sailing environment to the destination must consider the topographical features and the natural environment, and the surrounding vessels, which embody the rules, should respond to the autonomously learning ship. These vessels present a variety of vessel types and behaviors.
Second, learning a human rule requires a systematic learning method. Random learning can provide a wide range of experiences by presenting different states, but the rules are hard to learn from it. Therefore, the learning environment must contain the rules and must be consistent.
The last issue is the complexity of the decision-making required for an autonomous ship to operate. To achieve its missions, an autonomous ship must plan a destination at the human level, decide its direction and speed while considering obstacle avoidance and sliding on the water surface, and control its posture according to the features of the vehicle itself. The more data provided, the larger the state space becomes, and the decision-making space also expands with the number of actions. Substantial learning time may be required, or indistinct results may be obtained because the learning results do not converge.
To solve these problems, previous studies have presented environments that include various partner ships for ship avoidance and route planning [21,22]. Simulation with other ships (including their strategies) allows autonomous ships to learn human-defined navigation rules from the environment. Through this research, we expect not only to learn a solution to the aforementioned problems but also to learn human-defined navigation rules at the human level [23,24,25]. To address learning environment problems, autonomous cars and robots have developed separate training platforms for continuously acquiring data and training in a given environment, or use simulations implemented with OpenAI's open-source Gym to provide a comparable vehicle environment [26,27,28].

1.3. Approaches

This study solves three issues in the learning of autonomous ships by implementing a sustainable intelligent learning framework.
First, this study proposes an approach to learning human-defined navigation rules by defining the operation cases of neighboring ships based on case-based RL. Various data are generated as the neighboring ships autonomously comply with the human-defined navigation rules.
Second, curriculum learning methods are applied to prevent catastrophic forgetting (the loss of previous learning experience) and to reduce the learning time through systematic learning. The learning process is structured so that the autonomous ship learns complicated problems of varying degrees of difficulty. One such method, self-play, induces sufficient learning experience with less data by making the autonomous and neighboring ships, which have diverse purposes and rewards, learn competitively.
Third, the decision fields are divided and stratified to solve the complex problems facing the autonomous ship, and the algorithm structure is generalized. The layers are classified into the mission layer, sailing layer, and control layer, and they are interconnected by target values, state data, assessment, and reward per layer for auto-supervised learning. More definite learning results can be acquired per layer by stratifying the state-action space according to the decision-making space.

1.4. Contributions

The training framework proposed in this study differs from those proposed in previous studies in the following ways.
(1) The learning environment is made intelligent by applying case-based RL, which can embed human-defined navigation rules in the environment. This intelligent learning environment solves the problem of insufficient environmental data in the marine environment using intelligent neighboring ships that carry the rules, and it is extensible to a wide range of simulators.
(2) The problems that the autonomous ship needs to solve are clarified, and learning time is saved because global and local curricula are presented according to the degree of difficulty. The learning results are enhanced with less data owing to self-play curriculum learning, which facilitates a quick response to changes in a harbor's marine traffic policies and in the marine environment.
(3) The decision-making spaces of autonomous ships are simplified and clarified through hierarchical classification. The target value, state data, assessment, and reward standards are automatically established for the supervised interlayer learning structure.
The stratified RL algorithm and intelligent learning framework can consistently cope with changes in the marine space in which the RL algorithm is tested and operated, as well as with changes in ship control approaches brought about by developments in shipbuilding technology.

2. Related Works

2.1. Learning Method Based on Curriculum

The approach of implementing data in realistic simulations, such as game environments, to solve outstanding problems has been widely adopted in studies on RL algorithms [16,17,18]. However, when this approach is employed in RL studies, the environment has to be implemented by experts to enable the learning of specific rules; it therefore requires a large amount of quality learning data along with extensive cost, time, and labor [29,30,31]. Because it is difficult to provide a perfect environment, such as in games, some studies have used a random scenario environment as the general learning method for the RL algorithm [32,33]. A random scenario means that a scenario is selected randomly from data obtained in the real world. However, only a weak inductive bias is available for the stability of learning when the autonomous ship depends significantly on dynamic obstacles, and when the environmental data change without regard to the movement of the autonomous ship, learning is not performed properly. To obtain the desired learning results when training the RL algorithm, the problems are classified by degree of difficulty, and the environmental data gradually become more complicated while being provided consistently; accordingly, an autonomous ship can solve difficult issues in reality [19,34]. As RL algorithms advance, more complicated issues must be solved, and curriculum learning is becoming a general learning approach.
Among the learning approaches using a curriculum learning environment, the task-specific, self-play, and teacher-student approaches can be implemented without any modification of the RL algorithm [20]. The task-specific approach is the most common form of curriculum learning; it configures the curriculum with problems ordered by degree of difficulty and adjusts the difficulty as learning progresses. A task-specific approach was proposed to solve the problem that autonomous ship learning could not adapt properly to a new environment owing to the overfitting of the RL algorithm to a fixed environment. This approach achieves good performance with less data by controlling the difficulty of the problems [35,36]. Global space learning is performed using a curriculum of low difficulty, so the autonomous ship first travels to a destination without dynamic obstacles in the learning environment. Overfitting, which may occur in the learning process, is prevented by providing a curriculum with busy marine traffic according to the degree of difficulty. Moreover, a curriculum classified into local events can solve the catastrophic forgetting issue of losing previous learning experience.
Neighboring ships are provided in real time during the learning of the autonomous ship and operate in relation to it. Self-play learning, among the curriculum learning approaches, makes agents with different goals and rewards compete with one another in the same environment and learn from one another. Although self-play learning uses the same RL algorithm, it induces competitive learning because each agent detects and learns the strategies of the others. Continuous learning through self-play allows agents to learn the environment fully, with rich experience gained through actions acquired from interactions with other agents. Furthermore, the actions of agents can develop into human-like skills, such as intrinsic motivation, through interactive learning among agents [20,30,37]. However, the competitive environment in self-play learning needs to include strategies to facilitate learning in the curriculum, and these strategies should include human-defined navigation rules. To maintain a rich learning environment that includes human-defined navigation rules, an approach for embedding the rules is required. Case-based learning or separate scripts have been used in some cases to define experts' knowledge of the rules, and accumulated experience has been extracted from expert demonstrations and applied to planning and execution in games [38,39,40]. When case-based learning is applied to the RL algorithm of the neighboring ships in the ocean, the self-play curriculum learning method can be applied simultaneously. The neighboring ships exhibit intelligent navigation because only cases with minimal goals and actions are presented, which enhances their autonomy, and the environment thus includes the human-defined navigation rules that the autonomous ship should learn.
The stratified RL algorithm of the autonomous ship in the learning environment can perform autonomous and supervised learning among layers using the teacher-student approach, a curriculum learning approach. The teacher-student approach automatically presents the difficulty of problems according to the progress of the RL algorithm; it adjusts the difficulty based on the loss rate of the RL algorithm and assesses the divergence of the learned weights. As described above, the teacher-student approach induces transfer learning based on previous experience and degrees of difficulty, and it guides the RL algorithm's learning through bootstrapping [20,41]. The stratified algorithm transfers the reference target values between layers, evaluates them through external rewards according to the degree of achievement, and induces learning. In this approach, a layer can automatically learn from an already-learned layer through supervised learning.
This study configured the learning environment by expressing ships, including their navigation rules, as cases and implementing them with RL in real time. The learning environment comprises curriculum learning, including the self-play approach, to induce transfer learning by solving problems progressing from a lower to a higher degree of difficulty, so that the autonomous ship learns fully by itself.

2.2. RL for a Hierarchical Autonomous Ship Task

Autonomous ships performing missions in the ocean instead of human beings have been investigated in various fields. The control and route search of a ship began with navigating the ship to a place where humans wanted to go by controlling direction and power under the natural environment, including wind, waves, and tides. Ship navigation was first approached with physics and control algorithms based on mathematical models [42,43,44]. The A* algorithm was adopted to plan routes to waypoints and given destinations within a given space, while Gaussian process regression was used to clear the noise collected through sensors and to adapt ship control in real time to interference and slipping in the ever-changing natural environment [45,46,47]. RL has emerged as an important technology, replacing complicated mathematical algorithms and solving these problems. It has been demonstrated to have high potential for applications in ship movement control, posture control against the natural environment, collision avoidance, and path finding [8,9].
In the control field, studies on autonomous ships using RL applied value-based Q-learning to obstacle avoidance, decisions on approaching targets, speed control, and posture control [48]. The algorithm was also investigated in the ship control field by applying a policy-based deep deterministic policy gradient (DDPG) [49]. Moreover, studies have learned a ship's posture control model in real time by configuring a route based on the DDPG and tracking the configured route [21]. The studies above dealt with the movement of a ship in the local space, considering the global space or surrounding conditions as needed while focusing on the ship's posture. Accordingly, the local space in the ship control field can be defined by the natural environment, which affects ship control, and by the present state of the ship. Therefore, the supervision space and obstacle avoidance, which must consider the given global space, can be separated from the local space.
Furthermore, in the navigation field, studies on autonomous ships using RL applied Q-learning to smart ships, path finding, and ship control. Environments (including the distance to a destination and penalties for obstacles) and restricted areas were implemented based on the Nomoto model in a narrow waterway, and autonomous ships learned human-defined rules in those environments [25]. The RL algorithm was also applied to decision-making for autonomous navigation under partial observation considering a real environment [50]. A multi-agent algorithm was applied to maintaining formation during navigation and to path finding with several autonomous ships sailing together [51]. Studies in the navigation sector aimed at moving a ship to a destination in a given marine space; accordingly, they focused on acquiring the state data of the space through sensors, route planning considering other ships, and obstacle avoidance. Similar to the studies in the control field, studies in the navigation sector also included the control field depending on the circumstances.
The curse of dimensionality in the RL algorithm can be relieved by separating the complicated issues that autonomous ships must solve, as described above. Through stratification, the knowledge domain for each problem can be separated, and a Markov decision process (MDP) with a hierarchical structure can be configured [52]. The most significant merits of stratification are the reduced learning time for the actions and control of the autonomous ship and the more definite learning results [53]. The hierarchical structure enables supervised and unsupervised interlayer learning and offers advantages in learning speed and generalization by simplifying the algorithm [54,55]. Moreover, it induces cooperation among the separated layers. The algorithm in the higher layer uses a discrete sequence of sub-goals in the lower-dimensional state space as learning data to achieve its main goals, while the algorithm in the lower layer can solve complicated control issues through the cooperative hierarchical structure by learning local routes in the original higher-layer state space to achieve the goals designated at the higher level [56]. The hierarchical approach was also adopted in a lifelong learning framework, which separated diverse skills into lower-level issues that are selected and applied as needed; it has been proposed as a framework applying RL to the various issues that can arise in real environments [57].
This study divided the issues related to autonomous ships into autonomous mission, navigation, and control and interconnected them. Autonomous navigation was applied in the experiment to learn human-defined navigation rules.

3. Intelligent Learning Framework for Autonomous Ships

This study proposes an approach to creating an intelligent environment in which autonomous ships achieve human-level learning through an RL algorithm. By applying case-based RL, the neighboring ships, which account for the majority of the decisions an autonomous ship must learn, autonomously react to obstacles and express human-defined navigation rules, thereby embedding human knowledge in the environment. The experiment defined human experience in operating ships in the ocean as cases and adopted a curriculum learning approach that can adjust the degree of difficulty based on complicated marine traffic conditions. In this learning environment, the autonomous ship learned the various issues occurring in the ocean through self-play with neighboring ships based on human-defined navigation rules. The RL algorithm of the autonomous ship, hierarchically separated per decision field, learns through the interaction among the mission, navigation, and control fields and the environment. Moreover, the upper layer allows the lower layer to learn through the interlayer teacher-student curriculum learning approach.
This study adopts multi-agent posthumous credit assignment based on counterfactual multi-agent policy gradients (COMA) as the RL algorithm applied to the autonomous ship [58]. The autonomous ship and the neighboring ships are operated by the same RL algorithm through ILFAS's self-play curriculum learning. Recent research has considered applying multi-agent RL to autonomous ships, both to perform missions stably with one or more autonomous ships and to perform complex missions through cooperation among them; the ILFAS enables multi-agent-based autonomous ships to be trained in line with this research direction [59,60]. The multi-agent RL is based on an actor–critic structure, and when applied to a single autonomous ship, it behaves the same as a general single agent. Because autonomous control is limited to the issues of a single ship, it is explained based on proximal policy optimization, an RL algorithm for a single agent [61]. The same RL algorithm used for the autonomous ship was also adopted to express the diverse vessels operating in the environment. This paper explains the algorithm based on COMA and compares and evaluates the learning method for neighboring ships with respect to the performance of the ILFAS.

3.1. Architecture

The learning framework (ILFAS) proposed in this study comprises three parts: the hierarchical RL frame applied to the autonomous ship, the case-based curriculum system that induces the learning of the autonomous ship, and the environment used in the learning. The physical environment was created in three dimensions (3D) with a fixed height for the LiDAR simulation using Unity. Moreover, a basic physics model for the acceleration of the ship's weight and for sliding on the sea surface was adopted.
The elements are explained from left to right in Figure 1 to facilitate the understanding of the learning frame.
The hierarchical learning structure comprises the autonomous mission (Mission), which reduces the learning complexity for the autonomous ship, autonomous sailing (Sailing), and autonomous control (Control), which make decisions and operate various vessels. For Sailing in Figure 1, the autonomous ship was trained through transfer learning using the case-based curriculum system, and Control was trained through supervised learning based on Sailing. After Sailing was trained, Sailing(a, s_t, s_{t+1}, r), describing how the autonomous ship should move, could be presented to Control; therefore, the expected action value a (the direction and speed) from Sailing could also be predicted. Sailing's action value a and current state s_t are the target and state values of Control, which were provided along with environmental data such as wind. Through Control's action in the learning process, the autonomous ship operated in the environment. The operation result of the autonomous ship was obtained as s_{t+1} of Sailing, and Control learned from the difference from the expected action value, which was returned to Control as reward r. The state s_t, consisting of environment data obtained through Sailing and the environment plug-in, becomes the state and target value of Control. The state was transferred to Control, the autonomous control, based on what was learned in Sailing, and the actions from Control were evaluated and learned by the autonomous ship. Mission is a general RL field related to space planning. For example, the marine space was divided based on the supervision scope of the autonomous ship for ocean surveillance, and the division of the marine space and the sequence of movement locations were determined for collecting marine data. A destination was delivered to Sailing through the spatial data input from the environment, and the learning results of Sailing were evaluated based on successful arrival at the destination.
The case-based curriculum system configured the curriculum by degree of difficulty and by space, defining the cases of neighboring ships operating in the ocean. The curriculum was classified into the global curriculum, which covers the entire given space, and the local curriculum, based on sequences in time and space. The curriculum was provided in the environment in real time, and the cases stated in the curriculum were provided in the decision-making order of the neighboring ships. Moreover, the RL algorithm implementing the cases had already learned the movement between waypoints.

3.2. Hierarchical RL Frame

This study hierarchically configured the RL of the autonomous ship to enable its operation in reality. The correlation between the separate layers is explained using Sailing and Control as examples. RL based on a multi-agent algorithm was applied to Sailing to learn autonomous navigation, including path finding, collision avoidance, and the human-defined navigation rules that depend on the operation of other ships. Control learns the control actions (a_c) based on the target action and environment transferred from Sailing; this process depends on the control methods and types of ships, and the RL algorithm can be applied to existing ship and posture controls.
The state input and action output of Sailing are explained below.
(o_s ∈ O) = { [Self(Position, Speed, Direction, Others)], [Env_all], [G_{Goal Pos, Dir}], [O_{Obstacle by LiDAR}] }   (1)
The observation (o_s ∈ O) of the autonomous ship for the decision-making of Sailing is defined in Formula (1). Self is the ship state; Env is transferred from the environment and includes the data from the natural environment simulator. The absolute coordinates and relative direction of the destination G are calculated. O, the LiDAR input, gives the distances to static obstacles, such as an embankment, acquired through short-range obstacle detection in the environment, and to dynamic obstacles (the neighboring ships operating nearby).
(a_s ∈ A) = { [Speed up, Keep, Speed down], [Left, Keep, Right] }   (2)
The action (a_s) was declared discretely for learning Sailing. The maximum amount that can be changed by one action of Formula (2) is limited, for example to (10 knots, 5 degrees); the amount that can be changed at a time varies with the vessel type, and this amount is the criterion for evaluating and rewarding the actions of Control. For [O_{Obstacle by LiDAR}], a single-layer LiDAR inputs the obstacle detection data around the autonomous ship, and other sensors can be added when required. The relative location of the autonomous ship was calculated using the obstacle and destination data, as in the real environment. The network of the autonomous ship's layer for path finding is defined in Formula (3). For the reward of Sailing, when the autonomous ship successfully arrives at the destination provided by the mission, the rewards for the goal and for a collision are 1 and 0.001, respectively, and the discount rate (γ) is 0.0001. The state value is received from the environment. The Sailing network is shown in Formula (3):
[Network π_s, Replay Memory D_s, Parameter θ_s, Goal G_s, Reward R_s, Action a_s]   (3)
After this learning, the Sailing layer is ready to perform supervised learning for Control.
The observation of Control is based on the state value of Sailing and is defined in Formula (4). Self(Others) is the present state, including the gradient of the ship. Env(Option) is selectively applied according to the natural environment simulation, such as a plug-in; this value is the vector of the wind direction and tide, and its value relative to the autonomous ship is calculated as shown in Formula (4):
(o_c ∈ O) = { [Self(Position, Speed, Direction, Others)], [Env(Wind, Wave, Tidal Direction | Option)] }   (4)
Env, the natural environment, is provided through Sailing, and the Control network is shown in Formula (5). The action (a_s) determined by Sailing is defined as the target value of Control (task_target = ±10 knots, ±5 degrees). If a_s = (Keep), then the value zero ("0") is transferred. The target value of Control is given as Goal a_s, the action that Sailing intends to take. The action (a_c) of Control, the autonomous control, is applied in the environment; Sailing(s_{t+1}) is then received from the environment and returned to Sailing to calculate the reward, as expressed below:
[Network π_c, Parameter θ_c, Goal a_s, Reward r(||(s_t | a_s) − (s_{t+1} | a_c)||²), Action a_c]   (5)
The reward of Control is Reward r in Formula (5) and is calculated from the difference between the state expected after taking the action given by Sailing and the state actually reached after applying the action from Control. Control takes the action expected by Sailing while accounting for the natural environment, and learning in the Control layer maximizes this reward. Accordingly, the tilt and rollover state of the ship, which depend on the ship's control method and the physical features of the hull, are obtained from the environment by Control.
Contrary to the hierarchical structures connected during learning in previous studies, Sailing and Control generate actions separately and have independent learning structures. The actions (a_s) of Sailing were applied to the environment without modification, and their learning proceeded first. Once navigation to a destination went smoothly after the completion of Sailing's learning, the teacher-student curriculum method was applied to train Control.
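To make the interlayer hand-off of Formulas (1)–(5) concrete, the following Python sketch shows one way the Sailing target could be passed to Control and how the Control reward of Formula (5) could be computed. All function and variable names are illustrative, and the specific action discretization and the negative sign of the reward are assumptions, not the authors' implementation.

```python
import numpy as np

# Discrete Sailing actions from Formula (2): (speed change, heading change).
# One action may change speed/heading by at most +-10 knots / +-5 degrees (assumed here).
SAILING_ACTIONS = [(ds, dh) for ds in (+10.0, 0.0, -10.0) for dh in (-5.0, 0.0, +5.0)]

def sailing_observation(self_state, env, goal, lidar):
    """Assemble the Sailing observation o_s of Formula (1) as one flat vector."""
    return np.concatenate([self_state, env, goal, lidar])

def control_observation(self_state, env_wind_wave_tide):
    """Assemble the Control observation o_c of Formula (4)."""
    return np.concatenate([self_state, env_wind_wave_tide])

def control_target(sailing_action_idx):
    """The Sailing action a_s becomes the Control target (Goal a_s in Formula (5))."""
    return np.array(SAILING_ACTIONS[sailing_action_idx])   # e.g. (+10 knots, -5 degrees)

def control_reward(target_delta, achieved_delta):
    """Reward r(||(s_t | a_s) - (s_{t+1} | a_c)||^2) of Formula (5), taken here as a
    negative squared difference so that matching the Sailing target maximizes reward."""
    diff = np.asarray(target_delta) - np.asarray(achieved_delta)
    return -float(np.sum(diff ** 2))

# Example: Sailing asks for (+10 knots, -5 degrees); Control achieves (+8.5, -4.0).
r = control_reward(control_target(SAILING_ACTIONS.index((10.0, -5.0))), (8.5, -4.0))
```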

3.3. Case-Based Curriculum System

The operation case of neighboring ships in the simulation comprises the following:
Case = { Index_{Case index}, Name_{Name index}, [Event_1, Event_2, …, Event_n], Goal_{Position} }   (6)
An event defined in a case is the detailed information to be executed toward the final goal. A case consists of waypoint coordinates based on a time schedule and states the detailed actions on the way to a destination or after arrival, as described in Formula (7):
Event = { Index_{Mission index}, Name_{Mission index}, Goal_{waypoint}, [Action] }   (7)
Action is defined as an [Action Script] or a [Parameter] and is configured as [Stay, Cycle, etc.] and [Speed, Random, etc.]. Speed and Random are applied to ship control on the way to Goal_{waypoint}; random noise is added to the action to vary the speed or produce an unstable track, representing a ship that has difficulty sailing straight, such as a yacht. Stay and Cycle are executed at Goal_{waypoint}: the ship stays on the ocean at the destination for a designated time or sails in a circle within 10 m of the destination. Actions can be added depending on the purpose of the training. The RL of a neighboring ship transfers the ship control authority to the action in the action script per event or uses it as a reference value for navigation. Single or multiple cases can be applied simultaneously, and the number of ships to arrange is designated; therefore, the condition Allocated Ship_Num ≥ Use Case_Num must be satisfied (the sketch below illustrates these structures). Cases were defined by a human operator in terms of the number of ships, coordinates, and actions to be executed, and each data point was saved in the case-based system database. Cases include human-defined navigation rules and can record and define the ship operation state in harbors and along coasts.
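The following Python sketch renders Formulas (6) and (7) and the condition Allocated Ship_Num ≥ Use Case_Num as plain data structures. The class and field names, and the example inward-ship case, are hypothetical and only indicate how such cases might be stored in the case database.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Event:
    """One event of a case (Formula (7)): a waypoint plus optional actions."""
    index: int
    name: str
    goal_waypoint: Tuple[float, float]                   # waypoint coordinates
    actions: List[str] = field(default_factory=list)     # e.g. ["Stay", "Speed", "Random"]

@dataclass
class Case:
    """An operation case of a neighboring ship (Formula (6))."""
    index: int
    name: str
    events: List[Event]
    goal_position: Tuple[float, float]

def can_run(cases: List[Case], allocated_ship_num: int) -> bool:
    """Condition Allocated Ship_Num >= Use Case_Num: each active case needs a ship."""
    return allocated_ship_num >= len(cases)

# Hypothetical inward-ship case: enter the harbor via one waypoint, then stay at berth.
inward = Case(
    index=0, name="inward_ship",
    events=[Event(0, "approach", (120.0, 40.0), ["Speed"]),
            Event(1, "berth", (20.0, 10.0), ["Stay"])],
    goal_position=(20.0, 10.0),
)
assert can_run([inward], allocated_ship_num=4)
```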
This study classifies the spaces for implementing the curriculum based on defined cases, as shown below.
Global Curriculum = [Basic skill set for the global space]   (8)
Local Curriculum = [Rule skill set for the local space]   (9)
Definition (8) includes path finding and fixed obstacle avoidance as the global content for the autonomous ship to learn. The experiment selected a harbor with busy marine traffic, and the curriculum started with learning to verify the route from the anchorage in the harbor to the coastal waters. In this learning process, random waypoints in the coastal waters were provided for departure, and the autonomous ship learned in harbors with fixed obstacles and with other obstacles in the coastal waters for departure and arrival.
Definition (9) is the intensive learning content in the local space based on the navigation time schedule for the autonomous ship and is provided sequentially. Depending on the complexity of the marine traffic environment configured with neighboring ships in the harbor, the autonomous ship stands by for the port entry of neighboring ships or keeps right in the waterway. It also comprises the movement to the final destination while avoiding moving fishing boats in coastal waters (anglers' boats move around frequently, marine sports zones are small, and fast boats move around). The inward voyage was implemented in reverse order.
Curriculum learning started with low-difficulty tasks, finding a destination while avoiding static obstacles in an environment without dynamic obstacles, and progressed to higher-difficulty tasks consisting of a busy marine traffic environment in which the number and actions of neighboring ships increase. As the degree of difficulty increased, the teacher-student method was additionally implemented for the Control of the autonomous ship after Sailing learning was completed, and the difficulty of the Control task increased as the environment became more complicated. A possible sequencing of this curriculum is sketched below.
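A minimal sketch of how such a curriculum might be sequenced is shown below, assuming the lessons of Definitions (8) and (9) are ordered by difficulty and that previously learned lessons are replayed alongside each new one to limit catastrophic forgetting. The scheduling policy and lesson names are illustrative, not the ILFAS implementation.

```python
import random
from typing import List

# Curricula ordered by increasing difficulty (Definitions (8) and (9)); names assumed.
GLOBAL_CURRICULUM: List[str] = ["find_entire_route", "avoid_fixed_obstacles",
                                "case_avoid_dynamic_obstacles"]
LOCAL_CURRICULUM: List[str] = ["stand_by_at_dock_area", "passing_method_in_waterways",
                               "avoid_water_recreation_zone"]

def next_lesson(stage: int, learned: List[str]) -> List[str]:
    """Return the cases to load for the current stage.

    The new lesson is combined with a replayed sample of already-learned lessons so
    that earlier experience is revisited while difficulty still increases stage by stage.
    """
    schedule = GLOBAL_CURRICULUM + LOCAL_CURRICULUM
    lesson = schedule[min(stage, len(schedule) - 1)]
    replay = random.sample(learned, k=min(2, len(learned))) if learned else []
    return [lesson] + replay

# Example: at stage 3 the local curriculum starts, with two earlier lessons replayed.
print(next_lesson(3, learned=GLOBAL_CURRICULUM))
```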
The RL algorithm applied to the cases was implemented in the given environment with the waypoints and destinations transferred from the case-based curriculum system. The busy marine traffic environment was created with neighboring ships operated by the multi-agent RL algorithm to avoid collisions. Various types of cases were implemented simultaneously, as demonstrated below:
(o_case ∈ O) = { [Self(Position, Speed, Direction, Others)], [Env_all], [G_{SubGoal Pos, Dir}], [O_{Obstacle by LiDAR}] }   (10)
The RL algorithm implementing the cases is almost the same as the algorithm applied to the autonomous ship. The observation (o_case ∈ O) is shown in Formula (10). The subsequent waypoint is received sequentially for G through the case-based curriculum system, and on arrival at the final G, the case ends. The action (a_n) is shown in Formula (11):
(a_n ∈ A) = { [Speed up, Keep, Speed down], [Left, Keep, Right] }   (11)
Although the same RL algorithm was applied, the agents could learn strategies through self-play curriculum learning with different goals and rewards and could learn human-defined navigation rules through this learning process. When the diversity and number of trained autonomous ships increase, the RL algorithm for neighboring ships applied to the case-based curriculum can be used for additional learning. Self-play curriculum learning demonstrates good learning results through relatively continuous learning.

3.4. ILFAS Training

Although the types of RL algorithms continue to be developed, they must avoid local optima and maximize rewards. Other researchers frequently use separately designed reward functions to achieve these goals quickly. However, this study aims to train an RL algorithm designed with generally delayed and immediate rewards.
The algorithm that follows is the policy-based multi-agent RL algorithm COMA. The following formula expresses the policy gradient with the Q function as the advantage, using the actor–critic structure: the actor updates the policies and the critic evaluates them. Policies are updated simultaneously, and rewards are maximized, as expressed below:
∇_θ J(θ) = E_{τ∼p_θ(τ)} [ Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) A^{π_θ}(s_t, a_t) ]   (12)
A^{π_θ}(s_t, a_t) = Q^{π_θ}(s_t, a_t) − V^{π_θ}(s_t)   (13)
The RL applied to multiple ships updates the actor part, log π_θ(a_t | s_t), using Formula (12), together with the critic A^{π_θ}(s_t, a_t) comprising the Q function. The critic of Formula (12) is given in Formula (13) and is approximated by V^{π_θ}(s_t) ≈ V_w(s_t); the policy and value parameters are then updated as follows:
θ ← θ + α_{actor} ∇_θ J(θ)   (14)
w ← w + α_{critic} ∇_w V_w   (15)
The actor and critic performed transfer learning by learning the changed environment based on Formulas (14) and (15), as illustrated by the sketch below.
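The sketch below implements the updates of Formulas (12)–(15) for a single agent as a plain advantage actor–critic step in PyTorch. The counterfactual baseline that distinguishes COMA is omitted, and the network sizes and learning rates are assumed, so this is only a simplified illustration of the gradient and the two update rules.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 16, 6          # illustrative sizes, not from the paper
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)    # alpha_actor
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)  # alpha_critic

def update(states, actions, returns):
    """One update of Formulas (14)-(15) using the advantage of Formula (13).

    states: (T, obs_dim) tensor, actions: (T,) long tensor,
    returns: (T,) tensor of discounted returns standing in for Q(s_t, a_t).
    """
    values = critic(states).squeeze(-1)                          # V_w(s_t)
    advantage = returns - values.detach()                        # A = Q - V (Formula (13))
    dist = torch.distributions.Categorical(logits=actor(states))
    actor_loss = -(dist.log_prob(actions) * advantage).mean()    # gradient of Formula (12)
    critic_loss = (returns - values).pow(2).mean()               # fit V_w to the returns

    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()      # Formula (14)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()   # Formula (15)

# Example call with a dummy trajectory of 8 steps.
update(torch.randn(8, obs_dim), torch.randint(0, n_actions, (8,)), torch.randn(8))
```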
This study configured the curriculum, and Definitions (8) and (9) are explained in detail as follows:
Global Curriculum = [Find entire route], [avoid fixed obstacles], [case: avoid dynamic obstacles], etc.
Local Curriculum = [Stand by at dock area], [passing method in waterways], [avoid water recreation zone in coastal waters], etc.
The local curriculum addresses more difficult issues while the global curriculum environment continues to exist. In the RL of the autonomous ship interacting with dynamic obstacles, path planning for autonomous navigation is learned in the global curriculum, and the navigation rules realized through collision avoidance are learned in the local curriculum. When Sailing learning is completed, as shown in Figure 2 and Figure 3, Control learning starts. The following example illustrates a curriculum configured for autonomous ships.
Transfer learning builds on the baseline training, which includes pathfinding without obstacles. After this learning was completed, the policy was adjusted through policy iteration, as shown in Formula (16). The adjustment applies the probability ratio r(θ):
r(θ) = π_θ(a | s) / π_{θ_old}(a | s)   (16)
After the policies were adjusted through transfer learning as in Formula (16), the reward trajectory R(τ) was generated by the policy π_θ with policy parameter θ; the trajectory consists of {S_1, A_1, R_2, S_2, A_2, R_3, S_3, A_3, …, S_n}. After global learning, a new environment in time and space was presented to the autonomous ship through the local curriculum; therefore, the trajectory R(τ) scatters and diffuses when learning fails. To prevent the previous learning experience from being lost (catastrophic forgetting), the previously learned cases should be provided together as the degree of difficulty increases. A sketch of how the ratio in Formula (16) is typically used is given below.
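For reference, the probability ratio r(θ) of Formula (16) is commonly used inside a clipped surrogate objective such as PPO's; the sketch below shows that usage, with the clip range and tensor shapes chosen for illustration rather than taken from the paper.

```python
import torch

def ppo_surrogate(new_log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Clipped surrogate objective built on the probability ratio of Formula (16).

    r(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed from log-probabilities.
    The clip range of 0.2 is an assumption, not a value from the paper.
    """
    ratio = torch.exp(new_log_prob - old_log_prob)                        # r(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return torch.min(unclipped, clipped).mean()                            # maximize this

# Example: a ratio near 1 keeps the adjusted policy close to the old one.
loss = -ppo_surrogate(torch.tensor([-0.9]), torch.tensor([-1.0]), torch.tensor([1.5]))
```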
The simple experiment below examines how transfer learning is implemented in the autonomous ship.
This experiment shows how adjustment proceeds without catastrophic forgetting through curriculum learning and how learning time is saved. It proposes a new curriculum in which the autonomous ship (black), which has learned basic pathfinding and obstacle avoidance in the given space (harbor map), must avoid a new ship (A) when encountering it. For the global environment given in the global curriculum, the graph in Figure 4b shows the values before (blue) and after (red) learning.
The values are the number of control signal commands issued by the autonomous ship, the number of steps to the destination, and the reward values. Autonomous navigation to a destination can be performed based on the previous learning experience of obstacle avoidance even before the additional learning; however, at this point the new environment yields non-smooth reward values and unnecessary controls. Through additional transfer learning of 2000 steps, the autonomous ship can obtain more reward with fewer commands while avoiding the other vessels it encounters.

4. Experiment

The computer simulation experiment was performed in a virtual environment modeled on a real harbor in Korea. The experiment compared two hierarchical RL setups: autonomous ship learning with randomly sailing neighboring ships, i.e., the general RL method, and autonomous ship learning with the ILFAS. Here, "random" means random scenarios that give the RL algorithm broad experience through sufficient exploration and exploitation; we compared our results to random scenarios because they can address any situation. Moreover, the stability of ship control was compared based on the success rate of autonomous navigation to a destination and on the actions performed during autonomous navigation in the new environment. Finally, the experiment examined whether the autonomous ship could learn human-defined navigation rules.
The 3D environment was implemented in Unity by adding the heights at which the LiDAR simulator could detect the structures in the harbor, as shown in Figure 5. A 3D ship model was used in the experiment, and the ship size was enlarged three times to provide a safety distance between ships and to simplify the LiDAR data. A penalty was provided immediately, and the scenario ended, after a collision with an obstacle, including the neighboring ships.

4.1. Training: Baseline Training (Path Finding)

Basic learning of the global space is implemented through the curriculum by setting a random destination in the coastal waters and departing from the harbor. The training result is shown in Figure 6.
The autonomous ship then receives additional learning on the avoidance of dynamic obstacles using different learning methods so that the learned weights can be compared: general RL in an environment with randomly sailing neighboring ships, and RL training with two types of inward ship cases using the ILFAS. The learning time was measured until the autonomous ship arrived at a destination 100 consecutive times while avoiding the neighboring ships. The reward is high when optimized navigation reaches the destination without collision. The episode length is the number of actions and is equal to the number of steps; a low episode length means that the goal was reached with an optimized number of actions.

4.2. Training: Learning Avoidance of Dynamic Obstacles (ILFAS vs. Random)

For the random RL, four random inward neighboring ships were generated in international waters. The autonomous ship aimed to sail to a randomly generated destination while avoiding the neighboring ships, which used various routes. The autonomous ship started to converge at 25 million steps and finally stabilized at 50 million steps. Learning was continued until the autonomous ship arrived at the destination 100 consecutive times in 100 attempts. The average step length (H) per episode was 810 (number of learning episodes = total number of learning steps/average number of scenario-ending steps). The training result is shown in Figure 7.
For the ILFAS RL, the case-based curriculum system operates the neighboring ships as typical human-defined inward ships; the case is the inward ship curriculum for several ships that are generally observed in harbors. Although the neighboring ships in the ILFAS simulation environment avoid collisions when a collision is predicted, the episode ends when a collision occurs. The autonomous ship started to converge at three million steps and stabilized from five million steps. The experiment aimed to arrive successfully at a destination 100 times in 100 attempts. The average episode length was approximately 700. The training result is shown in Figure 8.
Learning results: Although the random RL gained a variety of experiences through the random inward neighboring ships, it required 10 times more learning time than the ILFAS until the autonomous ship arrived at a destination 100 times in 100 attempts, and its average episode length (the number of ship control signals) was about 100 steps more than that of the ILFAS. V(s) in Figure 9 demonstrates that the reward trajectory R(τ) stabilizes as autonomous navigation continues. Whereas the random RL showed new learning owing to catastrophic forgetting, the RL by the ILFAS indicated that additional learning was performed based on the previous learning experience.

4.3. Experiment: Simple Traffic 1 (Marine Traffic Environment for Inward Ships)

After learning the avoidance of dynamic obstacles, the two RL algorithms were implemented and compared in a new environment. In the new environment, the neighboring ships were not randomly created but followed the right-hand path to comply with the human-defined navigation rules of the harbor. The experiment result is shown in Figure 10.
The learning results of Section 4.2 were applied to this experiment. Although the successful arrival rates at the destination were similar (90% in 100 episodes), the ILFAS showed a slightly higher success rate. The ILFAS had learned the intrinsic motivation that the right-hand path is safe; therefore, it avoided the neighboring ships it encountered. The random RL did not know the navigation rule that neighboring ships enter the harbor using the left-hand path; hence, it selected the right-hand path only as an avoidance action. However, the two learning methods showed significant differences in controlling the autonomous ship: as shown in Figure 11, the ILFAS is stable in controlling the ship, whereas the control by the random RL fluctuates substantially.

4.4. Experiment: Simple Traffic 2 (Marine Traffic Environment of Outward Ships)

This experiment compared the operation of the autonomous ship among several outward ships without additional learning. In contrast to the experiment in Section 4.3, four neighboring ships departed from the harbor, and two ships were artificially placed in front of those ships. There is a gap between neighboring ships #3 and #2 wide enough for one ship to enter. Two additional ships departed after the last ship. Moreover, the neighboring ships slowed down to induce the autonomous ship to crash into the other ships; to avoid collision with the neighboring ship in front, the autonomous ship also had to slow down. This experiment aims to verify whether the autonomous ship can retain the learned experience of taking the right-hand path, as specified in the human-defined navigation rules, even without neighboring ships entering the harbor on the left-hand path. The experiment result is shown in Figure 12.
The learning results in Section 4.1 (the learning results using the ILFAS and the random RL) were applied to the two inward cases without additional learning. Neither the ILFAS nor the random RL had learned this environment before. This experiment aimed to verify whether the autonomous ship could adapt to a marine traffic environment in which neighboring ships depart from the harbor at the same time.
Although the ILFAS, having learned with neighboring ships on the right-hand path, had not sufficiently learned how to respond to obstacles directly in front, its success rate to the destination was 10% higher than that of the random RL. The ILFAS complied with the human-defined navigation rules, kept to the right-hand path where possible, and did not overtake the neighboring ships in front. In contrast, the random RL did not keep its position among the outward ships; it stayed in the harbor and departed later, or departed using the empty space on the left. The graphs in Figure 13 compare the number of control actions; the ILFAS shows better control stability than the random RL despite the latter's greater experience.

4.5. Experiment: Complex Traffic (Complicated Marine Traffic Environment with Inward/Outward Ships)

The busy harbor conditions were implemented by adding inward and outward ships to the environment of Section 4.3. For a successful departure, the autonomous ship needs to control its speed properly between the outward ships or depart from the harbor after the entries into and departures from the harbor are completed. The RL using the ILFAS had not learned how to wait. By contrast, the random RL had not learned the navigation rules through neighboring ships, and there was no space on the left-hand path to get ahead of the other ships in the busy marine traffic environment. The experiment result is shown in Figure 14.
The busy marine traffic created by the ILFAS has a high degree of difficulty owing to the presence of both inward and outward ships. The circumstances are the same as in the high-difficulty curriculum, with a total of eight neighboring ships, because two cases are implemented simultaneously in the ILFAS. The autonomous ship could sail smoothly among the outward ships while keeping to the right as usual. In the experiment, although the autonomous ship trained by the ILFAS RL proceeded and showed the expected actions, in 30% of the episodes it crashed into the ship in front, failing to control its speed.
The autonomous ship using the random RL had learned only path finding in the shortest time, as in random learning. Its successful arrival rate at the destination was low because, compared with the experiment in Section 4.3, there was no space to get ahead: inward ships occupied the left, and outward ships moved slowly on the right front. The experiment indicated that it is difficult to induce an autonomous ship to learn human-defined navigation rules through unorganized random learning.

4.6. Learning and Experiment Results

Because RL algorithms cannot consider all situations, sufficient learning is usually provided by adding noise, such as to sensors and cameras, to an environment created with random rules, or by using a random environment for collision avoidance in unmanned ships [32,33]. However, it is difficult to solve the navigation issues of autonomous ships using random learning when dynamic obstacles occupy most of the environment and certain rules apply. Furthermore, even when the rules to learn are included, learning specific rules with the random method requires excessive time.
This study demonstrates that when the autonomous ship learns autonomous navigation in the space using the ILFAS, unnecessary experience is eliminated and the learning results are more stable than with general random learning. Additional learning can be performed in a new environment based on previous experience; moreover, the learning time was reduced and definite learning results were acquired. The autonomous ship can subsequently learn human-defined navigation rules using the ILFAS. To generalize the learning methods and verify the learning results, the ILFAS demonstrated better results in the autonomous navigation field in the new environment and relatively stable results in the ship control field.
Additional learning for the complex-traffic environment of Section 4.5 was then performed. Based on the successful arrival rate at the waypoints, learning was completed after only approximately 3400 episodes. Furthermore, because the intelligent neighboring ships sailed using a multi-agent RL algorithm, slight differences were always generated in the gaps between ships and in their actions. The autonomous ship could obtain sufficient learning data from such variations and sailed on the right among the outward or standing-by ships in the complicated marine traffic environment. Even when the autonomous ship waited in the harbor, the neighboring ships sailed in reaction to it; thus, the autonomous ship could learn how to wait properly in the harbor for departure.

5. Conclusions

This study has attempted to solve learning environment issues when applying RL to autonomous ships. If the environment is not intelligent, even with such a great expert and distinguished algorithm, and if the autonomous ship has to adapt itself to the real environment, then it cannot help facing the limits in learning. Particularly, if the neighboring ships are not intelligent, then there is a limit to the research on intelligent autonomous ships. Moreover, it requires too much time and cost to find an expert or data to consider all the issues and to make the autonomous ship learn data.
Therefore, this study aims to implement an intelligent environment that enables autonomous ships to acquire sufficient experience using RL in the general marine environment and to learn the events that humans cannot estimate by themselves. Furthermore, this study proposes the ILFAS to investigate whether an autonomous ship can learn from general issues to inherent human norms that cannot be numerically presented. This paper presents one of the solutions to solve the insufficient environmental problem of RL and transfer human knowledge and experience to autonomous ships. It can reduce RL learning time and lower the cost of building a learning environment. In conclusion, we built an intelligent learning framework that can obtain the learning results expected by humans in a short time and at low cost. The learned hierarchical RLs of stratified autonomous ships can be reused. When autonomous ships with the same control type are applied to another environment, only the navigation part is relearned in the mission, navigation, and control layers, and other layers can be reused.
In the defense field, cases could be built from an opponent’s naval infiltration strategies and tactics. In the civil sector, the environment could be applied to tasks such as the delivery of emergency medical supplies, similar to the experiment in this paper; here, however, we focused on learning human-defined rules. Research should continue so that autonomous ships can act effectively on behalf of humans in situations such as harsh natural environments and bad weather.

Author Contributions

Conceptualization, Methodology, and Writing—Original Draft Preparation, J.K.; Writing—Review & Editing, J.P.; Project Administration and Funding acquisition, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Future Challenge Program through the Agency for Defense Development funded by the Defense Acquisition Program Administration.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jaradat, M.A.K.; Al-Rousan, M.; Quadan, L. Reinforcement based mobile robot navigation in dynamic environment. Robot. Comput. Manuf. 2011, 27, 135–149. [Google Scholar] [CrossRef]
  2. Hester, T.; Quinlan, M.; Stone, P. RTMBA: A Real-Time Model-Based Reinforcement Learning Architecture for robot control. In Proceedings of the IEEE International Conference on Robotics and Automation, Saint Paul, MN, USA, 14–18 May 2012; pp. 85–90. [Google Scholar] [CrossRef] [Green Version]
  3. Specht, C.; Świtalski, E.; Specht, M. Application of an Autonomous/Unmanned Survey Vessel (ASV/USV) in Bathymetric Measurements. Pol. Marit. Res. 2017, 24, 36–44. [Google Scholar] [CrossRef] [Green Version]
  4. Rumson, A.G. The application of fully unmanned robotic systems for inspection of subsea pipelines. Ocean Eng. 2021, 235, 109214. [Google Scholar] [CrossRef]
  5. Zwolak, K.; Wigley, R.; Bohan, A.; Zarayskaya, Y.; Bazhenova, E.; Dorshow, W.; Sumiyoshi, M.; Sattiabaruth, S.; Roperez, J.; Proctor, A.; et al. The Autonomous Underwater Vehicle Integrated with the Unmanned Surface Vessel Mapping the Southern Ionian Sea. The Winning Technology Solution of the Shell Ocean Discovery XPRIZE. Remote Sens. 2020, 12, 1344. [Google Scholar] [CrossRef] [Green Version]
  6. Gu, Y.; Goez, J.C.; Guajardo, M.; Wallace, S.W. Autonomous vessels: State of the art and potential opportunities in logistics. Int. Trans. Oper. Res. 2020, 28, 1706–1739. [Google Scholar] [CrossRef] [Green Version]
  7. Knudson, M.; Tumer, K. Adaptive navigation for autonomous robots. Robot. Auton. Syst. 2011, 59, 410–420. [Google Scholar] [CrossRef] [Green Version]
  8. Carreras, M.; Yuh, J.; Batlle, J.; Ridao, P. A behavior-based scheme using reinforcement learning for autonomous underwater vehicles. IEEE J. Oceanic Eng. 2005, 30, 416–427. [Google Scholar] [CrossRef] [Green Version]
  9. Gaskett, C.; Wettergreen, D.; Zelinsky, A. Reinforcement learning applied to the control of an autonomous underwater vehicle. In Proceedings of the Australian Conference on Robotics and Automation, Brisbane, Australia, 20 March–1 April 1999; pp. 125–131. [Google Scholar]
  10. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  11. Tsividis, P.A.; Pouncy, T.; Xu, J.L.; Tenenbaum, J.B.; Gershman, S.J. Human learning in Atari. In Proceedings of the AAAI Spring Symposium on Science of Intelligence: Computational Principles of Natural and Artificial Intelligence, Palo Alto, CA, USA, 27–29 March 2017; pp. 643–646. [Google Scholar]
  12. Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A.S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J. Starcraft II: A new challenge for reinforcement learning. arXiv 2017, arXiv:1708.04782. [Google Scholar]
  13. Ammar, H.B.; Eaton, E.; Luna, J.M.; Ruvolo, P. Autonomous Cross-Domain Knowledge Transfer in Lifelong Policy Gradient Reinforcement Learning. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 3345–3351. [Google Scholar]
  14. Marcus, G. Deep learning: A critical appraisal. arXiv 2018, arXiv:1801.00631. [Google Scholar]
  15. Xu, X.; Lu, Y.; Liu, X.; Zhang, W. Intelligent collision avoidance algorithms for USVs via deep reinforcement learning under COLREGs. Ocean Eng. 2020, 217, 107704. [Google Scholar] [CrossRef]
  16. Yu, Y. Towards Sample Efficient Reinforcement Learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 13–19 July 2018; pp. 5739–5743. [Google Scholar]
  17. Lake, B.M.; Salakhutdinov, R.; Tenenbaum, J.B. Human-level concept learning through probabilistic program induction. Science 2015, 350, 1332–1338. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Zhang, L.; Qiao, L.; Chen, J.; Zhang, W. Neural-Network-Based Reinforcement Learning Control for Path Following of Underactuated Ships. In Proceedings of the 35th Chinese Control Conference (CCC), Chengdu, China, 27–29 July 2016; pp. 5786–5791. [Google Scholar] [CrossRef]
  19. Narvekar, S.; Peng, B.; Leonetti, M.; Sinapov, J.; Taylor, M.E.; Stone, P. Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey. J. Mach. Learn. Res. 2020, 21, 1–50. [Google Scholar]
  20. Glatt, R.; Da Silva, F.L.; Costa AH, R. Towards knowledge transfer in deep reinforcement learning. In Proceedings of the 5th Brazilian Conference on Intelligent Systems (BRACIS), Recife, Pernambuco, Brazil, 9–12 October 2016; pp. 91–96. [Google Scholar]
  21. Woo, J.; Yu, C.; Kim, N. Deep reinforcement learning-based controller for path following of an unmanned surface vehicle. Ocean Eng. 2019, 183, 155–166. [Google Scholar] [CrossRef]
  22. Woo, J.; Kim, N. Collision avoidance for an unmanned surface vehicle using deep reinforcement learning. Ocean Eng. 2020, 199, 107001. [Google Scholar] [CrossRef]
  23. Martinsen, A.B.; Lekkas, A.; Gros, S.; Glomsrud, J.A.; Pedersen, T.A. Reinforcement Learning-Based Tracking Control of USVs in Varying Operational Conditions. Front. Robot. AI 2020, 7, 32. [Google Scholar] [CrossRef] [Green Version]
  24. Chen, C.; Chen, X.-Q.; Ma, F.; Zeng, X.-J.; Wang, J. A knowledge-free path planning approach for smart ships based on reinforcement learning. Ocean Eng. 2019, 189, 106299. [Google Scholar] [CrossRef]
  25. Xu, H.; Wang, N.; Zhao, H.; Zheng, Z. Deep reinforcement learning-based path planning of underactuated surface vessels. Cyber Physical Syst. 2019, 5, 1–17. [Google Scholar] [CrossRef]
  26. Ye, Y.; Zhang, X.; Sun, J. Automated vehicle’s action decision making using deep reinforcement learning and high-fidelity simulation environment. Transp. Res. Part C Emerg. Technol. 2019, 107, 155–170. [Google Scholar] [CrossRef] [Green Version]
  27. Bécsi, T.; Aradi, S.; Fehér, Á.; Szalay, J.; Gáspár, P. Highway environment model for reinforcement learning. IFAC Pap. 2018, 51, 429–434. [Google Scholar] [CrossRef]
  28. Zhang, H.; Feng, S.; Liu, C.; Ding, Y.; Zhu, Y.; Zhou, Z.; Zhang, W.; Yu, Y.; Jin, H.; Li, Z. CityFlow: A Multi-Agent Reinforcement Learning Environment for Large Scale City Traffic Scenario. In Proceedings of the WWW ‘19: The Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 3620–3624. [Google Scholar] [CrossRef] [Green Version]
  29. Reda, D.; Tao, T.; van de Panne, M. Learning to Locomote: Understanding How Environment Design Matters for Deep Reinforcement Learning. In Proceedings of the ACM SIGGRAPH Motion, Interaction, and Games (MIG 2020), Virtual Event, 16–18 October 2020. [Google Scholar] [CrossRef]
  30. Bansal, T.; Pachocki, J.; Sidor, S.; Sutskever, I.; Mordatch, I. Emergent complexity via multi-agent competition. arXiv 2017, arXiv:1710.03748. [Google Scholar]
  31. Dulac-Arnold, G.; Mankowitz, D.; Hester, T. Challenges of real-world reinforcement learning. arXiv 2019, arXiv:1904.12901. [Google Scholar]
  32. Ye, C.; Yung, N.H.; Wang, D. A fuzzy controller with supervised learning assisted reinforcement learning algorithm for obstacle avoidance. IEEE Trans. Syst. Man, Cybern. Part B (Cybernetics) 2003, 33, 17–27. [Google Scholar] [CrossRef] [Green Version]
  33. Wang, D.; Fan, T.; Han, T.; Pan, J. A Two-Stage Reinforcement Learning Approach for Multi-UAV Collision Avoidance Under Imperfect Sensing. IEEE Robot. Autom. Lett. 2020, 5, 3098–3105. [Google Scholar] [CrossRef]
  34. Botvinick, M.; Ritter, S.; Wang, J.X.; Kurth-Nelson, Z.; Blundell, C.; Hassabis, D. Reinforcement learning, fast and slow. Trends Cogn. Sci. 2019, 23, 408–422. [Google Scholar] [CrossRef] [Green Version]
  35. Justesen, N.; Torrado, R.R.; Bontrager, P.; Khalifa, A.; Togelius, J.; Risi, S. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv 2018, arXiv:1806.10729. [Google Scholar]
  36. Narvekar, S.; Stone, P. Learning Curriculum Policies for Reinforcement Learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, Montreal, QC, Canada, 13–17 May 2019; pp. 25–33. [Google Scholar]
  37. Baker, B.; Kanitscheider, I.; Markov, T.; Wu, Y.; Powell, G.; McGrew, B.; Mordatch, I. Emergent tool use from multi-agent autocurricula. arXiv 2019, arXiv:1909.07528. [Google Scholar]
  38. Ontanón, S.; Mishra, K.; Sugandh, N.; Ram, A. Case-Based Planning and Execution for Real-Time Strategy Games. In Proceedings of the International Conference on Case-Based Reasoning, Belfast, Northern Ireland, 13–16 August 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 164–178. [Google Scholar]
  39. Weber, B.; Mateas, M. Case-Based Reasoning for Build Order in Real-Time Strategy Games. In Proceedings of the Artificial Intelligence and Interactive Digital Entertainment Conference, Palo Alto, CA, USA, 14–16 October 2009; pp. 106–111. [Google Scholar]
  40. Wender, S.; Watson, I. Integrating Case-Based Reasoning with Reinforcement Learning for Real-Time Strategy Game Micromanagement. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence, Gold Coast, QLD, Australia, 1–5 December 2014; pp. 64–76. [Google Scholar] [CrossRef]
  41. Hacohen, G.; Weinshall, D. On the power of curriculum learning in training deep networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 2535–2544. [Google Scholar]
  42. Ashrafiuon, H.; Muske, K.R.; McNinch, L.C. Review of nonlinear tracking and setpoint control approaches for autonomous underactuated marine vehicles. In Proceedings of the 2010 American Control Conference, Baltimore, MD, USA, 30 June–2 July 2010; pp. 5203–5211. [Google Scholar] [CrossRef]
  43. Woolsey, C. Review of Marine Control Systems: Guidance, Navigation, and Control of Ships, Rigs and Underwater Vehicles. J. Guid. Control. Dyn. 2005, 28, 574–575. [Google Scholar] [CrossRef]
  44. Wang, N.; Pan, X. Path following of autonomous underactuated ships: A translation–rotation cascade control approach. IEEE ASME Trans. Mechatron. 2019, 24, 2583–2593. [Google Scholar] [CrossRef]
  45. Ma, Y.; Hu, M.; Yan, X. Multi-objective path planning for unmanned surface vehicle with currents effects. ISA Trans. 2018, 75, 137–156. [Google Scholar] [CrossRef]
  46. De Paula, M.; Acosta, G.G. Trajectory tracking algorithm for autonomous vehicles using adaptive reinforcement learning. In Proceedings of the OCEANS 2015-MTS/IEEE, Washington, DC, USA, 19–22 October 2015; pp. 1–8. [Google Scholar]
  47. Singh, Y.; Sharma, S.; Sutton, R.; Hatton, D.; Khan, A. A constrained A* approach towards optimal path planning for an unmanned surface vehicle in a maritime environment containing dynamic obstacles and ocean currents. Ocean Eng. 2018, 169, 187–201. [Google Scholar] [CrossRef] [Green Version]
  48. Cheng, Y.; Zhang, W. Concise deep reinforcement learning obstacle avoidance for underactuated unmanned marine vessels. Neurocomputing 2018, 272, 63–73. [Google Scholar] [CrossRef]
  49. Wang, Y.; Tong, J.; Song, T.-Y.; Wan, Z.-H. Unmanned Surface Vehicle Course Tracking Control Based on Neural Network and Deep Deterministic Policy Gradient Algorithm. In Proceedings of the OCEANS-MTS/IEEE Kobe Techno-Oceans (OTO), Kobe, Japan, 28–31 May 2018; pp. 1–5. [Google Scholar] [CrossRef]
  50. Yan, N.; Huang, S.; Kong, C. Reinforcement Learning-Based Autonomous Navigation and Obstacle Avoidance for USVs under Partially Observable Conditions. Math. Probl. Eng. 2021, 2021, 1–13. [Google Scholar] [CrossRef]
  51. Zhou, X.; Wu, P.; Zhang, H.; Guo, W.; Liu, Y. Learn to Navigate: Cooperative Path Planning for Unmanned Surface Vehicles Using Deep Reinforcement Learning. IEEE Access 2019, 7, 165262–165278. [Google Scholar] [CrossRef]
  52. Barto, A.G.; Mahadevan, S. Recent Advances in Hierarchical Reinforcement Learning. Discret. Event Dyn. Syst. 2003, 13, 41–77. [Google Scholar] [CrossRef]
  53. Peng, X.B.; Berseth, G.; Yin, K.; Van De Panne, M. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph. 2017, 36, 1–13. [Google Scholar] [CrossRef]
  54. Kulkarni, T.D.; Narasimhan, K.; Saeedi, A.; Tenenbaum, J. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Adv. Neural Inf. Processing Syst. 2016, 29, 3675–3683. [Google Scholar]
  55. Krishnamurthy, R.; Lakshminarayanan, A.S.; Kumar, P.; Ravindran, B. Hierarchical Reinforcement Learning using Spatio-Temporal Abstractions and Deep Neural Networks. arXiv 2016, arXiv:abs/1605.05359. [Google Scholar]
  56. Morimoto, J.; Doya, K. Acquisition of stand-up action by a real robot using hierarchical reinforcement learning. Robot. Auton. Syst. 2001, 36, 37–51. [Google Scholar] [CrossRef]
  57. Tessler, C.; Givony, S.; Zahavy, T.; Mankowitz, D.; Mannor, S. A Deep Hierarchical Approach to Lifelong Learning in Minecraft. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  58. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual Multi-Agent Policy Gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  59. Han, W.; Zhang, B.; Wang, Q.; Luo, J.; Ran, W.; Xu, Y. A Multi-Agent Based Intelligent Training System for Unmanned Surface Vehicles. Appl. Sci. 2019, 9, 1089. [Google Scholar] [CrossRef] [Green Version]
  60. Li, R.; Wang, R.; Hu, X.; Li, K.; Li, H. Multi-USVs Coordinated Detection in Marine Environment with Deep Reinforcement Learning. In Proceedings of the International Symposium on Benchmarking, Measuring and Optimization, Seattle, WA, USA, 10–13 December 2018; pp. 202–214. [Google Scholar] [CrossRef]
  61. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
Figure 1. ILFAS scheme. An autonomous ship, a hierarchical RL algorithm, a case-based curriculum system that builds the curriculum and operates neighboring ships through case-based RL, a 3D geographical environment with height information, and a learning environment that integrates the elements above. The environment and plug-in can be replaced by other simulations.
Figure 2. Schematic of the inter-layer supervised learning of hierarchical RL. According to the goals defined by humans, the Mission layer analyzes the space and sets the destination (left). The set goal is transmitted to the Sailing layer, and a delayed reward is given according to the result. Sailing learns navigation as in Formula (3) from the observation in Formula (1), and the route consists of the actions in Formula (2) (center). After learning, the output of the Sailing layer becomes an observation of the Control layer, as shown in Formula (4); Control learns as shown in Formula (5) and acts on the environment through its output action a_c. In the Sailing layer, the control is rewarded by comparing the learned target value with the environmental change caused by Control’s action.
Figure 3. The curriculum is divided into a global curriculum and local curriculums. Local curriculums are arranged in temporal and spatial sequence within the global curriculum. When navigation learning to a given destination is completed, sailing with the learned navigation policy is used to train the control (autonomous control) layer.
Figure 4. (a) Other ships approaching head-on are added in front of the autonomous ship (black), which has learned obstacle avoidance and basic path finding through the local curriculum. (b) Comparison of the critic values V_w(s_t) from the first learning scenario and from 2000 learned scenarios. To extract the critic value, ε is fixed and the critic is evaluated before and after learning; the values before (blue) and after (red) the additional training are compared. After the additional training, the critic value is higher with fewer actions.
Figure 5. (a) 3D environment with height added, based on the 2D map of the actual harbor. (b) Adjustment of the ship size for the LiDAR simulation on the ship.
Figure 6. (a) Randomly selected destinations for the autonomous ship to learn departure (bottom right); destination coordinates are given as GPS positions. (b) Autonomous navigation after learning is completed; a sea route is selected through a greedy method. The red rectangle marks the zone of concern, where human-defined navigation rules must be applied because of frequent harbor entries and departures. (c) Learning converges from approximately eight million steps and stabilizes after 16 million steps (approximately one learning scenario = no. of learning steps/800).
Figure 7. (a) Randomly generated neighboring ships beginning to enter the harbor. (b) The autonomous ship performs many acceleration and deceleration controls to avoid the risk areas. (c) Learning results expressed as rewards and episode length; learning converges from approximately 25 million steps and stabilizes after 60 million steps (approximately one learning scenario = no. of learning steps/780).
Figure 8. (a) Learning through two cases of neighboring ships entering the harbor from two adjacent coastal areas; cases are run through the case-based curriculum system. (b) As an example of normal port operation, incoming neighboring ships enter the port along the right-hand path. (c) Learning converges from approximately three million steps and stabilizes after five million steps (approximately one learning scenario = no. of learning steps/700).
Figure 9. V(s) evaluated by the critic during learning: (a) random RL and (b) ILFAS RL. The random RL graph looks sharper because of its longer learning time; nevertheless, the ILFAS graph converges quickly even with few episodes. Convergence within a short learning time demonstrates that transfer learning is properly implemented. (Unit: value loss/learning steps.)
Figure 10. (a) Initial experiment environment set up for collision avoidance among inward ships; (b) ILFAS RL and (c) random RL selecting the right-hand path and successfully departing from the harbor. (d) Over each block of 100 episodes, the success rate exceeds 95% for ILFAS (green) and 90% for random RL (red).
Figure 11. Actions of the autonomous ship. The ship is controlled by [acceleration/deceleration] and [steering]; random and ILFAS RL are shown in red and green, respectively. The graph shows the number of steering and speed controls in the same environment and indicates the stability of autonomous ship control. ILFAS (green) shows a narrower range of action values than random RL (red).
Figure 12. (a) Initial experiment environment; ships are artificially arranged ahead and behind to adjust the interval between the four outward vessels. (b) ILFAS RL results: the ship sails on the right-hand path. (c) Random RL results: the ship waits in the harbor and then sails on the fastest path. (d) The success rate of reaching the destination over 100 episodes was 70% for the ILFAS (blue) and 60% for random RL (red).
Figure 13. Actions of the autonomous ship. The ship is controlled by [acceleration/deceleration] and [steering]; random and ILFAS RL are shown in red and green, respectively. Keeping the heading without controlling the speed indicates the stability of the autonomous ship. The ILFAS graph also shows decreased control stability in an environment that was not learned.
Figure 14. (a) Experiment with eight neighboring ships initially waiting to enter or depart from the harbor. (b) The autonomous ship keeps to the right owing to ILFAS learning; however, the arrival rate at the destination decreased because of poor speed control. The autonomous ship departed in compliance with the human-defined navigation rules. (c) Under random RL, the autonomous ship waits for all of the outward ships or goes ahead of inward or outward ships; it departs rapidly after waiting in the harbor because a collision is anticipated when the harbor gets busy. (d) The successful arrival rate at the destination over 100 episodes falls to 53% for the ILFAS (gray) and 30% for random RL (red).
