Continuous Autonomous Ship Learning Framework for Human Policies on Simulation

Abstract: For autonomous navigation in busy marine traffic environments (including harbors and coasts), the major issues to be solved for autonomous ships are avoidance of static and dynamic obstacles, surface vehicle control that accounts for the environment, and compliance with human-defined navigation rules. The reinforcement learning (RL) algorithm, which has demonstrated high potential in autonomous cars, has been presented as an alternative to mathematical algorithms and has advanced studies on autonomous ships. However, the RL algorithm, which learns through interactions with the environment, receives relatively little data from the marine environment. Moreover, the open marine environment, with its excessive degrees of freedom, makes it difficult for autonomous ships to learn human-defined navigation rules. This study proposes a sustainable, intelligent learning framework for autonomous ships (ILFAS), which helps solve these difficulties and teaches navigation rules specified by human beings through neighboring ships. The application of case-based RL enables humans to participate in the RL learning process through neighboring ships and enables the learning of human-defined rules. Cases built as curricula can achieve high learning effects with less data alongside the RL of layered autonomous ships. The experiment targets autonomous navigation from a harbor with marine traffic on a neighboring coast. The learning results using ILFAS are compared with those from an environment with random marine traffic. In the experiment, the learning time was reduced to a tenth. Moreover, in a new marine traffic scenario, the success rate of arrival at the destination was higher with fewer controls than with the random method. ILFAS can continuously respond to advances in ship manufacturing technology and changes in the marine environment.


Background
Artificial intelligence for autonomous ships (unmanned surface vehicles (USVs) and autonomous surface vehicles (ASVs)) has been extensively investigated in the private sector (including bathymetric measurement, subsea pipeline management, marine geography surveys, and safety management) and the military sector (including patrol, security, intrusion detection, blockage, and defense). Artificial intelligence has proven its benefits in solving control issues related to operating ships in an unstable natural environment, reducing risk, and enhancing efficiency under collaborative or competitive conditions with other ships in the real ocean [1][2][3][4][5][6].
Operating autonomous ships is more difficult than operating self-driving cars because it is difficult to obtain explicit decision-making data from the environment. Self-driving cars learn driving rules from definite environmental data, such as lanes and traffic lights; dynamic obstacles around self-driving cars are then considered when deciding which actions to take. However, in the case of an autonomous ship, it is difficult to obtain data regarding rules from the environment, and the rules must therefore be learned from the neighboring ships operating around it.

Contributions
The training framework proposed in this study differs from those proposed in previous studies in the following ways.
(1) The learning environment is made intelligent by applying case-based RL, which can embed human-defined navigation rules in the environment. This intelligent learning environment solves the problem of insufficient environmental data in the marine environment through neighboring intelligent ships with rules, and it is extensible to a wide range of simulators.
(2) The problems that the autonomous ship needs to solve are clarified, and learning time is saved because global and local curricula are presented according to the degree of difficulty. The learning results are enhanced with less data owing to self-play curriculum learning, which also facilitates a quick response to changes in a harbor's marine traffic policies and in the marine environment.
(3) The decision-making spaces of autonomous ships are simplified and clarified through hierarchical classification. The target value, state data, assessment, and compensation standards are automatically established for the supervised interlayer learning structure.
The stratified RL algorithm and the intelligent learning framework can consistently cope with changes in the marine space where the RL algorithm is examined and operated, as well as with changes in ship control approaches driven by developments in shipbuilding technology.

Learning Method Based on Curriculum
The approach of implementing real-world data in simulations, such as game environments, for solving outstanding problems has been widely adopted in studies on RL algorithms [16][17][18]. However, this approach is demanding because the environment has to be implemented by experts to enable the learning of certain rules; it therefore requires a large amount of quality learning data and extensive cost, time, and labor [29][30][31]. Because it is difficult to provide a perfect environment, such as a game, some studies have used a random scenario environment as the general learning method of the RL algorithm [32,33]. A random scenario means a scenario is selected randomly among data obtained from the real world. However, such randomization provides only a weak inductive bias for the stability of learning when autonomous ships depend significantly on dynamic obstacles, and when environmental data change without considering the movement of the autonomous ship, learning is not performed properly. To obtain the desired learning results in the RL learning process, problems are classified by degree of difficulty, and environmental data gradually become more complicated and are provided consistently; accordingly, an autonomous ship can solve difficult issues in reality [19,34]. As RL algorithms advance, it becomes necessary to solve more complicated issues, and curriculum learning is becoming a general learning approach.
Among learning approaches using the curriculum learning environment, the task-specific, self-play, and teach-student approaches can be implemented without any modification of the RL algorithm [20]. The task-specific approach is the most common form of curriculum learning: it configures the curriculum with problems ordered by degree of difficulty and adjusts the difficulty as learning progresses to train the RL algorithm. A task-specific approach was proposed to solve the problem that autonomous ship learning could not properly adapt to a new environment owing to the overfitting of the RL algorithm in a fixed environment. This approach achieves good performance with less data by controlling the degree of difficulty of the problems [35,36]. Global space learning is performed using the curriculum at a low degree of difficulty; hence, autonomous ships travel to a destination without dynamic obstacles in the learning environment. Overfitting, which may occur in the learning process, is prevented by providing a curriculum with busy marine traffic based on the degree of difficulty. Moreover, the curriculum classified into local events can solve the catastrophic forgetting issue of losing previous learning experiences.
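A task-specific curriculum of the kind described above can be sketched as a simple scheduler that unlocks harder scenario pools as the agent improves. This is a minimal illustration, not the paper's implementation; the class name, the promotion threshold, and the evaluation window are assumptions.

```python
import random

class TaskSpecificCurriculum:
    """Task-specific curriculum sketch: scenarios are grouped by difficulty,
    and the next level is unlocked only after the recent success rate clears
    a threshold. Names and thresholds are illustrative, not ILFAS's."""

    def __init__(self, levels, promote_at=0.8, window=100):
        self.levels = levels          # e.g. [[easy...], [medium...], [hard...]]
        self.promote_at = promote_at  # success rate required to advance
        self.window = window          # episodes evaluated per decision
        self.level = 0
        self._results = []

    def sample_scenario(self):
        # Draw a scenario only from the current difficulty level.
        return random.choice(self.levels[self.level])

    def report(self, success):
        # Record an episode outcome; promote once the window fills
        # with a high enough success rate.
        self._results.append(success)
        if len(self._results) >= self.window:
            rate = sum(self._results) / len(self._results)
            if rate >= self.promote_at and self.level < len(self.levels) - 1:
                self.level += 1       # unlock the next difficulty
            self._results = []
```

Starting every agent in a calm, obstacle-free pool and promoting it toward busy-traffic pools mirrors the global-to-local progression described in the text.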
Neighboring ships are provided in real time during the learning of the autonomous ship and operated in relation to it. Self-play learning, among curriculum learning approaches, makes agents compete with one another in the same environment and learn from one another with different goals and compensations. Although self-play learning uses the same RL algorithm, it induces competitive learning because different agents detect and learn the strategies of other agents. Continuous learning through the self-play approach allows agents to fully learn environments with rich experiences through actions acquired from interactions with other agents. Furthermore, the actions of agents can develop into human-related skills, such as intrinsic motivation, through interactive learning among agents [20,30,37]. However, the competitive environment in self-play learning needs to include strategies that facilitate learning in the curriculum, and these strategies should include human-defined navigation rules. To maintain a rich learning environment while including human-defined navigation rules, an approach for engaging with rules is required. Case-based learning or separate scripts have been used in some cases to define experts' knowledge of the rules. Moreover, accumulated experience has been extracted from demonstrations by experts and applied to the planning and implementation of games [38][39][40]. When case-based learning is applied to the RL algorithm of neighboring ships in the ocean, the self-play curriculum learning method can be applied simultaneously. Neighboring ships show intelligent navigation because only the cases with minimum goals and actions are presented, to enhance their autonomy. The environment thus includes the human-defined navigation rules that autonomous ships should learn.
The RL algorithm of the autonomous ship, stratified in the learning environment, can combine autonomous and supervised learning among layers using the teach-student approach, a curriculum learning approach. The teach-student approach automatically presents problems of appropriate difficulty depending on the learning progress of the RL algorithm: it adjusts the degree of difficulty based on the loss rate of the RL algorithm and assesses the divergence of the learned weights. As described above, the teach-student approach induces transfer learning based on previous experience and degrees of difficulty. Furthermore, it engages the RL algorithm in learning through bootstrapping [20,41]. The stratified algorithm transfers the standard target values between layers, assesses them through external compensation depending on the degree of achievement, and induces learning. In this approach, a layer can automatically learn from an already-learned layer through supervised learning.
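The loss-driven difficulty adjustment described for the teach-student approach can be sketched as a small rule: raise the difficulty when the student's loss indicates mastery, lower it when the loss diverges. The thresholds and the difficulty scale below are assumptions for illustration, not values from the paper.

```python
def teacher_adjust(difficulty, recent_losses, mastered=0.05, diverged=0.5,
                   step=1, max_difficulty=10):
    """Teach-student sketch: the 'teacher' raises task difficulty when the
    student's loss is low (mastered) and lowers it when the loss diverges.
    Thresholds and the difficulty scale are illustrative assumptions."""
    avg_loss = sum(recent_losses) / len(recent_losses)
    if avg_loss < mastered:                      # student has converged
        return min(difficulty + step, max_difficulty)
    if avg_loss > diverged:                      # learning is diverging
        return max(difficulty - step, 0)
    return difficulty                            # keep practicing
```

Called once per evaluation window, this keeps the presented problems just beyond the student's current competence, which is the intent of the teach-student curriculum.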
This study configured the learning environment by expressing ships, including navigation rules, as cases and implementing RL in real time. The learning environment comprises curriculum learning, including the self-play approach, to induce transfer learning by solving problems gradually from a lower to a higher degree of difficulty, so that the autonomous ship learns fully by itself.

RL for a Hierarchical Autonomous Ship Task
Autonomous ships performing missions in the ocean in place of human beings have been investigated in various fields. The control and route search of a ship began with navigating a ship to a place where human beings wanted to go by controlling direction and power based on the natural environment, including wind, waves, and tides. Ship navigation was approached through physics and control algorithms based on mathematical algorithms [42][43][44]. The A* algorithm was also adopted to plan routes to waypoints and given destinations in a given space, while filtering of noise collected through sensors, compensation for interference and slipping of a ship in the ever-changing natural environment, and real-time adaptive ship control were addressed using Gaussian process regression [45][46][47]. RL has emerged as an important technology by replacing complicated mathematical algorithms and solving these problems. It has demonstrated high potential for applications in ship movement control, posture control against the natural environment, collision avoidance, and path finding [8,9].
Studies on autonomous ships using the RL algorithm in the control field were applied to obstacle avoidance, decision on access to targets, speed control, and posture control through value-based Q-learning [48]. The algorithm was also investigated in the ship control field by applying a policy-based deep deterministic policy gradient (DDPG) [49]. Moreover, studies have been conducted on learning the posture control model of a ship in real-time by configuring the given route based on the DDPG and tracking the configured route [21]. The studies above dealt with the movement issue of a ship in the local space to consider the global space or surrounding conditions depending on the necessity, focusing on the ship's posture. Accordingly, the local space in the ship control field can be defined by the natural environment, which affects the ship control and present state of a ship. Therefore, the supervision space and obstacle avoidance that must consider the given global space can be separated from the local space.
Furthermore, studies on autonomous ships using the RL algorithm in the navigation field were applied to Q-learning-based smart ships, path finding, and ship control. The environments (including the distance to a destination and punishment on obstacles) and restricted areas were implemented based on the Nomoto model in a narrow waterway. Accordingly, autonomous ships learned human-defined rules in those environments [25]. The RL algorithm was also applied to decision-making on autonomous navigation under partial observation considering a real environment [50]. The multi-agent algorithm was applied to maintain the formation during navigation and path finding, in which several autonomous ships sailed [51]. Studies on the navigation section aimed at the movement of a ship to a destination in a given marine space. Accordingly, they focused on the acquisition of state data of a space through sensors, route planning considering other ships, and obstacle avoidance. Similar to the studies on the control field, studies on the navigation sector also included the control field depending on the circumstances.
The curse of dimensionality of the RL algorithm can be avoided by separating the complicated issues to be solved by autonomous ships, as described above. Through stratification, the knowledge domain per problem can be separated, and the Markov decision process (MDP) of the hierarchical structure can be configured [52]. The most significant merits of stratification are saving the learning time for actions and control of the autonomous ship and producing definite learning results [53]. The hierarchical structure enables supervised/non-supervised interlayer learning and has the advantages of learning speed and generalization by simplifying the algorithm [54,55]. Moreover, it induces cooperation among the separated layers. The algorithm in the higher layer uses the discrete sequence of sub-goals in the lower-dimensional state space as the learning data to achieve its main goals. The algorithm in the lower layer can solve complicated control issues through the cooperative hierarchical structure by learning the local route in the original higher layer's state space to achieve the goals designated at the higher level [56]. The hierarchical approach has also been adopted as a lifelong learning framework, which separated diverse units of skills into lower-level issues that were selected and applied as necessary; it has been proposed as a framework applying RL to solve various issues that could be faced in real environments [57].
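The higher-layer/lower-layer cooperation described above can be sketched with a toy two-level loop: the higher layer emits a discrete sequence of sub-goals, and the lower layer executes local steps until each sub-goal is reached. The classes and the greedy grid step are illustrative stand-ins for the learned policies.

```python
class MissionLayer:
    """Higher layer: emits a discrete sequence of sub-goals (waypoints) in a
    coarser state space, standing in for a learned Mission policy."""
    def __init__(self, subgoals):
        self._queue = list(subgoals)

    def next_subgoal(self):
        return self._queue.pop(0) if self._queue else None


def step_toward(pos, goal):
    """Lower-layer stand-in: one greedy grid step toward the sub-goal."""
    sign = lambda d: (d > 0) - (d < 0)
    return (pos[0] + sign(goal[0] - pos[0]), pos[1] + sign(goal[1] - pos[1]))


def run_hierarchy(start, subgoals, max_steps=100):
    """Drive the lower layer through the higher layer's sub-goal sequence."""
    mission, pos = MissionLayer(subgoals), start
    goal = mission.next_subgoal()
    for _ in range(max_steps):
        if goal is None:
            break
        pos = step_toward(pos, goal)
        if pos == goal:
            goal = mission.next_subgoal()   # sub-goal reached; request the next
    return pos
```

The point of the decomposition is visible even in this sketch: the lower layer only ever reasons about the current local target, while the global plan lives entirely in the higher layer.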
This study divided and interconnected the issues related to autonomous ships into autonomous missions, navigation, and control. It applied autonomous navigation to the experiment to learn human-defined navigation rules.

Intelligent Learning Framework for Autonomous Ships
This study proposes an approach to create an intelligent environment in which autonomous ships achieve human-level learning through an RL algorithm. Neighboring ships, which account for the majority of the decision-making an autonomous ship must learn, are given autonomous reactions to obstacles and expressions of human-defined navigation rules by applying case-based RL, so that they embody human knowledge. The experiment defined human experience in operating ships in the ocean as cases and adopted a curriculum learning approach that could adjust the degree of difficulty of complicated marine traffic conditions. In this learning environment, the autonomous ship learned to handle various issues occurring in the ocean through self-play and neighboring ships that follow human-defined navigation rules. The RL algorithm of autonomous ships, hierarchically separated per decision field, learns through the interaction among the mission, navigation, and control fields and the environment. Moreover, the upper layer enables the lower layer to learn through the interlayer teach-student curriculum learning approach.
This study adopts multi-agent posthumous credit assignment based on counterfactual multi-agent policy gradients (COMA) as the RL algorithm applied to an autonomous ship [58]. Autonomous ships and neighboring ships are operated by the same RL algorithm through ILFAS's self-play curriculum learning. Recently, research has considered applying multi-agent RL to autonomous ships, both to perform missions stably through one or more autonomous ships and to perform complex missions through the cooperation of autonomous ships; ILFAS enables multi-agent-based autonomous ships to be trained in line with this research direction [59,60]. Multi-agent RL is based on the actor-critic structure, and when applied to a single autonomous ship, it behaves the same as a general single agent. Because autonomous control is limited to the issues of a single ship, it is explained based on proximal policy optimization, an RL algorithm for a single agent [61]. The same RL algorithm used for an autonomous ship was also adopted to express the diverse vessels operating in the environment. This paper explains the algorithm based on COMA and compares and evaluates the learning method for neighboring ships considering the performance of ILFAS.

Architecture
The learning framework (ILFAS) proposed in this study comprises three parts: the hierarchical RL frame applied to autonomous ships, the case-based curriculum system that induces the learning of autonomous ships, and the environment used in the learning. The physical environment was created in three dimensions (3D) with a fixed height for the LiDAR simulation using Unity. Moreover, a basic physics model for the acceleration of ship weight and surface sliding in the ocean was adopted. The elements are explained from left to right in Figure 1 to facilitate the understanding of the learning frame.

The hierarchical learning structure comprises the autonomous mission (Mission), which reduces the learning complexity the autonomous ship must handle, autonomous sailing (Sailing), and autonomous control (Control), which makes decisions and operates various vessels. Considering Sailing in Figure 1, the autonomous ship learned through transfer learning using the case-based curriculum system, and Control learned through supervised learning based on Sailing. After Sailing was learned, Sailing(a, s_t, s_t+1, r), indicating where the autonomous ship should move, could be presented to Control. Therefore, the expected action value a (the direction and speed) from Sailing could also be predicted. Sailing's action value a and current state s_t are the Target and State values of Control, which were provided along with environmental data, such as wind. Through Control's actions in the learning process, the autonomous ship operated in the environment. The operation result of the autonomous ship was obtained as s_t+1 of Sailing, and Control learns from the difference from the expected action value, returned to Control as reward r. s_t, the environment data obtained through Sailing and the environment plug-in, becomes the state and target value of Control. The state was transferred to Control, the autonomous control, and the actions from Control are evaluated and learned by the autonomous ship.

Mission is a general RL field related to space planning.
For example, the marine space was divided based on the supervision scope of autonomous ships for ocean surveillance. The division of the marine space and movement location in sequence was determined for collecting marine data. A destination was delivered to Sailing through the spatial data input from the environment, and the learning results of Sailing were evaluated based on the successful arrival to the destination.
A case-based curriculum system configured the curriculum with the degrees of difficulty and space by defining cases of neighboring ships to be operated in the ocean. The curriculum was classified into the global curriculum that learns the entire given space and the local curriculum based on the sequence of time and space. The curriculum was provided in the environment in real-time. The cases stated in the curriculum were provided based on the decision-making order of neighboring ships. Moreover, the RL algorithm implementing the cases already learned the movement between waypoints.

Hierarchical RL Frame
This study hierarchically configured the RL of an autonomous ship to enable its operation in reality. The correlation between separate layers was explained using Sailing and Control as examples. RL based on a multi-agent algorithm was applied to Sailing for learning autonomous navigation, including path finding, collision avoidance, and human-defined navigation rules considering the operation of other ships. Control learned the actions (a s ) to control the ship based on the environment transferred from Sailing. This process is consistent with the control methods and types of ships. The RL algorithm can be applied to existing ship and posture controls.
The state input and action output of Sailing are explained.
The state observed by the autonomous ship for the decision-making of Sailing is defined in Formula (1). Self is the ship state; Env is transferred from the environment and includes the data from the natural environment simulator. The absolute coordinates and relative direction of destination G are calculated. O, obtained by LiDAR, is the distance to static obstacles, such as an embankment, and to dynamic obstacles (the neighboring ships operating nearby); the data were acquired through short-range obstacle detection in the environment.
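As a concrete illustration, the observation of Formula (1) can be assembled as a flat vector from the own-ship state, the relative goal, and the LiDAR ranges. The field layout and function name below are assumptions for the sketch, not the paper's exact encoding.

```python
import math

def build_sailing_observation(self_state, goal, lidar, env=None):
    """Sketch of the Sailing observation from Formula (1): own-ship state
    (Self), relative destination G (distance and bearing), and LiDAR ranges
    to obstacles (O). The field layout is an illustrative assumption."""
    sx, sy, heading = self_state
    gx, gy = goal
    dist = math.hypot(gx - sx, gy - sy)
    # Bearing to the goal, relative to the ship's heading (radians).
    bearing = math.atan2(gy - sy, gx - sx) - heading
    obs = [sx, sy, heading, dist, bearing]
    obs.extend(lidar)                # short-range obstacle distances
    obs.extend(env or [])            # optional plug-in data (wind, tide)
    return obs
```

Making the goal relative (distance and bearing) rather than absolute is what lets the same policy generalize across the random coastal-water waypoints used in the curriculum.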
The action (a_s) was declared discretely to learn Sailing. Formula (2) limits the maximum range that one output action can change at a time, such as (10 knots, 5 degrees). The range that can be changed at a time varies by vessel type and is the criterion for evaluating and rewarding the actions of Control.
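The per-step action limit of Formula (2) amounts to clamping the requested change in speed and heading to a vessel-dependent range. The function below is a minimal sketch with the (10 knots, 5 degrees) example as defaults; the parameterization is an assumption.

```python
def clamp_sailing_action(requested, max_dspeed=10.0, max_dheading=5.0):
    """Sketch of the per-step action limit from Formula (2): one Sailing
    action may change speed/heading by at most (10 knots, 5 degrees) here;
    the limits vary by vessel type, so they are parameters."""
    dspeed, dheading = requested
    dspeed = max(-max_dspeed, min(max_dspeed, dspeed))
    dheading = max(-max_dheading, min(max_dheading, dheading))
    return (dspeed, dheading)
```

Because the clamped change is also the target handed to Control, the same limits double as the yardstick for Control's evaluation and reward.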
For O_Obstacle by LiDAR, the single-layer LiDAR inputs the obstacle detection data around the autonomous ship, and other sensors can be added when required. The relative location value of the autonomous ship was calculated using the obstacle and destination data, as in the real environment. The network of the autonomous ship layer for path finding is defined in Formula (3). Regarding the reward on Sailing, when the autonomous ship successfully arrives at the destination provided in the mission, the values for the goal and a collision are 1 and 0.001, respectively, and the discount rate (γ) is 0.0001. The state value is received from the environment. The Sailing network is shown in Formula (3).

After learning, the Sailing layer is ready to perform supervised learning for Control. The observation value of Control is the state value of Sailing and is defined in Formula (4). Self (Others) is the present state value, including the gradient of the ship. Env (Option) is selectively applied based on the natural environment simulation, such as a plug-in; this value is the vector of the wind direction and tide, and its value relative to the autonomous ship is calculated as shown in Formula (4).

Env, the natural environment, was provided through Sailing, and the network is shown in Formula (5). The action (a_s) determined in Sailing is defined as the target value task_target = (±10 knots, ±5 degrees) of Control. If a_s = (Keep), the value zero ("0") is transferred. The target value of Control is given as Goal a_s, which is the action to be taken by Sailing.
The action (a_c) of Control, the autonomous control, is retransferred to Sailing and implemented, and Sailing(s_t+1) is received from the environment to calculate the reward as expressed below:

Network π_c, Parameter θ_c, Goal a_s, Reward r = -((s_t, a_s) - (s_t+1, a_c))^2, Action a_c (5)

The reward of Control is Reward r in Formula (5), calculated from the difference between the action to take in the given Sailing state and the state after applying the action from Control. The action expected by Sailing was taken considering the natural environment in Control, and learning was performed in the Control layer to maximize the reward. Accordingly, the tilt and rollover state of a ship, based on the ship's control method and the physical features of the hull, was acquired from the environment in Control.
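Control's supervision signal in Formula (5) can be sketched as a negative squared deviation between the state Sailing expected and the state Control actually reached. The vector encoding below is an illustrative assumption.

```python
def control_reward(expected_state, achieved_state):
    """Sketch of Control's reward from Formula (5): the negative squared
    deviation between the state Sailing expected after its action and the
    state actually reached after Control acted. Maximizing this reward
    drives Control to realize Sailing's intent under the environment."""
    err = sum((e - a) ** 2 for e, a in zip(expected_state, achieved_state))
    return -err
```

The reward is 0 only when Control exactly realizes Sailing's target, and it decreases as wind, tide, or hull dynamics push the achieved state away from it, which is precisely the deviation Control must learn to compensate for.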
Contrary to the hierarchical structures connected for learning in previous studies, Sailing and Control separately generate actions and have independent learning structures. The actions (a_s) of Sailing were applied to the environment without any modification, and learning proceeded. When navigation to a destination went smoothly after the completion of Sailing learning, the teach-student curriculum method implementing the learning of Control was applied.

Case-Based Curriculum System
The operation case of neighboring ships in the simulation comprises the following. An event defined in a case is the detailed information to be implemented toward the final goal. The case consists of waypoint coordinates based on the time schedule and states the detailed actions toward a destination or after movement, as described in Formula (7).

Action is defined as [Action Script] or [Parameter] and configured as [Stay, Cycle, etc.] and [Speed, Random, etc.]. Speed and Random are applied to ship control on the way to Goal_waypoint: random noise is added to the action to vary the speed or produce an unstable track, representing a ship that finds it difficult to sail straight, such as a yacht. Stay and Cycle are executed at Goal_waypoint: a ship stays on the ocean at a destination for a designated time or sails in a circle within 10 m of the destination. Actions can be added depending on the purpose of the training.

The RL agent for neighboring ships transfers the ship control authority to the action in the action script per event or uses it as the reference value for navigation. Single or multiple cases can be applied simultaneously, and the number of ships to arrange is designated; therefore, the condition Allocated Ship Num ≥ Use Case Num must be satisfied. Cases were defined by a human operator considering the number of ships, coordinates, and actions to be executed. Each case was saved in the case-based system database. Cases included human-defined navigation rules and could record and define the ship operation state in harbors and coasts.
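A case of this form might be represented as the following sketch, following the timed-waypoint plus [Action Script]/[Parameter] structure described above. The keys, action names, and the event schedule are illustrative, not the paper's database schema.

```python
# Sketch of a neighboring-ship case as stored in the case-based system
# database (cf. Formula (7)); keys and values are illustrative assumptions.
case = {
    "ship_type": "fishing_boat",
    "events": [
        # Each event: a scheduled time, a waypoint, and the action to apply,
        # given as (Action Script, Parameter).
        {"t": 0,   "waypoint": (120.0, 40.0), "action": ("Speed", 8)},
        {"t": 300, "waypoint": (80.0, 75.0),  "action": ("Random", 0.3)},
        {"t": 600, "waypoint": (50.0, 90.0),  "action": ("Stay", 120)},
    ],
}

def next_event(case, sim_time):
    """Return the first event whose scheduled time has not yet passed,
    or None once the final event is over (the case ends)."""
    for ev in case["events"]:
        if sim_time <= ev["t"]:
            return ev
    return None
```

Handing the RL agent of a neighboring ship one such event at a time is what keeps the cases minimal: only the goal and the action script are specified, and the agent's learned navigation fills in the rest.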
This study classifies the spaces for implementing the curriculum based on defined cases, as shown below.
Global Curriculum = [Path skill set for Global space] (8)
Local Curriculum = [Rule skill set for Local space] (9)

Definition (8) includes path finding and fixed obstacle avoidance as the global content for the autonomous ship to learn. The experiment selected a harbor with busy marine traffic, and the curriculum started with learning to verify the route from the anchorage harbor to coastal waters. In this learning process, random waypoints in coastal waters were provided for departure. The autonomous ship learned in harbors with fixed obstacles and with other obstacles in the coastal waters for departure and arrival.
Definition (9) is the intensive learning content in the local space based on the navigation time schedule for autonomous ships and is provided sequentially. The autonomous ship stands by for the port entry of neighboring ships or keeps to the right in the waterway, depending on the complexity of the marine traffic environment configured with neighboring ships in the harbor. It comprises the movement to a final destination while avoiding moving fishing boats in coastal waters (anglers' boats frequently move around, and small, fast boats operate in marine sports zones). The inward voyage was implemented in reverse order.
Curriculum learning started from a low degree of difficulty, finding a destination while avoiding static obstacles in an environment without dynamic obstacles, and progressed to a higher degree of difficulty consisting of a busy marine traffic environment in which the number and actions of neighboring ships increased. As the degree of difficulty rose, the teach-student method was additionally implemented after the completion of Sailing learning for the Control of the autonomous ship. The difficulty of Control's task increased as the environment became more complicated.
The RL algorithm applied to the cases was implemented in a given environment with waypoints and destinations transferred from the case-based curriculum system. The busy marine traffic environment was created as the neighboring ships, operated by the multi-agent RL algorithm, avoided collisions. Various types of cases were implemented simultaneously, as demonstrated below.

The RL algorithm implementing cases is almost the same as the algorithm applied to an autonomous ship. The observation value (o_case ∈ O) is shown in Formula (10). The subsequent waypoint was received sequentially through the case-based curriculum system for G; on arrival at the final G, the case ends. The action (a_c) is shown in Formula (11).

Although the same RL algorithm was applied, the agents could learn strategies through self-play curriculum learning with different goals and rewards and learn human-defined navigation rules through the learning process. When the diversity and number of learned autonomous ships increased, the RL algorithm for neighboring ships applied to the case-based curriculum could be used for additional learning. Self-play curriculum learning demonstrates good learning results through relatively continuous learning.

ILFAS Training
Although RL algorithms have been continuously developed, they must avoid local optima while maximizing rewards. Other researchers frequently use separately designed reward functions to achieve these goals quickly. However, this study aims to train an RL algorithm designed around generally delayed and immediate rewards.
The algorithm used is the policy-based multi-agent RL algorithm COMA. Its policy gradient uses the Q function as the advantage in an actor-critic configuration: the actor updates the policies and the critic evaluates them; policies are updated simultaneously and rewards are maximized. The RL applied to multiple ships updates the actor part, log π_θ(a_t|s_t), using Formula (9), and the critic A^{π_θ}(s_t, a_t), comprising the Q function. Policies and rewards were each updated as critics in Formula (12), as shown in Formula (13), with the value function approximated and the critic weights updated as w ← w + α∇_w V. The actor and critic performed transfer learning by learning the changed environment based on Formulas (14) and (15).
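The formulas referred to above are not legible in this copy; a plausible reconstruction of the standard actor-critic policy gradient with the Q function as the advantage, consistent with the surviving fragment w ← w + α∇_w V, is the following (the exact forms of Formulas (12)-(15) may differ):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
    A^{\pi_\theta}(s_t, a_t)\right],
\qquad
A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t),
```

```latex
\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t),
\qquad
w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w V_w(s_t),
\qquad
\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t).
```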
This study configured the curriculum according to Definitions (8) and (9). The local curriculum solves more difficult issues during learning, while the global curriculum environment persists. In the RL of autonomous ships interacting with dynamic obstacles, path planning for autonomous navigation is learned in the global curriculum, and the navigation rules for collision avoidance are learned in the local curriculum. When Sailing learning is completed, as shown in Figures 2 and 3, Control learning starts. The following example illustrates a curriculum configured for autonomous ships.
Transfer learning is based on baseline training, consisting of pathfinding without obstacles. After baseline learning was completed, the policy was adjusted through Policy Iteration, as shown in Formula (16); the adjustment applies the probability ratio r(θ).
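Formula (16) itself is not reproduced in this copy. If the adjustment by the probability ratio r(θ) follows the common clipped-surrogate form used in proximal policy optimization, it would read as follows; this is an assumption, not the paper's stated formula:

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[\min\bigl(r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\bigl(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\bigr)\,\hat{A}_t\bigr)\right].
```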
After the policies were adjusted through transfer learning, as shown in Formula (16), the reward trajectory R(τ) was generated by the policy π_θ with policy parameter θ; the trajectory consists of {S_1, A_1, R_2, S_2, A_2, R_3, S_3, A_3, . . . , S_n}. After global learning, a new environment in time and space was presented to the autonomous ship through the local curriculum; consequently, R(τ) scattered and diffused when learning failed. To prevent the previous learning experience from being lost (catastrophic forgetting), the previously learned cases should be provided together as the degree of difficulty increases.
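The trajectory structure {S_1, A_1, R_2, ..., S_n} can be sketched as follows; `DummyEnv` and the random policy are illustrative stand-ins for the harbor simulation and π_θ, not the paper's implementation:

```python
import random

class DummyEnv:
    """Illustrative stand-in for the harbor simulation environment."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # initial state S_1

    def step(self, action):
        self.t += 1
        next_state = self.t
        reward = 1.0  # placeholder reward signal
        done = self.t >= self.horizon
        return next_state, reward, done

def rollout(env, policy):
    """Collect one trajectory as the flat sequence {S_1, A_1, R_2, S_2, A_2, R_3, ..., S_n}."""
    traj = []
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        traj.extend([s, a, r])  # S_t, A_t, R_{t+1}
        s = s_next
    traj.append(s)  # terminal state S_n
    return traj

traj = rollout(DummyEnv(), policy=lambda s: random.choice([0, 1]))
```

Each episode yields 3 entries per step plus the terminal state, which is the set the paper describes the trajectory as consisting of.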
With the simple experiment below, we examined how transfer learning is implemented in an autonomous ship.
The experiment illustrates adjustment without catastrophic forgetting through curriculum learning and how learning time is saved. It uses a new curriculum in which the autonomous ship (black), having learned basic pathfinding and obstacle avoidance in a given space (harbor map), must avoid a new ship (A) when they encounter each other. For the global environment given in the global curriculum, the graph in Figure 4b shows the values before (blue) and after (red) learning.
The values are the number of control signal commands from the autonomous ship, the number of steps to a destination, and the reward values. Autonomous navigation to a destination can be performed based on the previous learning experience of obstacle avoidance even before additional learning; however, in the new environment the reward values are non-smooth and unnecessary controls occur. Through transfer learning, the autonomous ship can obtain more rewards with fewer commands, avoiding the other vessels encountered, after an additional learning of 2000 steps.

Experiment
The computer simulation was conducted in a virtual environment built on a real harbor in Korea. For two hierarchical RL algorithms, the experiment compared random autonomous ship learning (the general RL method, in an environment with neighboring ships sailing around at random) with autonomous ship learning by the ILFAS. Here, "random" means random scenarios that give the RL algorithm broad experience through sufficient exploration and exploitation; we compared our results with random scenarios because they can cover any situation. Moreover, the stability of ship control was compared based on the success rate of autonomous navigation to a destination and the actions performed during autonomous navigation in the new environment. Finally, the experiment examined whether an autonomous ship could learn human-defined navigation rules.
The 3D environment was implemented in Unity by adding the heights at which the LiDAR simulator could detect the structures in the harbor, as shown in Figure 5. A 3D ship model was used in the experiment; the ship size was enlarged three times for the safety distance between ships and the simplification of the LiDAR data. Upon collision with obstacles, including neighboring ships, a penalty was provided immediately and the scenario ended. To extract the critic value, ε is fixed and the critic is evaluated before and after learning; comparing before (blue) and after (red) additional training, the critic value is higher with fewer actions after additional training.
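The collision handling described above (immediate penalty, episode termination) could be sketched as follows; the penalty and reward magnitudes and the function name are assumptions for illustration, since the paper does not state the values:

```python
COLLISION_PENALTY = -10.0  # assumed magnitude; not stated in the paper
GOAL_REWARD = 1.0          # assumed terminal reward for reaching the destination

def step_reward(collided, reached_goal):
    """Return (reward, done) for one simulation step.

    A collision with any obstacle, including neighboring ships,
    yields an immediate penalty and ends the scenario, as described
    in the experimental setup.
    """
    if collided:
        return COLLISION_PENALTY, True
    if reached_goal:
        return GOAL_REWARD, True
    return 0.0, False  # ordinary intermediate step
```

A shaping term for progress toward the destination would likely also be present in practice, but only the collision rule is given in the text.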



Training: Baseline Training (Path Finding)
Basic learning in the global space is implemented through the curriculum: the ship departs from the harbor toward a random destination set in the coastal waters. The training result is shown in Figure 6.

The autonomous ship then undergoes additional learning on the avoidance of dynamic obstacles, using different learning methods so that the learned weights can be compared: general RL in an environment with neighboring ships sailing around at random, and RL training with two types of inward ship cases using the ILFAS. The learning time was measured until the autonomous ship, avoiding neighboring ships, arrived at a destination 100 consecutive times. The reward is high when optimized navigation reaches the destination without collision. Episode length is the number of actions and is equal to steps; a low episode length means the goal was reached with an optimized number of actions.

Training: Learning Avoidance of Dynamic Obstacles (ILFAS vs. Random)
For the random RL, four inward neighboring ships were generated at random in international waters. The autonomous ship aimed to sail to a randomly generated destination while avoiding the neighboring ships, which used various routes. The autonomous ship started to converge at 25 million steps and finally stabilized at 50 million steps. Learning was implemented until the autonomous ship arrived at a destination 100 times in 100 attempts. The average step length (H) per episode was 810 (no. of learning episodes = total number of learning steps/average number of scenario-ending steps). The training result is shown in Figure 7.



For the ILFAS RL, the case-based curriculum system operates neighboring ships as human-defined general inward ships. The case is the inward ship curriculum for several ships of the kinds generally observed in harbors. Although the neighboring ships avoid collisions when collisions are predicted in the simulation environment using the ILFAS, the episode ends when a collision occurs. The autonomous ship started to converge at three million steps and stabilized from five million steps. The experiment aimed to arrive at a destination successfully 100 times in 100 attempts. The average episode length was approximately 700. The training result is shown in Figure 8.
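The episode counts implied by the reported numbers can be checked directly with the paper's own relation (no. of learning episodes = total learning steps / average episode length):

```python
def n_episodes(total_steps, avg_episode_len):
    """Apply the paper's relation: episodes = total steps / average episode length."""
    return round(total_steps / avg_episode_len)

# Random RL: stabilized at 50 million steps with average episode length H = 810.
random_eps = n_episodes(50_000_000, 810)

# ILFAS RL: stabilized from five million steps with average episode length about 700.
ilfas_eps = n_episodes(5_000_000, 700)

# ILFAS needed about a tenth of the learning steps of the random method.
step_ratio = 50_000_000 / 5_000_000
```

This is consistent with the reported tenfold reduction in learning time: roughly 61,728 episodes for the random RL versus roughly 7,143 episodes for the ILFAS.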
Learning results: Although the random RL gained various kinds of experience through the random inward neighboring ships, it required 10 times more learning time than the ILFAS before the autonomous ship arrived at a destination 100 times in 100 attempts. The average episode length (the ship control signal) was 100 steps more than that of the ILFAS. V(s) in Figure 9 demonstrates that the trajectory R(τ) stabilizes as autonomous navigation continues. While the random RL learned anew owing to catastrophic forgetting, the RL by the ILFAS performed additional learning based on the previous learning experience.

Experiment: Simple Traffic 1 (Marine Traffic Environment for Inward Ships)
After learning the avoidance of dynamic obstacles, the two RL algorithms were implemented and compared in a new environment. The new environment had neighboring ships that were not randomly created but followed the right path to comply with the human-defined navigation rules in the harbor. The experiment result is shown in Figure 10.

The learning results in Section 4.2 were applied to this experiment. Although the successful arrival rates at the destination were similar (90% in 100 episodes), the ILFAS showed a slightly higher success rate. The ILFAS had learned the intrinsic motivation that the right path was safe and therefore avoided the neighboring ships it encountered. The random RL did not know the navigation rule that neighboring ships enter the harbor using the left path; nevertheless, it selected the right path as its avoidance action. However, the two learning methods showed significant differences in controlling the autonomous ship: as shown in Figure 11, the ILFAS is stable in controlling the ship, whereas the control by the random RL fluctuates substantially.

Experiment: Simple Traffic 2 (Marine Traffic Environment of Outward Ships)
This experiment compared the operation of the autonomous ship among several outward ships without additional learning. In contrast to the experiment in Section 4.3, four neighboring ships departed from the harbor, and two ships were artificially placed in front of them. There is a gap between neighboring ships #3 and #2 wide enough for one ship to enter, and two additional ships departed after the last ship. Moreover, the neighboring ships slowed down to induce the autonomous ship to crash into other ships; to avoid collision with the neighboring ship in front, the autonomous ship slowed down. This experiment verifies whether the autonomous ship retains the learned experience of taking the right path, as specified in the human-defined navigation rules, even without neighboring ships entering the harbor on the left path. The experiment result is shown in Figure 12.
The learning results from Section 4.1 (the results of learning with the ILFAS and with the random RL) were applied to the two inward cases without additional learning; neither the ILFAS nor the random RL had learned this environment before. The experiment verified whether the autonomous ship could adapt to a marine traffic environment with neighboring ships departing from the harbor simultaneously.
Although the ILFAS, having learned with neighboring ships on the right path, had not sufficiently learned how to respond to obstacles ahead, its success rate to a destination was 10% higher than that of the random RL. The ILFAS complied with the human-defined navigation rules, maintained the right path where possible, and did not overtake the neighboring ships in front. The random RL, by contrast, did not keep its position among the outward ships: it stayed in the harbor and departed later, or departed using the empty space on the left. The graphs in Figure 13 compare the number of controls; the ILFAS shows better control stability than the random RL, despite the latter's ample experience.
Figure 13. Actions of the autonomous ship. The autonomous ship is controlled by [acceleration/deceleration] and [steering]. The random RL and the ILFAS RL are shown in red and green, respectively. A graph keeping direction without controlling the speed indicates stable control of the autonomous ship; the graph for the ILFAS also shows decreased control stability in the environment that was not learned.

Experiment: Complex Traffic (Complicated Marine Traffic Environment with Inward/Outward Ships)
The busy harbor conditions were implemented by adding inward and outward ships to the environment of Section 4.3. For a successful departure, the autonomous ship needs to control its speed properly between outward ships, or to depart from the harbor after entry into or departure from the harbor is completed. RL using the ILFAS cannot learn how to wait. By contrast, the random RL did not learn the navigation rules through neighboring ships, and there was no space on the left path to get ahead of other ships in the busy marine traffic environment. The experiment result is shown in Figure 14.

Figure 14. (a) Experiment using eight neighboring ships in the initial state, waiting for entry into or departure from the harbor. (b) The autonomous ship sails on the right owing to ILFAS learning; however, the arrival rate at the destination decreased because of poor speed control. The autonomous ship departed in compliance with the human-defined navigation rules. (c) The random RL either waits for all of the outward ships or goes ahead of inward or outward ships; its autonomous ship departs rapidly after waiting in the harbor, because collision is predicted when the harbor gets busy. (d) The successful arrival rate at the destination over 100 episodes falls to 53% for the ILFAS (gray) and 30% for the random RL (red).
The busy marine traffic circumstances created by the ILFAS have a high degree of difficulty induced by the presence of inward and outward ships. The circumstances are the same as those of the high-difficulty curriculum with a total of eight neighboring ships, because two cases are implemented simultaneously in the ILFAS. The autonomous ship could smoothly sail among outward ships while keeping to the right as usual. In the experiment, although the autonomous ship trained with the ILFAS RL proceeded and showed the expected actions, in 30% of the episodes it crashed into the ship in front, failing to control its speed.
The autonomous ship using the random RL learned only shortest-time path finding, as is typical of random learning. The successful arrival rate at the destination was low because, compared with the experiment in Section 4.3, there was no space to get ahead: inward ships occupied the left, and outward ships moved slowly on the right front. The experiment indicated that it is difficult to induce an autonomous ship to learn human-defined navigation rules through unorganized random learning.

Learning and Experiment Results
Because RL algorithms cannot consider all situations, learning for collision avoidance in unmanned ships is commonly augmented by adding noise to the environment, such as to sensors and cameras, or by creating environments with random rules [32,33]. However, random learning struggles to solve the navigation issues of autonomous ships when dynamic obstacles occupy most of the environment and certain rules must be followed. Furthermore, even when the rules to be learned are included in the environment, the random learning method requires excessive time to learn them.
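The randomized-learning baseline discussed above can be illustrated with a minimal sketch. This is our own illustration, not code from the paper or from [32,33]; the function name, the flat-list observation format, and the noise level are assumptions made for the example.

```python
import random

def add_sensor_noise(observation, sigma=0.05, rng=random):
    """Randomized-learning baseline: perturb each sensor reading with
    Gaussian noise so the policy does not overfit one exact environment.
    `observation` is assumed to be a flat list of normalized readings."""
    return [x + rng.gauss(0.0, sigma) for x in observation]

# Hypothetical usage: a clean four-beam range-sensor scan.
scan = [0.9, 0.4, 0.7, 1.0]
noisy = add_sensor_noise(scan, sigma=0.05)
```

Noise of this kind diversifies experience, but, as noted above, it cannot by itself teach rule-following behavior such as keeping to the right or waiting for other ships.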
This study demonstrates that when the autonomous ship learns autonomous navigation using the ILFAS, unnecessary experience is eliminated and the learning results are more stable than with general random learning. Additional learning can be performed in a new environment based on previous experience; the learning time was reduced, and consistent learning results were acquired. Furthermore, the autonomous ship can learn human-defined navigation rules using the ILFAS. To generalize the learning method and verify the learning results, the ILFAS demonstrated better results in autonomous navigation in the new environment and relatively stable results in ship control.
Additional learning for the complex-traffic environment in Section 4.5 was performed. Based on the successful arrival rate at waypoints, learning was completed after only approximately 3400 episodes. Furthermore, because the intelligent neighboring ships sailed using a multi-agent RL algorithm, slight differences were always present in the gaps between ships and in their actions. The autonomous ship obtained sufficient learning data from these variations and sailed to the right among outward or standby ships in the complicated marine traffic environment. Even when the autonomous ship waited in the harbor, the neighboring ships reacted to it; thus, the autonomous ship could learn how to properly wait in the harbor before departure.
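The completion criterion mentioned above, ending a learning stage based on the successful arrival rate, can be sketched as follows. This is a minimal illustration under our own assumptions (the class name, window size, and threshold are hypothetical; the paper does not specify them).

```python
from collections import deque

class CurriculumGate:
    """Sketch of a stage-completion rule: a curriculum stage counts as
    learned once the success rate of arrivals, over a sliding window of
    recent episodes, reaches a threshold."""

    def __init__(self, window=100, threshold=0.9):
        self.results = deque(maxlen=window)   # 1 = arrived, 0 = failed
        self.threshold = threshold

    def record(self, arrived: bool) -> bool:
        """Record one episode outcome; return True when the stage is done."""
        self.results.append(1 if arrived else 0)
        full = len(self.results) == self.results.maxlen
        return full and sum(self.results) / len(self.results) >= self.threshold

# Hypothetical usage: the first episode fails, the rest succeed.
gate = CurriculumGate(window=10, threshold=0.9)
done = False
for episode in range(12):
    done = gate.record(arrived=(episode > 0))
```

A sliding window keeps the criterion responsive to recent performance, so an early string of failures does not permanently block stage completion.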

Conclusions
This study has attempted to solve the learning environment issues that arise when applying RL to autonomous ships. If the environment is not intelligent, then even with a great expert and a distinguished algorithm, an autonomous ship that must adapt itself to the real environment cannot avoid facing limits in learning. In particular, if the neighboring ships are not intelligent, then research on intelligent autonomous ships is limited. Moreover, finding an expert or data that covers all possible issues, and having the autonomous ship learn from such data, requires too much time and cost.
Therefore, this study aims to implement an intelligent environment that enables autonomous ships to acquire sufficient experience using RL in the general marine environment and to learn events that humans cannot estimate by themselves. Furthermore, this study proposes the ILFAS to investigate whether an autonomous ship can learn everything from general issues to inherent human norms that cannot be presented numerically. This paper presents one solution to the insufficient-environment problem of RL and to transferring human knowledge and experience to autonomous ships. It can reduce RL learning time and lower the cost of building a learning environment. In conclusion, we built an intelligent learning framework that can obtain the learning results expected by humans in a short time and at low cost. The learned hierarchical RL policies of the layered autonomous ship can be reused: when an autonomous ship with the same control type is applied to another environment, only the navigation layer among the mission, navigation, and control layers needs to be relearned, and the other layers can be reused.
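The layer-reuse idea described above can be made concrete with a short sketch. The class and method names are our own hypothetical illustration of the mission/navigation/control layering; the paper does not publish an implementation.

```python
class Layer:
    """One policy layer of the layered autonomous ship."""
    def __init__(self, name, trainable=True):
        self.name = name
        self.trainable = trainable

class LayeredAgent:
    """Sketch of the mission / navigation / control stack."""
    def __init__(self):
        self.layers = {n: Layer(n) for n in ("mission", "navigation", "control")}

    def transfer_to_new_environment(self):
        """Reuse mission and control layers for a ship of the same control
        type; only the navigation layer is marked for relearning."""
        for layer in self.layers.values():
            layer.trainable = (layer.name == "navigation")

# Hypothetical usage: moving a trained agent to a new harbor environment.
agent = LayeredAgent()
agent.transfer_to_new_environment()
trainable = sorted(n for n, l in agent.layers.items() if l.trainable)
```

Freezing the mission and control layers corresponds to the claim that only the navigation part must be relearned when the ship's control type is unchanged.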
In the field of defense, an example of an opponent's naval infiltration strategy and tactics could be built. In the civil sector, this environment could be applied to the delivery of emergency medical supplies, similar to the experiment in this paper; here, however, we focused on learning human-defined rules. Research should be conducted continuously to increase the effectiveness of autonomous ships acting on behalf of humans in situations such as harsh natural environments and bad weather.