Scenario-Based Marine Oil Spill Emergency Response Using Hybrid Deep Reinforcement Learning and Case-Based Reasoning

Abstract: Case-based reasoning (CBR) systems often provide a basis for decision makers to make management decisions in disaster prevention and emergency response. For decades, many CBR systems have been implemented by using expert knowledge schemes to build indexes for case identification from a case library of situations and to explore the relations among cases. However, a knowledge elicitation bottleneck occurs for many knowledge-based CBR applications because expert reasoning is difficult to precisely explain. To solve these problems, this paper proposes a method using only knowledge to recognize marine oil spill cases. The proposed method combines deep reinforcement learning (DRL) with strategy selection to determine emergency responses for marine oil spill accidents by quantification of the marine oil spill scenario as the reward for the DRL agent. These accidents are described by scenarios and are considered the state inputs in the hybrid DRL/CBR framework. The challenges and opportunities of the proposed method are discussed considering different scenarios and the intentions of decision makers. This approach may be helpful in terms of developing hybrid DRL/CBR-based tools for marine oil spill emergency response.


Introduction
Oil spills have become one of the most severe marine ecological disasters worldwide. With oil imports exceeding 420 million tons in 2017, China surpassed the United States as the world's largest oil importer for the first time. As a large amount of oil is imported by sea transportation, oil spills frequently occur in China, threatening China's marine fishery, coastal environment and coastal cities. Providing a rapid response following marine oil spill emergencies has therefore received increasing attention. After an accident occurs, a direct and effective approach is to quickly retrieve similar historical cases with intelligent methods and then assist decision makers in quickly formulating emergency response plans to cope with the current emergency based on historical experience. Case-based reasoning (CBR) systems compare a new problem to a library of cases and adapt a similar library case to the problem, thereby producing a preliminary solution [1]. Since CBR systems require only a library of cases with successful solutions, such systems are often used in areas lacking a strong theoretical domain model, such as diagnosis, classification, prediction, control and action planning. CBR has been applied to help improve cost-efficiency control during infrastructure asset management in developing countries by estimating costs through retrieving and comparing the most similar instances across a case library [2]. Additionally, farmers have been provided with advice about farming operation management at a high case retrieval speed based on the associated representation method [3].

1. A hybrid method using deep reinforcement learning (DRL) and CBR is proposed to produce a preliminary solution for marine oil spill emergencies.
2. To address the uncertainty of marine oil spill accidents, a preprocessing step that constructs a marine oil spill scenario tree is employed, and the scenario is also used to represent historical cases in our CBR system.
3. Reward functions are considered based on different decision intentions to support decision making; this approach may be helpful for improving the level of oil spill emergency response.
The remainder of this paper is organized as follows. Section 2 presents a brief introduction to the fundamental theory of the proposed framework. Section 3 shows the experimental results to verify the effectiveness of the scenario-based hybrid DRL/CBR method. Finally, a brief discussion is given, and the study conclusions and proposed future work are discussed.

Materials and Methods
Appl. Sci. 2020, 10, 5269

CBR is defined as the process of reusing experiences to deal with current situations that are similar to ones solved and stored previously [10], and the foundation of the CBR system is the representation and definition of a case. We consider marine oil spill emergency response tasks in which a decision maker addresses marine oil spill accidents and makes decisions based on comparisons with historical data by using similarity measurements to identify a relevant past case. At each time step, the decision maker selects an emergency response action $a$ from the set of legal marine oil spill emergency response actions $A$ and receives feedback as a reward $r_t$, which represents the result of the emergency response action at step $t$. Note that the emergency response result depends on the entire prior sequence of actions; feedback about an action can only be received after many time steps have elapsed. Therefore, we consider sequences of actions and observations, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, and learn the actions that depend on these sequences, which represent the internal state of the marine oil spill observed by the decision maker. This state is a vector of values $x$ representing the current status of the oil spill. All the sequences in the emulator are assumed to terminate after a finite number of time steps. This condition gives rise to a large but finite Markov decision process (MDP) [15,16] in which each sequence is a distinct state.
The framework of our approach to scenario-based hybrid CBR/DRL is shown in detail in Figure 1. Scenario analysis provides an approach for addressing unknown but related problems based on marine oil spill historical cases. The CBR [17,18] method provides retention, retrieval, reuse and revision of scenario analysis results, which is formalized as a four-step process [19]. Three of these steps are implemented with the DQN algorithm.

• Retention. Scenario analysis is employed to address marine oil spill accident uncertainties, such as spill magnitude uncertainties and the uncertainties related to spill accident evolution. Each individual historical case can be represented as a detailed "chain of consequences", which is named the scenario chain in this paper. Through the cluster algorithm, similar scenario instances can be merged as a typical scenario, which consequently expands the scenario and forms a branch to construct scenario trees. Through scenario analysis, marine oil spill cases are stored as scenario instances and scenario trees in the scenario library.
• Retrieval. When applying cases to train the proposed hybrid CBR/DRL model, the scenario library is considered as an environment for the agent to explore, and each marine oil spill scenario instance is regarded as a state of the environment. Thus, each instance is a vector composed of features representing the marine oil spill scenario.
• Reuse. The agent chooses the action with the highest expected value using the ε-greedy strategy. With probability ε, the algorithm chooses an action based on the available knowledge, and with probability 1 − ε, a random action is selected [20].
• Revision. The revision phase uses the DQN to update the utilities Q for the actions a chosen by the agent. Eligibilities represent the cumulative contributions of individual state and action combinations in previous time steps.
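The Reuse step can be illustrated with a short sketch. It follows the convention stated above (ε is the probability of acting on current knowledge, 1 − ε the probability of a random action); note that many RL references use the opposite convention, with ε as the exploration probability. The function name `select_action` and the example Q values are hypothetical and are not taken from the paper.

```python
import numpy as np

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy selection using the convention stated in the text:
    with probability epsilon, exploit (pick the action with the highest Q);
    with probability 1 - epsilon, pick an action uniformly at random."""
    if rng.random() < epsilon:
        return int(np.argmax(q_values))
    return int(rng.integers(len(q_values)))

# Hypothetical Q values for four actions (illustrative only)
rng = np.random.default_rng(0)
q = np.array([0.2, 0.7, 0.1, 0.4])
chosen = select_action(q, epsilon=0.9, rng=rng)
```

With `epsilon=1.0` the function always returns the greedy action; lowering `epsilon` increases the share of random actions.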

Marine Oil Spill Scenario and Scenario Tree Construction Method
A marine oil spill historical case can be divided into multiple scenarios according to its evolution. Each marine oil spill scenario can be described from the following three aspects: hazard, exposure and human behavior [21]. Since human behavior can strongly affect the results of a disaster, for example, due to the effective implementation of preparedness actions such as evacuation and rescue procedures, it is considered as a controllable driver of the development branch of oil spills. The hazard is the time-space distribution of the intensity of a given marine oil spill accident with an assigned occurrence probability at a given time and in a given geographical area. The exposure is the distribution of the probability that a given element (including people, buildings, infrastructure, the economy, or the environment) is affected by a disaster. In this paper, an oil spill scenario can be represented by a set of scenario elements as $S = \{E_1, E_2, \ldots, E_n\}$, $n \in \mathbb{N}^+$, where $E_i$ is a scenario element instance that alternates in type between hazard and exposure. The scenario element instance $E_i = (T_1, T_2, \ldots, T_m)$, where $m \in \mathbb{N}^+$, is a vector composed of features, where $T_j$ represents an attribute of a scenario element instance; such attributes may include the tonnage of the oil tanker and the amount of spilled oil, as shown in Table 1. In this case, the scenario instance can be represented as

$$S = \begin{bmatrix} T_{11} & T_{12} & \cdots & T_{1m} \\ \vdots & \vdots & & \vdots \\ T_{n1} & T_{n2} & \cdots & T_{nm} \end{bmatrix} \qquad (1)$$

where row $i$ holds the features of scenario element $E_i$.

An emergency response scenario is not a typical case, and the core of this approach is to identify instances with similar characteristics. The similar scenario instances are merged into a typical scenario, and consequence scenario instances are linked to the typical scenario. Thus, the expanded branches express the uncertainties of the evolution of marine oil spill accidents.
The k-means [22] algorithm is employed to find similar scenario instances by minimizing the squared error, since the marine oil spill is represented as a numeric scenario matrix (dimensions 9 × 13):

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - u_i \rVert_2^2 \qquad (2)$$

where $u_i$ is the mean vector of cluster $C_i$. A new scenario chain extracted from a marine oil spill case is first decomposed into scenarios based on the corresponding relationships. As the scenario chain increases in size, some similar scenarios can be merged, and as child scenarios are connected, the chain is extended to become a scenario tree. A new scenario is linked to the existing scenario tree node only if the distance to the closest cluster is larger than the threshold parameter τ. Thus, τ acts as a mechanism for controlling the density of the scenario instance. If a case cannot be linked to an existing scenario tree, the scenario chain is regarded as an independent initial scenario tree template and added to the scenario library. These branches generally form because of human behavior changes, thus providing significant and intuitive help for decision making.
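As an illustration of the clustering step, the following is a minimal sketch of k-means (Lloyd's algorithm) applied to scenario matrices flattened to vectors, returning the squared error that the algorithm minimizes. The data are synthetic, the initialization (first k points as centres) is a simplifying assumption, and the function name `kmeans` is hypothetical; the paper does not specify its implementation.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm. Returns (centroids, labels, squared_error)."""
    points = np.asarray(points, dtype=float)
    centroids = points[:k].copy()  # assumption: first k points as initial centres
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:
                new_centroids[i] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and the summed squared distance to assigned centroids
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    squared_error = float(d2[np.arange(len(points)), labels].sum())
    return centroids, labels, squared_error

# Synthetic data: six scenario instances, each a 9 x 13 matrix flattened to 117 values
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(3, 9, 13)).reshape(3, -1)
group_b = rng.normal(5.0, 0.1, size=(3, 9, 13)).reshape(3, -1)
instances = np.empty((6, 117))
instances[0::2] = group_a  # interleave so the first two points differ in group
instances[1::2] = group_b

centroids, labels, err = kmeans(instances, k=2)
```

Each centroid plays the role of the mean vector of a cluster, that is, a candidate typical scenario; the distance from a new instance to its closest centroid is the quantity compared against τ.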

Hybrid DRL/CBR Method for Marine Oil Spill Emergency Response
In this research, a marine oil spill emergency response is modeled as an MDP, and the policy is trained by the DQN algorithm using CBR. The CBR system provides an environment for reinforcement learning (RL) agent exploration. Many RL algorithms have been developed to learn approximations of an optimal action based on agent experience in a given environment. The return function is defined in the MDP as follows:

$$R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'} \qquad (3)$$

where future rewards are discounted by a factor $\gamma$ per time step $t$ with a start state $s_0 \in S$, and $T$ is the time step at which the sequence terminates. A state $s \in S$ is a vector composed of features representing a marine oil spill scenario, and $r$ is the reward for the current emergency response action. The DQN uses experience to learn value functions that map state-action pairs to the maximal expected reward that can be achieved for a given state-action pair. The optimal action value function $Q^*(s, a)$ is defined as the maximum expected return achievable by following any strategy after a state $s$ is reached and an action $a$ is taken:

$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s, a_t = a, \pi \right] \qquad (4)$$

where $\pi$ is a policy that maps spill scenarios to emergency response actions; emergency response action $a \in A$, and $A$ is a list of possible marine oil spill emergency response actions decision makers can take for the current spill scenario. Equation (5) shows that the optimal value function $Q^*(s, a)$ gives the maximum emergency response action value for spill scenario $s$ and action $a$ achievable by policy $\pi$:

$$Q^*(s, a) = \sum_{s'} P^{a}_{s \to s'} \left( R^{a}_{s \to s'} + \gamma \max_{a'} Q^*(s', a') \right) \qquad (5)$$

where $P^{a}_{s \to s'}$ is the transition probability and $R^{a}_{s \to s'}$ is the reward received when state $s$ transitions to $s'$. This equation is in agreement with the following intuition: the optimal strategy involves selecting the emergency response action $a'$ that maximizes the expected value, which is a $\gamma$-related cumulative reward function when the optimal value $Q^*(s', a')$ of the subsequent spill scenario $s'$ at the next time step is known for all possible emergency response actions $a'$. The optimal action value function obeys an important identity known as the Bellman optimality equation, which can also be used as an iterative updating formula with a learning rate parameter $\alpha$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \qquad (6)$$

The Q-network is a neural network with weights $\theta$ used as a function approximator to estimate the action value function, $Q(s, a; \theta) \approx Q^*(s, a)$. A Q-network can be trained by minimizing a sequence of loss functions $L_t(\theta_t)$ that changes at each iteration $t$,

$$L_t(\theta_t) = \mathbb{E}_{s, a \sim \rho(\cdot)} \left[ \left( y_t - Q(s, a; \theta_t) \right)^2 \right] \qquad (7)$$

where $y_t = r + \gamma \max_{a'} Q(s', a'; \theta_{t-1})$ is the target for iteration $t$ and $\rho(\cdot)$ is a probability distribution over sequences of oil spill scenarios $s$ and emergency response actions $a$.
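The iterative updating formula with learning rate α can be made concrete with a small tabular sketch. This is a simplified illustration with a lookup table in place of the Q-network; the function name `q_update`, the `terminal` flag and all numbers are assumptions made for the example.

```python
import numpy as np

def q_update(q, s, a, r, s_next, alpha, gamma, terminal=False):
    """One tabular step of
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    `terminal` (an assumption of this sketch) drops the bootstrap term
    when the sequence ends at s_next."""
    best_next = 0.0 if terminal else float(np.max(q[s_next]))
    td_target = r + gamma * best_next
    q[s, a] += alpha * (td_target - q[s, a])
    return float(q[s, a])

# Hypothetical table with 2 states and 3 actions
q = np.zeros((2, 3))
q[1] = [0.5, 2.0, 1.0]
new_value = q_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
# target = 1.0 + 0.9 * 2.0 = 2.8, so Q(0,1) moves from 0 to 0.5 * 2.8 = 1.4
```

The same temporal-difference target appears as $y_t$ in the network loss, where the table lookup is replaced by the Q-network output.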
In this paper, the model is trained with an actor-critic strategy [23]. The actor selects a behavior based on probability, and the critic estimates performance based on the actor. The critic is trained at every step, and the actor synchronizes with the parameters of the critic model after a specific number of steps. The neural networks of the actor and the critic each consist of a nine-layer convolutional neural network that serves as the state function approximator. The input to the neural network is a vector of the oil spill scenario instance. After each step of the exploration, we calculate the Q values corresponding to the current state and action using (6), and (7) is applied to calculate the loss and update the critic model parameters from the previous iteration $\theta_{t-1}$, which are fixed when optimizing the loss function $L_t(\theta_t)$. The approximator input and output are shown in Figure 2.
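The training schedule described above (critic updated at every step, actor synchronized periodically) can be sketched as follows. To keep the example short, a linear approximator replaces the nine-layer convolutional network, and the target is computed from the actor's frozen parameters; both choices, together with the class name `SyncedLinearQ` and its hyperparameters, are assumptions of this sketch rather than details from the paper.

```python
import numpy as np

class SyncedLinearQ:
    """Linear Q(s, a) = w[a] . s with a periodically synchronized copy."""

    def __init__(self, n_features, n_actions, lr=0.1, gamma=0.9, sync_every=5):
        self.critic = np.zeros((n_actions, n_features))  # trained every step
        self.actor = self.critic.copy()                   # frozen between syncs
        self.lr, self.gamma, self.sync_every = lr, gamma, sync_every
        self.steps = 0

    def train_step(self, s, a, r, s_next):
        # Target uses the frozen parameters, which stay fixed during this update
        y = r + self.gamma * np.max(self.actor @ s_next)
        td_error = y - self.critic[a] @ s
        # Gradient step on the squared error (y - Q(s, a))^2 / 2
        self.critic[a] += self.lr * td_error * s
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.actor = self.critic.copy()  # periodic synchronization
        return float(td_error)

model = SyncedLinearQ(n_features=2, n_actions=2, sync_every=3)
state = np.array([1.0, 0.0])
first_error = model.train_step(state, a=0, r=1.0, s_next=state)
```

Keeping the target parameters fixed between synchronizations prevents the regression target from shifting at every gradient step.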
The three components based on the hybrid DQN and CBR method for the oil spill emergency response model are detailed as follows:
• State. A marine oil spill scenario instance can be regarded as a state, which is a vector composed of features representing marine oil spill accidents that have been stored in the CBR system. The scenario instance and typical scenario are represented according to Equation (1).

• Reward. An interaction occurs between the observed marine oil spill scenario and the step-by-step process of decision making in a discrete time series. If the emergency response action makes the next scenario safer, the reward of the step is close to 1, and other actions yield reward values close to 0. To reflect the severity of a marine oil spill accident, Dutch scholar W. Koops proposed a DLSA evaluation model for oil spills that used nine individual indicators to analyze oil spill pollution [24]. In the DLSA model, the indicator weights are given by expert knowledge. Human experts, whose time is valuable and scarce, often find it difficult to precisely explain their reasoning. In 1948, the problem of information quantification was solved through the concept of information entropy, which was proposed by Shannon. Based on traditional information entropy, Chen et al. defined the concept of unconventional emergency scenario-response multidimensional entropy [25].
In combination with information theory, we believe that low-probability events that occur during oil spill accidents are important to consider due to our insufficient understanding of these events and the unpredictability of the corresponding risk. In contrast, for accidents with high probability, due to the relatively sufficient knowledge of the corresponding events, response actions can be taken based on the known threat of the accident. In this paper, we consider the quantity of spilled oil, vessel characteristics, sea area, and sea conditions as factors that influence the severity of marine oil spill accidents. In addition, information entropy is employed to assist in measuring the severity of marine oil spill scenarios, instead of using expert knowledge. The eleven indicators considered can be matched among marine oil spill scenario instances. The indicator of scenario instance $I$ obeys the distribution $\rho$, and $P(I)$ is the probability that the indicator takes the value $I$. The entropy of a marine oil spill scenario is defined in terms of $P(I)$ and $g(I)$, where $g$ is the mapping function from indicator $I$ to the risk level; the details of this function are given in Appendix A. In this paper, we regard the severity of a marine oil spill scenario as a binary state that is safe or unsafe. Therefore, we use the sigmoid function [26] to calculate the severity of marine oil spill scenarios as the reward function $R$, where $R \in (0, 1)$. A value of $R$ close to 1 means that the evolution of the marine oil spill accident tends to become increasingly safe. When the value of $R$ is close to 0, the evolution of an accident can gradually become out of control, and the situation can become unsafe. The 11 indicators used in this paper are shown in Table 2.
• Action. From the branches of scenario trees and the International Tanker Owners Pollution Federation Limited (ITOPF) technical information papers, we developed a relatively comprehensive response action set for marine oil spill emergencies, which can be divided into three categories, as shown in Table 3. In this paper, one-hot coding [27] is employed to digitize discrete and unordered features; this approach uses an N-bit status register to encode N states. The number of marine oil spill emergency response actions is 15. For example, the action "use of booms" can be encoded as [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], and "use of dispersants" can be encoded as [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
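The one-hot coding of the 15 actions can be reproduced with a few lines. The positions 0 ("use of booms") and 1 ("use of dispersants") follow the two examples in the text; the function name `one_hot` is hypothetical, and the positions of the remaining actions are not given in this section.

```python
def one_hot(index, n_actions=15):
    """Return an n_actions-long list with a single 1 at position `index`."""
    if not 0 <= index < n_actions:
        raise ValueError("index must be in [0, n_actions)")
    vec = [0] * n_actions
    vec[index] = 1
    return vec

use_of_booms = one_hot(0)        # [1, 0, 0, ..., 0]
use_of_dispersants = one_hot(1)  # [0, 1, 0, ..., 0]
```

Because every pair of distinct codes differs in exactly two positions, the encoding does not impose an artificial ordering on the actions.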

The Training of the Action Policy Selection Process in Marine Oil Spill Emergency Response
The proposed methodology is intended to train the action policy selection process in emergency response by fully using historical marine oil spill cases to maximize the cumulative reward and reduce the risk of accidents. In this study, the policy selection method was trained based on information from 55 spills recorded since 1967. The data for these spills were mainly collected from ITOPF, Wikipedia and specific websites. The selected historical case names are listed in Appendix B.
In our experiment, we assumed that 10 consecutive emergency response actions should be taken in one epoch; that is, the policy provides recommended actions for 10 marine oil spill instances. The experimental results include the cumulative reward and accuracy of the training models at 300, 500 and 900 total epochs. The training curves are shown in Figure 3.

The experimental results show that with an increasing number of training epochs, the cumulative reward and accuracy of the model increase. Specifically, 300 and 500 training epochs are inadequate for training, but the reward and accuracy tend to be smooth and steady after 800 epochs. According to the optimal response policy given by the trained model, the cumulative reward theoretically reaches 7.2. Based on the results of training, we hypothesize that the application of the hybrid DRL/CBR model can assist decision makers in determining the best marine oil spill emergency response by providing effective countermeasures.

Comparison of Hybrid Application Results and Similarity Matching Results
To support emergency decision making, the trained decision model takes the vector of an oil spill scenario as input and outputs the Q value corresponding to each action. Generally, the higher the Q value is, the more likely the model is to suggest the corresponding action to the decision maker. To verify the feasibility of the proposed method, four typical marine oil spill scenarios (five scenario instances) are selected in this section, as shown in Table 4. Using these five scenario instance vectors as inputs, the outputs of the state-action value curve are shown in Figure 4.

| Scenario instance | Description |
|---|---|
| Oil spill scenario-BRAER | Set spilled oil amount. Sea condition parameter values set to "normal". Scenario instance extracted from the case "BRAER". |
| Oil spill scenario-TANIO | Set spilled oil amount. Sea condition parameter values set to "dangerous". Scenario instance extracted from the case "TANIO". |
| Marine organism death scenario | Assume the spilled oil has been cleaned up. Sea condition parameter values set to "normal". Scenario instance extracted from the case "BRAER". |

Figure 4a shows an oil tanker collision scenario instance under normal sea conditions. The optimal emergency response action suggested by the model is "use of mechanical recycling and sorbent materials". From the results, the Q value of the optimal action is not far from the Q value of other emergency response actions, including the "use of booms" and the "use of dispersants". Additionally, in such a tanker collision scenario, all potential actions can be implemented at once.

Figure 4b shows an example of a tanker fire scenario under normal sea conditions. The best recommendation given by the model is "extinguishing the fire", and the Q value for selecting a firefighting emergency response action is much higher than that of other emergency response actions. This recommendation is consistent with the experience from the SEA STAR accident: in that historical case, the oil tanker exploded during recovery because the fire had not been extinguished, which led to the ship sinking in the Gulf of Oman.

Figure 4c shows the results for two oil spill scenarios. When the sea conditions are normal, various methods for remediating spilled oil are recommended. In contrast, only "mechanical recycling" is recommended under rough sea conditions because oil booms lose efficacy under high wave conditions and dispersants are ineffective in low-temperature water. However, in the case of rough sea conditions, the optimal emergency response action given by the model is "stopping ship leaks", with the Q value of the action being much higher than that for other actions, which seems unreasonable. Therefore, it is essential to further optimize the values of the indicators used to assess scenarios in the future.

Figure 4d represents a biological impairment scenario in marine environments that leads to organism death. The optimal recommendation given by the model is "shut down sensitive resources", such as affected economic facilities and fish farms. The other recommendations include "spontaneous recovery" and "biological recovery".
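Turning the model's Q values into a short list of suggestions, as in the discussion of Figure 4, amounts to ranking actions by Q value. The sketch below assumes the Q values are available as an array; the numbers are hypothetical and only mimic the qualitative pattern described for Figure 4a, and the function name `rank_actions` is not from the paper.

```python
import numpy as np

def rank_actions(q_values, action_names, top_k=3):
    """Return the top_k (action name, Q value) pairs, highest Q first."""
    order = np.argsort(q_values)[::-1][:top_k]
    return [(action_names[i], float(q_values[i])) for i in order]

# Hypothetical Q values (illustrative only, not results from the paper)
names = [
    "use of booms",
    "use of dispersants",
    "use of mechanical recycling and sorbent materials",
    "extinguishing the fire",
]
q = np.array([0.61, 0.58, 0.66, 0.10])
suggestions = rank_actions(q, names, top_k=3)
```

A small gap between the first and later entries, as in these illustrative values, corresponds to the case where several actions are reasonable alternatives; a large gap corresponds to a single dominant recommendation.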
Emergency response actions can also be obtained by scenario instance similarity matching from historical cases in CBR systems. As a comparison, the matrix of typical scenarios is used for similarity calculation, and a historical scenario instance is matched when the Euclidean distance is less than τ (defined in Section 2.1). The results are compared in Table 5.

Table 5. Comparison of the two methods in typical oil spill emergency response action suggestion.

| Scenario Instance | Scenario Similarity Matching | Scenario-Based Hybrid DRL/CBR |
|---|---|---|
| Tanker collision scenario | "Firefighting and fire extinction" | "Use of booms"; "Use of dispersants"; "Use of mechanical recycling and sorbent materials"; "Firefighting and fire extinction" |
| Tanker fire scenario | "Firefighting and fire extinction" | "Firefighting and fire extinction" |
| Oil spill scenario-BRAER | None | "Use of booms"; "Use of dispersants"; "Use of mechanical recycling and sorbent materials" |
| Oil spill scenario-TANIO | "Cleaning spilled oil" methods are not recommended | "Stopping ship leaks" |
| Marine organism death scenario | None | "Shut down sensitive resources" |

From the results, the proposed method clearly provides richer emergency response action suggestions for the decision maker. Because we changed the sea condition parameters in the oil spill scenario instances "Oil spill scenario-BRAER" and "Marine organism death scenario", they do not match appropriate scenario instances in the existing CBR system, which would need to be revised according to expert knowledge. Moreover, the suggestions of the proposed method have a clear decision intention: to reduce the severity of oil spills. In general, the application results show that the optimal emergency response model trained to reduce the severity of oil spills can provide a variety of reasonable response actions for decision makers and aid in making decisions during marine oil spill emergencies.
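The similarity matching baseline used for the comparison can be sketched as a nearest-neighbor lookup with a distance threshold. The sketch assumes each scenario is a 9 × 13 numeric matrix and uses the Euclidean (Frobenius) distance between matrices; the function name `match_scenario`, the toy library and the value of `tau` are hypothetical.

```python
import numpy as np

def match_scenario(query, library, tau):
    """Return (index, distance) of the nearest stored scenario if its
    Euclidean distance to `query` is less than tau, otherwise None."""
    if len(library) == 0:
        return None
    query = np.asarray(query, dtype=float)
    # np.linalg.norm of a matrix difference is the Frobenius norm,
    # i.e. the Euclidean distance between the flattened matrices
    dists = [float(np.linalg.norm(query - np.asarray(m, dtype=float)))
             for m in library]
    best = int(np.argmin(dists))
    return (best, dists[best]) if dists[best] < tau else None

# Toy library of two stored scenario matrices (illustrative values)
library = [np.zeros((9, 13)), np.full((9, 13), 2.0)]
near = np.zeros((9, 13))
near[0, 0] = 0.1                      # differs from library[0] in one entry
far = np.ones((9, 13))                # equally far from both stored scenarios

hit = match_scenario(near, library, tau=1.0)
miss = match_scenario(far, library, tau=1.0)
```

A `None` result corresponds to the "None" entries in Table 5, where a modified scenario instance has no sufficiently close historical counterpart.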

Discussion
When using DQN to solve MDP problems, an unsuitable reward function design may cause the algorithm to take an extremely long time to converge or to fail to converge at all. In this study, 11 indicators were selected to reflect the severity of marine oil spill accidents, with the decision intent of reducing the risk of oil spills. The reward function is regarded as an expression of the decision intent, and the reward takes a value R ∈ (0, 1) after each emergency action. Similarly, we constructed another reward function to measure marine biosafety by selecting fixed indicators that capture whether an oil spill is close to the shore, a fishery farm, a reef or an important habitat type. The intent of this reward function is to optimally protect marine life. The reward function R(x) is defined in terms of min(d₁, d₂, d₃, d₄), the minimum distance between the spilled oil and these four selected locations in a scenario instance, and τ, the threshold parameter used to indicate that the spilled oil is approaching a biologically sensitive resource.

The model was retrained with the new reward function, and the results were applied to oil spill scenario instances extracted from the "BRAER" and "TANIO" cases, as shown in Figure 5. Figure 5a shows the result for the oil spill scenario instance in which the tanker "BRAER" was grounded at Garths Ness, with oil flowing into the sea from the moment of impact. The "shut down sensitive resources" action was selected because the oil spill occurred near the shore. This action was also taken in the historical "BRAER" case, thus providing positive feedback for model training. Figure 5b shows the result for the oil spill scenario instance from the "TANIO" case; this vessel broke into two pieces during violent weather conditions off the coast of Brittany, France.
The results show that the new model seems completely insensitive to sea conditions, potentially because the reward function ignores sea condition indicators when calculating the reward.
Figure 5. Results of the retrained model with the new reward function. (a) Q value of the emergency response action for the oil spill scenario instance from the case "BRAER"; (b) Q value of the emergency response action for the oil spill scenario instance from the case "TANIO".
Another aspect that may limit the quality of the model is the number of states the agent observes while exploring the environment. From the 55 selected historical cases, a total of 193 oil spill scenario instances were extracted, which is far from enough for the DQN agent to explore. To address this problem, scenario elements were exchanged among some scenario instances in the same cluster to generate more than 800 new scenario instances for experience replay in DQN training. More marine oil spill cases still need to be collected to improve the quality of the response suggestions.
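The exchange of scenario elements within a cluster can be sketched as follows. This is only an illustration under stated assumptions: each scenario instance is represented as a dictionary of element names and values, one element at a time is swapped between every pair of instances, and duplicates are discarded. The function name `exchange_elements`, the element names and the example values are hypothetical; the paper does not specify the exact exchange procedure.

```python
from itertools import combinations

def exchange_elements(cluster):
    """Generate new scenario instances by copying one element at a time
    between every pair of instances belonging to the same cluster."""
    # Signatures of instances already known, used to discard duplicates.
    seen = {tuple(sorted(inst.items())) for inst in cluster}
    generated = []
    for a, b in combinations(cluster, 2):
        for key in sorted(a.keys() & b.keys()):
            # Copy the element in both directions: a -> b and b -> a.
            for src, dst in ((a, b), (b, a)):
                new = dict(dst)
                new[key] = src[key]
                signature = tuple(sorted(new.items()))
                if signature not in seen:
                    seen.add(signature)
                    generated.append(new)
    return generated

# Hypothetical cluster of two tanker collision scenario instances.
cluster = [
    {"accident": "collision", "sea_state": "normal", "oil": "crude"},
    {"accident": "collision", "sea_state": "rough", "oil": "diesel"},
]

new_instances = exchange_elements(cluster)
print(len(new_instances))  # 2
```

Restricting the exchange to instances in the same cluster keeps the generated combinations close to situations that actually occurred, although in practice each generated instance would still need a plausibility check.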
The potential applications of the proposed method in aiding marine oil spill emergency response can be further explored in different ways. First, various decision intents can be combined to establish the reward function and train models, which may help improve the level of the marine oil spill emergency response. Second, when faced with a real oil spill accident, we strongly recommend the use of models with different decision intents because a single model cannot fully utilize the scenario tree of historical cases.
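The recommendation to consult models trained with different decision intents can be illustrated with a small sketch. It assumes that each trained model has already produced a Q value for every candidate action in the current scenario instance; the function name `recommend_per_intent`, the intent labels and all Q values below are hypothetical and serve only to show the mechanics.

```python
def recommend_per_intent(q_values_by_intent):
    """For each decision intent, return the action with the highest Q value."""
    return {
        intent: max(q_values, key=q_values.get)
        for intent, q_values in q_values_by_intent.items()
    }

# Hypothetical Q values from two models trained with different reward functions.
q_values = {
    "reduce_severity": {
        "Use of booms": 0.61,
        "Use of dispersants": 0.58,
        "Shut down sensitive resources": 0.40,
    },
    "protect_marine_life": {
        "Use of booms": 0.35,
        "Use of dispersants": 0.30,
        "Shut down sensitive resources": 0.72,
    },
}

for intent, action in recommend_per_intent(q_values).items():
    print(f"{intent}: {action}")
```

Presenting one recommendation per intent, rather than merging the Q values into a single score, leaves the trade-off between intents to the decision maker.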

Conclusions
A new approach that combines the CBR and DRL algorithms to aid in marine oil spill emergency response decision making is presented in this paper. The proposed method provides a useful task decomposition process that allows agents to learn tactical policies that can assist decision makers across different marine oil spill instances. Compared with traditional CBR, the proposed method only requires knowledge of a marine oil spill scenario or the construction of scenario instances. Because the proposed method combines the reward function in reinforcement learning with the decision intention and applies this approach to train multiple models with different decision intents, the suggested emergency response actions are easy to explain and more informative than those produced by the similarity matching-based CBR system. However, this article presents only two reward functions, which is not enough for a real, complex marine oil spill accident; addressing this limitation will be the focus of future studies.

| Spilled Oil-Toxicity (Soluble Aromatic Hydrocarbon Derivatives) | Evaluation Value |
| --- | --- |
| Almost insoluble in water and includes no oil-containing aromatic hydrocarbons | 0.2 |
| Heavy kerosene, some aromatic hydrocarbons and other oils | 0.6 |
| Gasoline, light kerosene, many aromatic hydrocarbons and other oils | 1.0 |