A Novel Framework of Real-Time Regional Collision Risk Prediction Based on the RNN Approach

Abstract: Regional collision risk identification and prediction is important for traffic surveillance in maritime transportation. This study proposes a framework for real-time prediction of regional collision risk that combines the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) technique, the Shapley value method and a Recurrent Neural Network (RNN). First, the DBSCAN technique is applied to cluster vessels in a specific sea area. Then the regional collision risk is quantified by calculating the contribution of each vessel and each cluster with the Shapley value method. Afterwards, the optimized RNN method is employed to predict the regional collision risk of the specific sea area over a short time horizon. As a result, the framework is able to determine and forecast the regional collision risk precisely. Finally, a case study is carried out with actual Automatic Identification System (AIS) data; the results show that the proposed framework is an effective tool for regional collision risk identification and prediction.


Introduction
Maritime transport is the backbone of international trade and the global economy. In recent decades, the rapid development and great volume of marine transportation [1] have led to higher marine traffic density and complexity, which make vessel collision accidents more likely. In general, vessel collisions may cause great loss of human life and property, as well as severe environmental pollution [2]. Collision risk is a major indicator used by navigators and surveillance operators to judge the collision danger between meeting vessels [3], and shore-based surveillance plays an important role in preventing vessel collision accidents [4]. In practice, the operators of a Vessel Traffic Service (VTS) system get access to real-time vessel data from modern navigational equipment [5], such as Automatic Radar Plotting Aids (ARPA), the Automatic Identification System (AIS) and the Electronic Chart Display and Information System (ECDIS). As shipping increases rapidly, the burden on VTS surveillance operators grows heavier. Hence, an advanced marine surveillance framework that can identify and predict collision risk between meeting ships precisely has emerged as an effective tool to lighten the burden on VTS surveillance operators.
Compared with the large amount of research on collision risk identification for single or multiple vessels, research on regional collision risk identification, which aims to determine the vessel collision risk in a certain water area, is limited to only a few studies. Researchers first described the regional collision risk by the number of collision accidents per unit time in a certain water area, based on historical statistical data. For example, the Formal Safety Assessment (FSA) concept and the Bayesian network method were used to evaluate the collision risk of vessels in Yangtze River waters in China with real accident data [6]. In addition, accident data from 1995 to 2015 in the southern port area of Shenzhen were selected and analyzed to build a regional risk assessment model of the port waterway [7].
Risk modeling and risk analysis based on the probability and consequences of accidents are commonly used to identify areas with a high incidence of accidents in the selected water areas, and sometimes to predict the consequences of accidents so that their impact can be controlled. Hu [8] considered five types of detailed accident characteristic information and used them in the risk classification and quantitative modeling process of the pilot port safety assessment of Shanghai Port. Debnath and Chin [9] applied vessel conflict theory to the quantitative measurement of waterway collision risk in Singapore harbors, using two proximity indicators in time and space to quantify the collision risk in the waterway. Montewka et al. [10] proposed a new geometric collision probability calculation method based on previous experience extracted from a large amount of vessel data using Monte Carlo and genetic algorithms. Le et al. [11] used vessel collision probability statistics from the Norwegian Classification Society to evaluate the risk factor of vessel collision with the platform.
Meanwhile, the analysis of non-accident data is also a significant approach in the field of regional collision risk identification. In general, researchers have applied vessel factors such as vessel speed, Distance at Closest Point of Approach (DCPA), Time to Closest Point of Approach (TCPA), and ship domains to the study of regional collision risk in selected water areas. Qu et al. [12] used three indicators, namely vessel speed dispersion, vessel acceleration, and number of fuzzy vessel domains, to evaluate the regional collision risk of vessels in the Singapore Strait. To quantify the collision risk between all vessels in an area, Bukhari et al. [5] combined DCPA, TCPA, bearing position, and change of Vessel's Compass Degree (VCD) to build a regional collision risk assessment framework based on fuzzy logic.
However, analysis based on historical case data has a shortcoming: the number of collisions and the collision rate are derived from historical data, which describe outcomes and cannot indicate the real-time collision risk and danger zones of vessels in the water area. As an estimate based on statistical distribution and prediction, the collision probability still differs from the actual collision risk in the water area at a given time, which limits its use for real-time regional collision risk. Recently, Liu et al. [13] proposed a regional collision risk calculation model based on AIS data. In that framework, cooperative game theory is used to model the regional collision risk, starting from information such as each single vessel's DCPA and TCPA. Compared with the traditional model, the framework obtains more accurate results for the instantaneous regional collision risk and avoids the influence of traffic flow in the water area. Therefore, in this paper, DCPA, TCPA and the Ship Domain Overlapping Index (SDOI) are used as the vessel parameters for measuring the regional collision risk of the selected water area.
Moreover, regional collision risk prediction in selected water areas can help maritime surveillance operators grasp the trend and value of the regional collision risk more accurately. In recent years, research on regional collision risk prediction has been carried out and certain results have been achieved. Nivoliantou et al. [14] selected the major accidents in the Aegean waters from 2008 to 2012 as historical data and carried out a Bayesian network-based forecasting study of the vessels' navigational environment risk in the Aegean waters. Fan et al. [15] extracted the influencing factors from the data of 218 vessel accidents in the research water area in 2013 and combined these with a Bayesian network to build a prediction model of collision accident levels in the Yangtze River. Kim et al. [16] used a deep neural network called ship traffic extraction network, which consisted of a convolutional neural network, together with a large amount of historical AIS data to make mid-term and long-term predictions of vessel traffic in congested port waters. Okazaki et al. [17] used the Support Vector Machine (SVM) method to study collision risk prediction for vessels at the exit of a sea route. Fukuto and Imazu [18] combined vessel course prediction with the Obstacle Zone by Target (OZT) method to determine the collision probability area. Although regional collision risk prediction has been studied with various methods, certain limitations still exist. Some models require a large amount of historical vessel accident data to build a database for predicting the risk of regional vessel collisions, which introduces randomness into the real-time quantification and short-term prediction of vessel collision risk in water areas. In addition, the existing models perform well on mid-term and long-term predictions of regional collision risk, but there is no corresponding research on real-time and short-term predictions. Real-time and short-term regional collision risk prediction in water areas with limited data is therefore a gap in current research.
As an important method in machine learning and artificial neural networks, RNNs can predict sequence data that are ordered in time. RNNs have been applied to a variety of problems, especially those involving ordered data processing. For data prediction based on existing databases, RNNs have shown great success, for example in image processing [24,25], language processing [26], and prediction problems in different fields [27][28][29][30][31]. RNNs contain recurrent connections that make them more powerful than traditional neural networks for modeling such sequence data. In speech recognition, for example, the network activations from the previous time step are fed back as inputs to influence the prediction at the current time step. These activations are stored in the internal state of the network, which can in principle hold long-term temporal context information. This mechanism allows RNNs to change the context dynamically according to the history of the input sequence, instead of using a static context as traditional neural networks do. Besides, RNNs do not need large memory storage in applications. Such advantages give RNNs better performance in prediction and in handling sequence data than traditional feed-forward neural networks.
Based on the above observations, a prediction framework for regional collision risk is proposed in this article to identify and predict regional collision risk precisely. The framework uses limited non-accident data from the selected water area to predict regional collision risk more accurately and effectively. Vessel parameters, including speed, position, course, length, and the number of vessels that entered the selected water area through the traffic lane in a specific period of time, are all extracted from AIS data. The framework consists of two steps: regional collision risk identification with the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm and the improved Shapley value method, and regional collision risk prediction with the RNN. As a result, the framework can clearly display the future distribution of regional collision risk and mark special areas that deserve attention.
The rest of this paper is organized as follows. Section 2 introduces the regional collision risk identification. Section 3 describes the optimized RNN approach for regional collision risk prediction in detail. In Section 4, a case study is developed to verify the feasibility and effectiveness of the framework. Finally, conclusions, discussions and future perspectives are presented in Section 5.

Procedure of Regional Collision Risk Prediction
Overall, the procedure of the proposed prediction framework for regional collision risk is as follows. First, by combining the DBSCAN spatial clustering algorithm and the improved Shapley value method, the real-time regional collision risk in the selected water area is obtained. Then, historical data consisting of the regional collision risk and the number of vessels that entered the selected water area through the traffic lane during the corresponding time period are used to build the RNN training data set. Based on the training data set, the prediction model is trained with the proposed RNN algorithm in TensorFlow on a personal computer (Core i7 CPU with 8 GB RAM). Finally, for a given time point, the regional collision risk can be predicted by the trained RNN, which helps the maritime traffic surveillance department grasp the future regional collision risk and allocate its surveillance resources more reasonably.

The Regional Collision Risk Identification
As a way of expressing the collision risk of vessels in a certain area, the regional collision risk is the overall collision risk formed by all vessels in that area. Therefore, the contribution of all vessels in the area to the collision risk is used to measure the regional collision risk.
In a busy water area that meets the research requirements, many vessels appear at the same time. To calculate the regional collision risk in the water area at that moment while improving calculation efficiency and reducing computational complexity, this study uses the spatial density clustering algorithm DBSCAN to cluster the vessels in the selected water area.
Clustering is an unsupervised learning method and a branch of data mining. Many clustering algorithms have been proposed [32][33][34][35][36][37], such as K-Means clustering [32] and DBSCAN [34]. DBSCAN is a classic density-based clustering method proposed by Ester [34]. Its main idea is to cluster a given set of objects in space by grouping densely distributed points into clusters and marking the remaining sparse points as noise. DBSCAN can identify noise among the objects and find clusters of any shape and size that reach the target density, and its results are relatively accurate when the number of clusters is not specified in advance. It has been widely used in fields such as medicine, biology [38], analytical chemistry [39], and marine transportation [3,4]. This study uses DBSCAN in the process of quantifying the regional collision risk in the selected water area, which improves the efficiency and visualization of the quantification process while reducing the amount and complexity of calculation. After clustering with DBSCAN, several clusters are obtained in the area, as shown in Figure 1.
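The clustering step can be illustrated with a minimal, self-contained sketch of DBSCAN. It is not the authors' implementation: the function name `dbscan`, the planar coordinates (assumed here to be in nautical miles after projecting AIS longitude and latitude) and the parameter values `eps` and `min_pts` are illustrative assumptions.

```python
from math import hypot


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # All points within eps of point i (the point itself is included).
        xi, yi = points[i]
        return [j for j in range(n)
                if hypot(points[j][0] - xi, points[j][1] - yi) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nb)
    return labels


# Hypothetical vessel positions (x, y) in nautical miles on a local plane.
vessels = [(0.0, 0.0), (0.3, 0.1), (0.2, 0.4),
           (5.0, 5.0), (5.2, 5.1), (5.1, 4.8),
           (10.0, 0.0)]
labels = dbscan(vessels, eps=1.0, min_pts=3)
print(labels)
```

With these assumed values, the two groups of three nearby vessels form two clusters and the isolated vessel is marked as noise.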

The vessels in this area are thus divided into several clusters. The collision risk of each cluster is obtained from the collision risk of every single vessel based on Equation (1), and the regional collision risk is then calculated from the clusters' collision risk based on Equation (2).
$$CRC = \sum_{i=1}^{n} W_i \, CR_i \quad (1)$$
where $CRC$ refers to the collision risk of each cluster, $CR_i$ refers to the collision risk of every single vessel and $W_i$ refers to every single vessel's contribution.
$$RCR = \sum_{j} W_j \, CRC_j \quad (2)$$
where $RCR$ refers to the regional collision risk, $CRC_j$ refers to the collision risk of every single cluster and $W_j$ refers to every single cluster's contribution.
There are $n$ vessels in each cluster, and any vessel in the cluster forms vessel pairs with the other vessels. The collision risk of each single vessel in the cluster is determined by summing the collision risk of all of its vessel pairs:
$$CR_i = \sum_{j=1}^{n} CR_{ij}$$
where $CR_i$ is the collision risk of Vessel $i$ and $CR_{ij}$, $j = 1, 2, \ldots, n$, is the collision risk of the vessel pairs that include Vessel $i$.
In this study, the analytical method is used to calculate the collision risk of a vessel pair. Because it determines the collision risk directly from the vessels' dynamic elements, the analytical method is more objective than the fuzzy logic method based on marine experts. It is also simpler than methods based on artificial intelligence, and its results are unaffected by a previous training set. To avoid the limitations of those methods and improve the calculation accuracy, this paper uses DCPA, TCPA, and SDOI to calculate the collision risk.
Considering the simplicity and practical application of the model, this study obtains the essential vessel parameters, including speed, course, position (longitude and latitude) and length, directly by decoding and extracting AIS data.
In every vessel pair, one vessel is assigned as the own vessel and the other as the target vessel. The speed, course, longitude, latitude and length of the own vessel and the target vessel are expressed as $(v_0, c_0, x_0, y_0, l_0)$ and $(v_t, c_t, x_t, y_t, l_t)$.
The relative distance ($r$), relative bearing ($c_b$), relative speed ($v_r$) and relative course ($c_r$) are calculated from these parameters, and DCPA, TCPA and SDOI are then calculated from them. The collision risk of a vessel pair is expressed in terms of DCPA, TCPA [3] and SDOI [13] in the form of negative exponential equations, where the weights $a_{DCPA}$, $a_{TCPA}$ and $a_{SDOI}$ sum to 1 and can be set according to the actual situation of the selected water area for better accuracy.
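As an illustration of how DCPA and TCPA follow from relative motion, the sketch below uses the standard closest-point-of-approach kinematics. It is a hedged example rather than the equations of this paper: it assumes positions already projected onto a local plane (x east, y north, in nautical miles), speeds in knots and courses in degrees clockwise from north, and the function name `cpa` is hypothetical. SDOI and the negative exponential risk function are not covered.

```python
from math import sin, cos, radians, hypot


def cpa(own, target):
    """Return (dcpa, tcpa) for two vessels given as (speed, course_deg, x, y)."""
    v0, c0, x0, y0 = own
    vt, ct, xt, yt = target
    # Velocity components; course is measured clockwise from north (the y axis).
    vx0, vy0 = v0 * sin(radians(c0)), v0 * cos(radians(c0))
    vxt, vyt = vt * sin(radians(ct)), vt * cos(radians(ct))
    # Position and velocity of the target relative to the own vessel.
    rx, ry = xt - x0, yt - y0
    wx, wy = vxt - vx0, vyt - vy0
    w2 = wx * wx + wy * wy
    if w2 < 1e-12:
        # No relative motion: the distance never changes.
        return hypot(rx, ry), 0.0
    tcpa = -(rx * wx + ry * wy) / w2                 # time at which the distance is minimal
    dcpa = hypot(rx + wx * tcpa, ry + wy * tcpa)     # distance at that time
    return dcpa, tcpa


# Head-on example: own vessel northbound, target 10 nm ahead and southbound.
own_vessel = (10.0, 0.0, 0.0, 0.0)
target_vessel = (10.0, 180.0, 0.0, 10.0)
dcpa, tcpa = cpa(own_vessel, target_vessel)
print(round(dcpa, 6), round(tcpa, 6))
```

In this convention a negative TCPA indicates that the closest point of approach has already passed.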
After the collision risk of the vessel pairs in a cluster has been quantified, this paper uses the improved Shapley value to determine the summing weight of each vessel in every single cluster, in order to obtain the collision risk of each cluster more accurately. The classical Shapley value of vessel $i$ is
$$S_i = \sum_{G \subseteq N,\ i \in G} \frac{(g-1)!\,(n-g)!}{n!}\,\bigl[A(G) - A(G - \{i\})\bigr]$$
where $S_i$ refers to the summing weight of vessel $i$ determined by the Shapley value, $G$ is a group that includes vessel $i$, $g$ is the number of vessels in group $G$, $N$ is the group of all vessels, $n$ is the number of vessels in group $N$, $A(G)$ refers to the total vessel number of group $G$ and $A(G - \{i\})$ refers to the total vessel number of group $G$ without vessel $i$. In the improved method, $S'_i$ is the summing weight of vessel $i$ determined by the improved Shapley value, $f$ refers to the influencing factor of collision risk, $CR_{fi}$ refers to the collision risk of factor $f$ for vessel $i$, $p_{fi}$ refers to the weight of the factor for vessel $i$, and $\sigma_f$ refers to the influence coefficient, which can be determined by maritime experts.
The improved Shapley value method can also be used to quantify the summing weight of the clusters in Equation (2), so that the regional collision risk can be quantified more precisely. After calculating the regional collision risk at different time points, a series of regional collision risk values for the selected water area is obtained. As the second step of the framework, this study uses an RNN to predict the regional collision risk of the selected water area, as described in the following sections.
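To make the weighting idea concrete, the following sketch computes classical Shapley values by enumerating coalitions. It illustrates only the classical Shapley value, not the improved variant used in this paper; the function names, the vessel labels and the characteristic function (the sum of hypothetical pairwise risks inside a coalition) are assumptions chosen for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Classical Shapley value of each player for a characteristic function `value`."""
    n = len(players)
    result = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):  # k = size of the coalition before player i joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without_i = frozenset(subset)
                # Marginal contribution of i to this coalition.
                total += weight * (value(without_i | {i}) - value(without_i))
        result[i] = total
    return result


# Hypothetical pairwise collision risks inside one cluster of three vessels.
pair_risk = {
    frozenset({"A", "B"}): 0.6,
    frozenset({"A", "C"}): 0.2,
    frozenset({"B", "C"}): 0.0,
}


def coalition_risk(coalition):
    # Assumed characteristic function: total risk of the pairs fully inside the coalition.
    return sum(r for pair, r in pair_risk.items() if pair <= coalition)


shares = shapley_values(["A", "B", "C"], coalition_risk)
total_risk = sum(shares.values())
weights = {v: s / total_risk for v, s in shares.items()}  # normalised summing weights
print(weights)
```

Enumerating all coalitions costs exponential time in the number of vessels, which is one practical reason to compute weights per cluster rather than over the whole area.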

Recurrent Neural Network
In recent decades, the prediction ability of RNNs has been studied widely. Emad et al. [40] pointed out that an RNN gave better results in stock trend prediction than Time-Delay Neural Networks and Probabilistic Neural Networks. Tian and Pan [41] used a Long Short-Term Memory Recurrent Neural Network to capture the nonlinearity and randomness in short-term traffic flow prediction more effectively. Maher and Biswajeet [42] applied an RNN model to predict the injury severity of traffic accidents; the RNN model outperformed the Multilayer Perceptron and Bayesian Logistic Regression models in comparative analyses. Xu et al. [43] carried out risk predictions from Electronic Health Records with their proposed RNN approach, and the experimental results showed that the RNN prediction model performed well. These applications of RNN models to prediction in various fields motivated the evaluation of the proposed RNN model for real-time prediction of regional collision risk.
A simple RNN consists of an input layer, a hidden layer, and an output layer, where $X_t$ is the input, $Y_t$ is the output, $H_t$ is the hidden layer, $U_t$ is the weight of the input value $X_t$, and $V_t$ is the weight of the hidden layer $H_t$. In an RNN, the input information continuously loops through the structure. The simple structure of an RNN is shown in Figure 2, and the structure of the unfolded RNN is shown in Figure 3. In Figure 3, $X_1, \ldots, X_t$ are the inputs at different times, $Y_1, \ldots, Y_t$ are the outputs at different times, $H_1, \ldots, H_t$ are the hidden layers at different times, $U_1, \ldots, U_t$ are the weights of the input values at different times, $V_1, \ldots, V_t$ are the weights of the hidden layers at different times, and $W_1, \ldots, W_{t-1}$ are the transfer weights of the hidden layers at different times. It can be seen that the inputs of the RNN include not only the current input data but also the previous input information. The decision of the RNN at previous time steps affects its decision at the following time steps. This continuous information is saved in the hidden state of the recurrent network. The hidden state spans multiple time steps and is passed forward from one step to the next, so it always affects the network's processing of each new example; at the same time, the hidden state is constantly updated. Therefore, the current input and the previous inputs become the two input sources of the RNN, and their combination determines how the RNN processes new data. In traditional feed-forward networks, neurons pass information directly forward and the transmitted information never revisits a node it has already passed, whereas the recurrent network also draws on historical information. The value of the hidden layer $H_t$ depends not only on the current input $X_t$ but also on the value of the previous hidden layer $H_{t-1}$. The weight matrix $W_{t-1}$ weights the previous value of the hidden layer when it is used in the hidden layer at the current time step. To predict the regional collision risk in the selected water area, an RNN is used in this study.
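The recurrence described above can be sketched in a few lines. This is a minimal scalar illustration under stated assumptions, not the trained model of this study: it assumes the weights are shared across time steps (as in the standard RNN formulation), an initial hidden state of zero, and an identity (linear) activation by default; the function name `rnn_forward` and all numeric values are hypothetical.

```python
def rnn_forward(inputs, U, W, V, b_h=0.0, b_y=0.0, activation=lambda z: z):
    """Run a scalar RNN over `inputs` and return the output at every time step."""
    h = 0.0  # hidden state before the first input (assumed to start at zero)
    outputs = []
    for x in inputs:
        # The new hidden state mixes the current input with the previous hidden state.
        h = activation(U * x + W * h + b_h)
        # The output at this step is read from the hidden state.
        outputs.append(V * h + b_y)
    return outputs


# A single impulse at the first step keeps influencing later outputs through h.
outputs = rnn_forward([1.0, 0.0, 0.0], U=1.0, W=0.5, V=2.0)
print(outputs)
```

Because the hidden state is carried forward, the later outputs are non-zero even though the later inputs are zero, which is the dependence on history that distinguishes the RNN from a feed-forward network.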

RNN Model for Regional Collision Risk Prediction
Because the RNN has the advantage of updating the system weights based on historical input data, this study applies the RNN to regional collision risk prediction at continuous time points. An RNN with one hidden layer is used as the prediction model, with two sets of matrices as inputs and one matrix as output. A linear function is used as the activation function of the hidden layer in the proposed network. The inputs are the number of vessels that entered the selected water area through the traffic lane in different time periods and the regional collision risk value of the selected water area at different times. The number of vessels passing through the traffic separation lane in two consecutive time periods forms one set of 2 × 2 input matrices, and the regional collision risk values of the selected water area at two consecutive time points form another set of 2 × 2 input matrices. The regional collision risk value at each time point is used as the output data of the model, and the RNN is trained on a data sample. What the RNN learns from the samples is the basis for predicting the regional collision risk in the selected water area. The structure of the RNN used in the prediction model is shown in Figure 4.
Among them, X1-2t, 1, X1-2t, 2 are input matrices, Y1-t are 1 × 1 output matrix, H1-2t are hidden layers, bH is bias vector of hidden layer, bY is bias vector of output, U1-2t, 1, U1-2t, 2, are the weights of different input matrices, and V1-t are the weights of the hidden layer H2-2t. Therefore, we can get the formula, and the regression equation about the output value. The difference between the regression equation of the output value and the output value is used to obtain a new equation shown as Equation (20). When the new equation approaches to 0, it indicates that the RNN has achieved good results in learning the samples.
Construct the historical data obtained from Section 2 as a training database for RNN to learn. By calculating the regional collision risk at different time points, the time distribution of regional collision risk in selected water area can be obtained as inputs with the number of entered vessels at different time periods, therefore the RNN is used to predict regional collision risk of selected water area at following subsequent time point. The flow chart of this second-step framework is shown in Figure 5. Among them, X 1−2t, 1 , X 1−2t, 2 are input matrices, Y 1−t are 1 × 1 output matrix, H 1−2t are hidden layers, b H is bias vector of hidden layer, b Y is bias vector of output, U 1−2t, 1 , U 1−2t, 2 , are the weights of different input matrices, and V 1−t are the weights of the hidden layer H 2−2t . Therefore, we can get the formula, and the regression equation about the output value. The difference between the regression equation of the output value and the output value is used to obtain a new equation shown as Equation (20). When the new equation approaches to 0, it indicates that the RNN has achieved good results in learning the samples.
The historical data obtained in Section 2 are assembled into a training database for the RNN. Calculating the regional collision risk at different time points gives the time distribution of regional collision risk in the selected water area, which is used as input together with the number of vessels that entered in the different time periods. The trained RNN is then used to predict the regional collision risk of the selected water area at the subsequent time point. The flow chart of this second step of the framework is shown in Figure 5.

Data Selection
To verify the validity of the prediction framework, a case study was performed at the western entrance of the Malacca Strait, which is the busiest water area of Singapore. The selected water area lies between 103.4° E and 103.6° E in longitude and between 1.1° N and 1.2° N in latitude, with data from maritime safety research [44,45], as shown in Figures 6 and 7. This area is located at the end of a traffic separation scheme. Vessels either enter the port and anchorage after passing through the traffic lane, or enter the traffic lane from the port and anchorage. Without the control of the traffic separation scheme, the higher traffic density and the larger number of vessel crossings lead to greater collision risk. The selected area does not contain anchorages, narrow channels, or shallow water areas.
The Automatic Identification System (AIS) is a navigation assistance system for maritime safety and communication that operates between vessels and between vessels and shore stations. AIS automatically exchanges important information such as position, speed, heading, vessel name, call sign, and Maritime Mobile Service Identity (MMSI). According to the amendments to the International Convention for the Safety of Life at Sea adopted by the International Maritime Organization, all ships of 300 gross tonnage and upwards engaged on international voyages, cargo ships of 500 gross tonnage and upwards not engaged on international voyages, and all passenger ships must be fitted with AIS equipment, so that maritime traffic surveillance departments can obtain vessel data. As an important means of obtaining vessel motion data, AIS derives its position information from the Global Positioning System (civil GPS). Its positioning accuracy is within 10 m, which meets the positioning requirements for vessels in maritime traffic surveillance. Based on GPS data that meet the accuracy requirements, AIS combines dynamic vessel information, such as position, speed, rate of turn, and heading, with static vessel information, such as vessel name, call sign, draught, and dangerous goods. This information is broadcast over VHF channels to nearby vessels and shore stations. The dynamic and static information enables neighboring vessels and shore stations to obtain timely information on all vessels in the vicinity, which greatly helps to ensure the safety of maritime transportation.
In this study, MMSI, longitude, latitude, speed, heading, and vessel length were selected from the 27 kinds of dynamic and static information contained in AIS data. After processing, these data were used to calculate the real-time regional collision risk of the selected water area. The selected data are the AIS records received in the selected water area from 1800 to 1900 on 3 January 2014.

AIS Data Screening
The obtained AIS data are decoded and stored in a database, then pre-processed and cleaned to obtain valid AIS data. Pre-processing mainly filters the AIS data by longitude and latitude according to the position of the selected water area. Records with an MMSI of 0 are deleted, as are records whose position, speed, or course exceeds a reasonable value.
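The screening rules can be sketched as a simple record filter. The bounding box follows the Data Selection section; the field names (`mmsi`, `lon`, `lat`, `sog`, `cog`) and the speed limit `MAX_SPEED_KN` are assumptions made for illustration, since the paper does not state which thresholds it treats as reasonable.

```python
from typing import Dict, Iterable, List

# Bounding box of the selected water area (from the Data Selection section).
LON_MIN, LON_MAX = 103.4, 103.6
LAT_MIN, LAT_MAX = 1.1, 1.2
# Assumed plausibility limit in knots; the paper does not give its threshold.
MAX_SPEED_KN = 50.0

Record = Dict[str, float]


def is_valid(rec: Record) -> bool:
    """Return True if an AIS record passes all screening rules."""
    return (
        rec["mmsi"] != 0                          # drop records with MMSI of 0
        and LON_MIN <= rec["lon"] <= LON_MAX      # inside the area (longitude)
        and LAT_MIN <= rec["lat"] <= LAT_MAX      # inside the area (latitude)
        and 0.0 <= rec["sog"] <= MAX_SPEED_KN     # plausible speed over ground
        and 0.0 <= rec["cog"] < 360.0             # plausible course over ground
    )


def screen_ais(records: Iterable[Record]) -> List[Record]:
    """Keep only the AIS records that pass the screening rules."""
    return [rec for rec in records if is_valid(rec)]
```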

AIS Data Processing
AIS information is sent discontinuously by different vessels at different time intervals. Because this study needs the regional collision risk of the selected water area at specific times, the filtered AIS data are interpolated to obtain the characteristic information of each vessel at those times.
AIS data within 3 minutes before and after each of the time points 1800, 1810, ..., 1900 were collected, and the vessel distribution in the selected water area at each time point was determined from them for use in the following study.
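A minimal sketch of this step is given below. The paper does not specify the interpolation algorithm, so linear interpolation between the two reports that bracket the target time is assumed here, with the course interpolated along the shorter arc. The field names and the helper names `interpolate` and `state_at` are hypothetical.

```python
from typing import Dict, Optional, Sequence

Report = Dict[str, float]  # keys: t (seconds), lon, lat, sog, cog


def interpolate(before: Report, after: Report, t: float) -> Report:
    """Linearly interpolate a vessel state at time t between two reports."""
    span = after["t"] - before["t"]
    if span <= 0 or not before["t"] <= t <= after["t"]:
        raise ValueError("t must lie between two distinct, ordered reports")
    w = (t - before["t"]) / span
    # Signed course difference in [-180, 180) so that 350 -> 10 passes through 0.
    dcog = (after["cog"] - before["cog"] + 180.0) % 360.0 - 180.0
    return {
        "t": t,
        "lon": before["lon"] + w * (after["lon"] - before["lon"]),
        "lat": before["lat"] + w * (after["lat"] - before["lat"]),
        "sog": before["sog"] + w * (after["sog"] - before["sog"]),
        "cog": (before["cog"] + w * dcog) % 360.0,
    }


def state_at(reports: Sequence[Report], t: float,
             window_s: float = 180.0) -> Optional[Report]:
    """Estimate one vessel's state at time t from reports within +/- window_s.

    Returns None when the vessel has no report on both sides of t inside
    the window (3 minutes by default, as in the text).
    """
    near = [r for r in reports if abs(r["t"] - t) <= window_s]
    earlier = [r for r in near if r["t"] <= t]
    later = [r for r in near if r["t"] >= t]
    if not earlier or not later:
        return None
    before = max(earlier, key=lambda r: r["t"])
    after = min(later, key=lambda r: r["t"])
    if before["t"] == after["t"]:
        return dict(before)  # a report exists exactly at t
    return interpolate(before, after, t)
```

Vessels with reports on only one side of the time point are skipped in this sketch; extrapolating from speed and course would be an alternative design.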

Prediction Model Application
After the processed AIS database had been established as described above, spatial clustering was carried out for the selected water area at each time point; the results are shown in Figures 8-13.
The regional collision risk in the selected water area at 1800, 1810, ..., 1900 was calculated with the improved Shapley value method after the clustering analysis at each time point; the values are given in Table 1. The number of vessels that entered the selected water area through the traffic lane is shown in Table 2. Next, the regional collision risk at the different time points and the number of vessels entering the selected water area from the traffic lane in the different time periods (the 10 minutes before each time point) were used as input parameters to construct the RNN training data set; see Table 3.
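The construction of training samples from the two time series can be sketched as a sliding window. The alignment assumed here follows the 1900 prediction described in this section: the risk at the two preceding time points and the vessel counts of the two periods leading up to the target are the inputs, and the risk at the target time point is the output. The function name `build_samples` and the tuple layout are illustrative and not taken from the paper.

```python
from typing import List, Sequence, Tuple

# ((risk[k-2], risk[k-1]), (entered[k-1], entered[k]), risk[k])
Sample = Tuple[Tuple[float, float], Tuple[int, int], float]


def build_samples(risk: Sequence[float], entered: Sequence[int]) -> List[Sample]:
    """Build sliding-window training samples.

    risk[i]    : regional collision risk at time point i (10-minute spacing)
    entered[i] : vessels that entered during the 10 minutes before time point i
    """
    if len(risk) != len(entered):
        raise ValueError("risk and entered must have the same length")
    samples: List[Sample] = []
    for k in range(2, len(risk)):
        inputs_risk = (risk[k - 2], risk[k - 1])
        inputs_entered = (entered[k - 1], entered[k])
        samples.append((inputs_risk, inputs_entered, risk[k]))
    return samples
```

With seven time points (1800 to 1900), this yields five samples, the last of which has the 1900 risk as its target.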
The above data were used to train the RNN. The parameters and weights were updated iteratively, with the number of training rounds set to 3000. The output set obtained by the RNN approach is shown in Table 4, and the training process of the RNN is shown in Figure 14.
Figure 14 shows that the values of mae and loss decrease from large to small and gradually stabilize. The meaning of loss is given by Equation (20), and mae denotes the mean absolute error between the network output and the target value. The lower the value of mae, the better the goodness of fit. The lower the value of loss, the better the predictive ability of the model. Finally, the regional collision risk values of the selected water area at 1840 and 1850 and the numbers of vessels that entered in 1840-1850 and 1850-1900 were fed into the trained prediction model to calculate the value at 1900. The predicted regional collision risk at 1900 is 0.11747.
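For reference, the mae metric can be computed as in the sketch below, assuming it is the standard mean absolute error. The numbers in the usage lines are illustrative only and are not values from the paper's tables.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only (not taken from the paper).
actual_risk = [0.10, 0.20, 0.30]
predicted_risk = [0.12, 0.18, 0.33]
mae = mean_absolute_error(actual_risk, predicted_risk)
```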

Validation of Prediction Framework
To verify the validity of the proposed framework, the actual regional collision risk of selected waters calculated based on AIS data was compared with the results obtained from the RNN prediction framework.
The RNN regional collision risk prediction model was used to predict the regional collision risk of the selected water area at different time points from 1800 to 1900 on 3 January 2014. The results obtained from the prediction framework and the actual values of regional collision risk obtained from historical AIS data are shown in Table 5 and Figure 15.
Next, data sets were constructed from the AIS data of the selected water area at different time points from 1800 to 1900 on 4 and 5 January 2014, together with the number of vessels that entered the selected waters through the traffic lane in the different time periods. The RNN prediction framework proposed in this study was then used to predict the regional collision risk for each day.
The data set constructed from the AIS historical data of 4 and 5 January 2014 is shown in Table 6. The RNN prediction framework proposed in this study was trained on this data set, the factor weights were updated iteratively, and the number of training rounds was set to 3000 and 7000. The prediction results are shown in Table 7 and the training processes of the RNN in Figures 16 and 17; in each figure, the values of mae and loss decrease from large to small and gradually stabilize.
Table 5. 03-Jan-14 Regional Collision Risk.

Figure 15. 03-Jan-2014 Regional collision risk.
In the next step, the prediction models obtained by this approach were used to predict the regional collision risk at 1900 on 4 and 5 January 2014. The predicted and actual values of regional collision risk are shown in Tables 8 and 9 and in Figures 18 and 19.
Table 8. 04-Jan-14 Regional Collision Risk.
Figure 19. 05-Jan-2014 Regional collision risk.
Table 9. 05-Jan-14 Regional Collision Risk.
Based on the above observations, the regional collision risk predicted by the RNN prediction framework at 1900 on 3, 4, and 5 January 2014 is close to the actual value obtained from historical AIS data. The predictions for the other time points are also close to the actual values. Comparing the predicted values with the actual historical data, Figures 15, 18, and 19 show that the predicted values obtained from the RNN prediction framework are close to the actual values of regional collision risk, and that the trends of the predicted and actual values agree well. The results of the application and validation show that the RNN prediction framework proposed in this study can effectively predict the regional collision risk in a specific water area.

Conclusions
A regional collision risk prediction framework based on real-time AIS data is proposed in this paper. To improve efficiency and reduce the computational complexity of quantifying the regional collision risk, the DBSCAN spatial clustering algorithm is used to obtain the cluster distribution of vessels in the selected water area, which yields several clusters, each containing several vessels. An improved Shapley value method is then used to determine the contribution of every single vessel in each cluster and the risk contribution of each cluster in the selected water area, giving the regional collision risk of the selected water area at a specific time point. Next, the regional collision risk at a series of time points obtained through these steps, together with the number of vessels that entered the selected water area through the traffic lane during the corresponding time periods, is used to build the data set for training the RNN prediction model. In the second step of the framework, the RNN is trained on these data sets, and the regional collision risk of the selected water area at a future time point is obtained. To validate the proposed prediction framework, a case study was carried out in a water area of Malacca, Singapore, and the results showed that the RNN prediction framework proposed in this study can effectively predict the regional collision risk in a specific water area.
In contrast to other models, this prediction framework starts from the calculation of the collision risk of a single vessel and uses real-time AIS data of the selected water area and other information to predict the regional collision risk at a future time point. Moreover, the proposed framework can achieve more accurate real-time prediction of regional collision risk without requiring the construction of a large database or a large index set. Applying the RNN prediction framework to regional collision risk can help surveillance operators better monitor the future trend of regional collision risk. In future, the framework can be applied to other water areas that need regional collision risk monitoring, such as port waters and waterways.
During AIS data processing, interpolation is used to bring AIS data reported at different times to a common time point. To obtain more accurate vessel position information, more dynamic vessel parameters, such as Speed Over Ground, Course Over Ground, Change of Speed, and Rate of Turn, should be considered in future studies. When calculating the collision risk of every single vessel, the ship domain is assumed to be circular in the SDOI parameters used in this model; more advanced geometric representations of the ship domain, such as an ellipse, should be used to improve the accuracy of the model. In addition, the prediction framework involves cluster analysis of the vessel distribution at multiple moments in the selected water area, and each cluster analysis needs appropriate parameters determined from real-time conditions. Therefore, an optimized DBSCAN spatial clustering algorithm should be used in future research. The limited size of the training data and the simple structure of the RNN applied here can affect the predictions. To achieve more accurate predictions, future work will consider more factors, adopt more advanced network structures, construct larger training data sets, and apply optimized RNN algorithms.

Conflicts of Interest:
The authors declare no conflict of interest.