Prediction of Sea Surface Temperature in the China Seas Based on Long Short-Term Memory Neural Networks

Abstract: Sea surface temperature (SST) in the China Seas has shown an enhanced response during both the accelerated global warming period and the hiatus period, causing local climate changes and affecting the health of coastal marine ecological systems. SST distribution prediction in this area, especially seasonal and yearly prediction, could therefore provide information to help understand and assess the future consequences of SST changes. The past few years have witnessed the application and achievements of neural network technology in SST prediction. Due to the diversity of SST features in the China Seas, long-term and high-spatial-resolution prediction remains a crucial challenge. In this study, we adopted long short-term memory (LSTM)-based deep neural networks for 12-month lead time SST prediction from 2015 to 2018 at a 0.05° spatial resolution. Considering the sub-regional differences in the SST features of the study area, we applied self-organizing feature maps (SOM) to classify the SST data first, and then used the classification results as additional inputs for model training and validation. We selected nine models differing in structure and initial parameters for an ensemble to overcome the high variance in the output. The statistics of the four years' SST difference between the predicted SST and Operational SST and Sea Ice Analysis (OSTIA) data show that the average root mean square error (RMSE) is 0.5 °C for a one-month lead time and 0.66 °C for a 12-month lead time. The southeast of the study area shows the highest prediction accuracy, with an RMSE of less than 0.4 °C for a 12-month prediction lead time. The results indicate that our model is feasible and provides accurate long-term and high-spatial-resolution SST prediction. The experiments show that introducing appropriate class labels as auxiliary information can improve the prediction accuracy, and that integrating models with different structures and parameters can increase the stability of the prediction results.


Introduction
Changes in sea surface temperature (SST) vary regionally. Most of the global ocean has experienced a warming SST trend, while a small portion has experienced cooling, e.g., the Atlantic south of Greenland and some areas of the equatorial Pacific [1]. SST profoundly affects the local climate and is a significant contributor to regional marine ecological system health [2]. Recent studies have shown that the SST over the China Seas exhibited distinct trends during the accelerated global warming period and the hiatus period, such as a faster rising rate or a more significant downward trend than


Data
The data used in this study were derived from two SST reanalysis L4 products, SST_GLO_SST_L4_REP_OBSERVATIONS_010_011 and SST_GLO_SST_L4_NRT_OBSERVATIONS_010_001, which are available on the Copernicus Marine Environment Monitoring Service (CMEMS) website (http://marine.copernicus.eu/services-portfolio/access-to-products/). They were processed at the Met Office using the Operational SST and Sea Ice Analysis (OSTIA) system. The SST is the foundation temperature, which is free of diurnal variability; the reference depth is 5-10 m. The products contain daily SST data from 1 January 1985 to 31 December 2007 and from 1 January 2006 to the present, respectively, on a global regular grid at a 0.05° resolution. A global evaluation of these two products against independent near-surface Argo data indicated that the reprocessed analysis had a −0.1 K bias with a standard deviation of 0.55 K, and the near-real-time analysis a −0.06 K bias with a standard deviation of 0.46 K, respectively [32,33]. We used the subset covering 0-45° N and 105-135° E (601 × 901 grid); the data from January 1985 to December 2014 served as the training set, and the data from January 2015 to December 2018 as the test set. The data pre-processing is shown in Figure 2. The first step was generating the monthly average SST from the daily SST data. Then the SST inter-annual mean (SSTM) from 1985 to 2014 was calculated for each calendar month, as formulated in Equation (1):

SSTM_m = (1/30) × Σ_(y=1985..2014) SST_(y,m), m = 1, 2, ..., 12  (1)

The SST inter-annual standard deviation (SSTD) was calculated in the same way. To obtain the SST anomaly (SSTA), the SSTM was subtracted from the monthly SST data.
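As a concrete illustration of this pre-processing chain (with the monthly averaging assumed already done), the climatology and anomaly computation can be sketched in NumPy; the array layout and the function name are our assumptions, not the authors' code:

```python
import numpy as np

def sst_climatology(monthly_sst):
    """SSTM, SSTD, and SSTA from monthly-mean SST.

    monthly_sst: array of shape (n_years, 12, n_lat, n_lon) holding the
    monthly average SST per grid cell (1985-2014 in the training stage).
    """
    sstm = monthly_sst.mean(axis=0)   # inter-annual mean per calendar month, Eq. (1)
    sstd = monthly_sst.std(axis=0)    # inter-annual standard deviation per month
    ssta = monthly_sst - sstm         # anomaly: monthly SST minus the climatology
    return sstm, sstd, ssta
```

On the full study area, the input would have shape (30, 12, ...) over the 601 × 901 grid, and the SSTA fields would feed the LSTM training described below.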

Proposed Method
The main processes of the proposed method include classification by SOM, deep neural network model training, iterative multi-step (IMS) prediction, and ensembling of the prediction results. The following sections detail these processes.





Classification
In this process, we aimed to classify the study area into different zones with similar SST features. We adopted the widely used SOM, a kind of unsupervised neural network introduced by Kohonen in 1981 [34]. Figure 3 shows the Kohonen SOM model represented as a two-dimensional sheet, depicting the topology of the network in the form of a lattice of neurons, which defines a discrete output space. Kohonen's SOM algorithm is based on competitive learning, which can cluster high-dimensional data vectors according to a similarity measure. Another two essential aspects of the algorithm are the time-varying neighborhood function h_j,i(x)(n) and the learning-rate parameter η(n) [35,36]. The neighborhood function h_j,i(x)(n) is shown in Equation (2):

h_j,i(x)(n) = exp(−d²_j,i / (2σ²(n)))  (2)

d_j,i = ||r_j − r_i||  (3)

where r_j defines the position of excited neuron j, r_i defines the position of winning neuron i, and d_j,i is the absolute distance between the excited neuron j and the winning neuron i.
σ(n) = σ_0 exp(−n/τ_1)  (4)

where σ(n) is the neighborhood radius at discrete time n (i.e., the number of iterations in model training), σ_0 is the value of the neighborhood radius at the initiation of the SOM algorithm, and τ_1 is a time constant chosen by the designer. The learning-rate parameter η(n) is shown in Equation (5):

η(n) = η_0 exp(−n/τ_2)  (5)

where η_0 is the initial learning rate and τ_2 is another time constant of the SOM algorithm. The basic steps involved in the application of the SOM algorithm are as follows:

1. Choose small random values for the initial weight vectors w_j(0), j = 1, 2, ..., l, where l is the number of neurons (the number of classes chosen by the designer), such that the w_j(0) differ from one another.
2. Select an input vector x randomly from the training data set.
3. Find the best-matching (winning) neuron i(x) at time step n using the minimum-distance criterion: i(x) = arg min_j ||x(n) − w_j||, j = 1, 2, ..., l.
4. Adjust the synaptic weight vectors of all excited neurons using the update formula: w_j(n + 1) = w_j(n) + η(n) h_j,i(x)(n) (x(n) − w_j(n)).

At the SOM training stage, the SSTM and SSTD from 1985 to 2014 (30 years) were used. The sequences of input and output vectors are shown in the flowchart in Figure 4, where the left part shows the input data construction. For each grid (601 × 901 in total) in the study area, the input vector contains 24 values (the 12 monthly SSTM values and the 12 monthly SSTD values). When varying the number of neurons, i.e., the number of classes, we found that under these input conditions the maximum number of classes the model could resolve for the study area was 130. Thus, the number of classes was set from 5 to 130 with a step size of 5, yielding 26 different class maps for the study area. The right part of Figure 4 shows the class maps generated by the SOM model, in which different colors represent different class labels. At this step, each grid received a class label in each class map; this label was used as auxiliary information in the following LSTM training stage.
When the number of classes was 5, the study area was basically divided based on the temperature zone. As the number of classes increased, the model tended to divide the area according to the SST features. Some regions with strong upwelling in the China Seas, as shown in [37], were well divided by the SOM model with a class number of 115.


Deep Neural Networks Models
To predict the SSTs for future time steps, we chose deep neural networks based on LSTMs. Because of its powerful learning capacity, the LSTM has become a focus of deep learning. The LSTM is a kind of recurrent neural network (RNN) architecture proposed by Hochreiter and Schmidhuber in 1997 [38]. Compared with standard RNNs, it solves the decaying error back-flow problem, captures long-term dependencies by default, and avoids the loss of short time lags in practice. Its proposers showed that LSTMs can learn to bridge time intervals in excess of 1000 time steps without loss of short-term information. The key to these capabilities is the cell state (c_t), which is the horizontal line running through the middle of the flow chart (Figure 6). It only has simple linear interactions in the data flow to preserve constant information, and the three gate units in the memory cell can add information to and remove information from the cell state: the forget gate unit (f_t), the input gate unit (i_t), and the output gate unit (o_t).

Figure 6. Structure of long short-term memory (LSTM) memory cell [39,40].
From time step t, the input vector x_t and the output vector of the previous step h_(t−1) pass through the memory cells. After adjusting the three gates, the LSTM outputs h_t and updates the cell state c_t [39,40]. W_f, W_c, W_i, W_o are the weight matrices in the following equations.

1. In the first step, the LSTM determines what information is removed from the previous cell state c_(t−1). The input vector x_t, the outputs h_(t−1) of the memory cells in the previous step, and the forget gate bias b_f are combined in the forget gate unit:

f_t = σ(W_f · [h_(t−1), x_t] + b_f)

The range of f_t is scaled by the sigmoid function σ to between 0 (completely remove) and 1 (completely keep).

2. In the next step, the LSTM determines what new information is stored in the cell state c_t. This includes computing a new candidate value c̃_t for the cell state and updating the input gate. The computation is as follows:

i_t = σ(W_i · [h_(t−1), x_t] + b_i)
c̃_t = tanh(W_c · [h_(t−1), x_t] + b_c)

3. The third step updates the cell state based on the above output values (element-wise product):

c_t = f_t * c_(t−1) + i_t * c̃_t
4. Finally, the LSTM determines the output h_t through the output gate:

o_t = σ(W_o · [h_(t−1), x_t] + b_o)
h_t = o_t * tanh(c_t)

In addition to the LSTM layers, leaky rectified linear units (leaky ReLUs) were used as the activation function, and dropout regularization was added to prevent overfitting. Adaptive moment estimation was chosen as the optimization function. The learning process of deep neural networks involves many hyper-parameters, for example, the initial learning rate, the learning rate schedule, the momentum, the number of hidden neurons, layers, and training iterations, the leaky ReLU parameter, the dropout ratio, etc. Finding a good hyper-parameter configuration is critical to achieving good performance before training the models. During configuration in this study, we first selected 500 sample grids in the study area. The SSTA data of each sample from 1985 to 2013 were used to train randomly configured models, and the 2014 data were used to evaluate the performance of the corresponding model. We initialized the models' weights using small random values from a uniform distribution; then, the procedure previously reported [41] was followed to find the range of initial values that allows the model to converge. The random search process, which has been shown to be more efficient than grid search in high-dimensional spaces [42], was repeated 300 times to assign different values to the parameters. We calculated the root mean square errors (RMSEs) of the prediction results for each random model to assess prediction accuracy. Instead of selecting the best model and discarding the rest, we retained the 9 hyper-parameter configurations with the smallest RMSEs. Figure 7a shows the LSTM trained with class labels; in this case, the SSTA sequence data of each grid and its corresponding class label were input to the model. The x in the time series of SSTA data denotes the input length.
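The four gate computations above condense into a single forward step per time step. The following NumPy sketch assumes each weight matrix acts on the concatenated vector [h_(t−1), x_t]; it illustrates the standard LSTM cell equations, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One forward step of an LSTM memory cell (steps 1-4 above).

    W maps the gate names "f", "i", "c", "o" to weight matrices of shape
    (hidden, hidden + input); b maps them to the corresponding bias vectors.
    """
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate: what to drop from c_{t-1}
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate: what to write
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell values
    c_t = f_t * c_prev + i_t * c_tilde      # step 3: update the cell state
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # step 4: cell output
    return h_t, c_t
```

In practice a deep-learning framework would supply this cell; the sketch only makes the information flow of the equations explicit.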
Figure 7b shows the LSTM trained without class labels, meaning the SOM steps were omitted entirely in this case. For predicting the SSTA of 2015, the data from January 1985 to December 2014 were used as the training set. The data for 2015 were then added to the training set to update the model when predicting the SSTA for 2016, and the same update was applied for 2017 and 2018. At the prediction stage, we used the IMS method. IMS estimation is one of the approaches for multi-step prediction in deep neural networks: it first makes a one-step prediction, and the generated prediction is then iteratively fed back to the single-step predictor to obtain the next-step prediction. This kind of multi-step prediction is relatively easy to operate and can be applied recursively to obtain predictions of any length. In this study, we varied the lead time from 1 to 12 months with this strategy. The lead time was defined as the time elapsed between when the model was run and the forecast month.
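The IMS strategy amounts to a short feedback loop. In this sketch, `one_step_model` stands in for the trained single-step LSTM predictor; the interface is a hypothetical simplification, not the paper's code:

```python
import numpy as np

def ims_forecast(one_step_model, history, n_steps=12):
    """Iterative multi-step prediction: feed each one-step forecast back in."""
    window = list(history)                      # most recent inputs, oldest first
    preds = []
    for _ in range(n_steps):
        y = one_step_model(np.asarray(window))  # one-step-ahead prediction
        preds.append(y)
        window = window[1:] + [y]               # slide: drop oldest, append forecast
    return preds
```

Because each step consumes its own previous output, errors can accumulate with the lead time, consistent with the growth of the RMSE from the 1- to the 12-month lead time reported in the results.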


Ensemble Predicting Results
One of the advantages of ensemble learning is that it guards against the possibility that the network with the best performance on the validation set is not the one with the best performance on new test data [43]. Training and combining the networks of an ensemble instead of a single network can overcome the high output variance [43]. As mentioned in Section 2.3.1, a total of 26 class maps were generated by the SOM, so during model training the input SSTA field sequences could select among 26 different class maps. In addition, there were 9 model parameter configurations differing in input length, initial learning rate, layers, etc., for a total of 234 (26 × 9) models in each year's prediction. The aim of testing all the class maps was to find the most appropriate ones for the study area. We also tested models trained without class labels for comparison. Figure 8 shows the change in the RMSEs of the 12-step prediction for the different input class maps as the number of ensemble members ranged from 1 to 9. The years from 2015 to 2018 are denoted by different colored bars, and the label N on the abscissa represents the results of the model trained without class labels. The RMSE for all classes gradually decreased and stabilized as the number of ensemble members increased from 1 to 9. The average RMSEs for class numbers 90, 110, and 115 were below 0.7 °C in these four years, the smallest among all settings, which suggests these three class numbers are the most appropriate for the study area. Therefore, the SSTA results predicted by the models trained with these three class maps were composited together. Finally, the SSTA prediction was added to the respective SSTM to produce the SST prediction map.
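The effect of ensemble size in Figure 8 can be reproduced schematically: the members' predictions are averaged and the RMSE of the composite is computed. The function name and array interface here are our assumptions:

```python
import numpy as np

def ensemble_rmse(member_preds, truth, k):
    """RMSE of the average of the first k members' predictions.

    member_preds: array (n_members, ...) of per-member prediction fields;
    truth: array of matching trailing shape.
    """
    composite = np.mean(member_preds[:k], axis=0)   # ensemble-mean prediction
    return float(np.sqrt(np.mean((composite - truth) ** 2)))
```

Averaging cancels part of the member-to-member variance, which is why the RMSE in Figure 8 decreases and stabilizes as members are added.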

Results
Both the model trained with class labels and the model trained without class labels were used separately to predict the SSTs for the period from 2015 to 2018. The bias, standard deviation (SD), RMSE, and structural similarity index (SSIM) were calculated between the predicted monthly SST and the corresponding OSTIA monthly SST. To better evaluate the spatial pattern of the prediction results, we selected 2016 and 2018 as examples.
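The first three statistics are computed on the difference field between prediction and observation. A minimal NumPy sketch (function name and array interface assumed):

```python
import numpy as np

def error_stats(pred, obs):
    """Bias, standard deviation, and RMSE of the SST difference field."""
    diff = pred - obs
    bias = float(np.mean(diff))
    sd = float(np.std(diff))                    # spread of the errors around the bias
    rmse = float(np.sqrt(np.mean(diff ** 2)))   # note: rmse**2 == bias**2 + sd**2
    return bias, sd, rmse
```

The identity rmse² = bias² + sd² explains why the standard deviations can stay stable while the RMSE grows with the bias at longer lead times.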


Comparison between the Model Trained with and without Class Labels
Figure 9 displays the histograms of the SST difference obtained from the model trained with class labels (blue bars and text) and without class labels (red bars and text) from 2015 to 2018. Compared with the model trained without class labels, the model trained with class labels has a lower bias and a higher proportion of SST differences within ±0.5 °C. Across the different prediction lead times, the errors of the model trained with class labels are always lower than those of the model trained without class labels. The statistical differences are largest at a seven-month lead time: the bias from the model trained with class labels is −0.17 °C, whereas that of the model trained without labels is −0.33 °C.

To further explore the differences in the spatial distribution of the errors between these two models at a seven-month prediction lead time, we chose 2016 (a year with a strong El Niño) to display the SST prediction results for the four seasons. Figure 10a depicts the results obtained from the model trained with class labels, whereas Figure 10b shows those of the model trained without class labels. Compared with Figure 10b, the proportion of errors exceeding ±1 °C in Figure 10a is significantly lower.


SST Prediction with Different Lead Times from 2015 to 2018
The monthly statistics, including the biases and standard deviations as well as the average RMSE for the four years at different lead times using the proposed model, are shown in Figure 11. The monthly biases and standard deviations obtained by the model are denoted by blue points and error bars. At a one-month lead time, the prediction accuracy is the highest, with an RMSE around 0.5 °C. When the lead time increases to two months, the RMSE increases to 0.59 °C and then rises slowly, reaching a maximum at an 11-month lead time, with an RMSE around 0.66 °C. Compared with the biases of each year, the standard deviations are more stable as the lead time increases. Overall, the prediction error fluctuates up and down along the prediction time series. Generally, a relatively large bias is likely to occur in summer, such as in August 2016: compared with the other months of 2016, the standard deviation of the prediction results for August suddenly increased, and on inspecting the prediction results we found that the model underestimated the SST in the East China Sea most severely. Within the prediction period, the bias in 2015 is lower: −0.06 °C at a one-month lead time and −0.05 °C at a 12-month lead time.

Figure 12 shows the distribution of the RMSE from 2015 to 2018 at different prediction lead times over the study area. As the prediction lead time increases to 12 months, the RMSE near 0-10° N, 120-130° E is less than 0.4 °C; the RMSE near the eastern and southwestern Philippines and near the Kuroshio is less than 0.6 °C; and the RMSE near the coastal areas is around 1 °C. The prediction accuracy is lowest northeast of the Korean Peninsula. Overall, from the southwest to the northeast of the study area, the prediction accuracy gradually decreases as the lead time increases. The distribution patterns of prediction accuracy have a strong relationship with the ocean currents.
The main current systems southeast of Taiwan are the North Equatorial Current and the Kuroshio mainstream. They are relatively stable in speed and temperature, generally with no regular seasonal changes, and the prediction accuracy in this kind of area remains high even at long lead times. The waters near the mainland are dominated by coastal currents whose direction changes seasonally. In some areas, e.g., the Yangtze River estuary and the northeast of the Korean Peninsula, the cold coastal currents meet the warm currents, so the water is characterized by large annual variations in temperature. In these areas, the prediction accuracy noticeably decreases as the lead time increases.
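The "large annual variation" areas discussed above can be identified from a monthly climatology as the pixels with the largest annual SST range. The following sketch illustrates this; the climatology shapes and example values are assumptions for illustration only.

```python
# Sketch: per-pixel annual SST range from a monthly climatology, used here to
# flag the large-annual-variation coastal areas discussed in the text.
import numpy as np

def annual_range(monthly_sst):
    """monthly_sst: array (12, H, W) of climatological monthly SST (°C)."""
    return np.nanmax(monthly_sst, axis=0) - np.nanmin(monthly_sst, axis=0)

# Toy climatology: one pixel with a strong seasonal cycle, one nearly constant.
months = np.arange(12)
stable = 28.0 + 0.2 * np.sin(2 * np.pi * months / 12)     # Kuroshio-like pixel
seasonal = 15.0 + 10.0 * np.sin(2 * np.pi * months / 12)  # coastal pixel
clim = np.stack([stable, seasonal], axis=1).reshape(12, 2, 1)
rng = annual_range(clim)
print(rng.ravel())  # stable pixel ~0.4 °C, coastal pixel ~20 °C
```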

Predicted SST Distribution in Space
The distribution of predicted SSTs at 12 different lead times over the 12 months is shown in Figure 13. SST features such as the Kuroshio, the Yellow Sea Warm Current, and the cold coastal China currents in winter are clearly visible in the prediction maps and show obvious seasonal variability. We adopted the SSIM [44] to evaluate the quality of the prediction maps relative to the validation maps. The local SSIM maps and the global mean SSIM values are shown in Figure 14. Brighter pixels are associated with higher SSIM values, meaning the predicted SST patterns are more similar to the observed SST. This set of images shows that the predicted SST maps are overall consistent with the OSTIA SST maps. The global SSIM values are all above 0.97; large pattern deviations appear mainly in the northern part of the Japan Sea, where the local SSIM drops to around 0.5. The classification of the study area thus does not appear to break the continuity of the predicted SST pattern across regions. Figure 15 shows the differences between the predicted SST distributions and the observation data. Most pixel values are within ±1 °C: based on the statistics in Table 1, the percentages range from 85.17% to 98.19%, with a mean of 90.64%. The prediction anomalies beyond ±1 °C occurred mostly near the coastal areas and in the northern Japan Sea. In July and October, the area near 22° N, 130° E also has large biases, possibly related to the SST anomaly: as shown in Figure 16, the SST anomaly pattern obtained from OSTIA is similar to Figure 15, and in general the areas with a large anomaly have a relatively low prediction accuracy. The biases, standard deviations, and RMSEs between the predicted and original OSTIA SST in 2018 were also calculated and are presented in Table 1. The biases range from −0.29 to 0.35 °C, with a mean of 0.12 °C; the standard deviations range from 0.39 to 0.72 °C, with a mean of 0.56 °C.
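The two map-level scores used above can be sketched as follows. The paper uses the windowed SSIM of [44]; the single-window (global) form below is an illustrative simplification, and the assumed dynamic range `L` is our choice, not the paper's.

```python
# Sketch: a simplified global (single-window) SSIM and the percentage of pixels
# within ±1 °C, as used to compare predicted and observed SST maps.
import numpy as np

def global_ssim(x, y, L=30.0):
    """x, y: SST maps (°C); L: assumed dynamic range of the data."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def pct_within(pred, obs, tol=1.0):
    """Percentage of pixels whose absolute difference is within tol (°C)."""
    return 100.0 * np.mean(np.abs(pred - obs) <= tol)

obs = np.linspace(15, 29, 100).reshape(10, 10)
pred = obs + 0.3  # uniformly 0.3 °C warm
print(round(global_ssim(pred, obs), 4), pct_within(pred, obs))  # high SSIM, 100.0
```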


Discussion
In the present study, we applied SOMs to classify the SST data and then used the class labels obtained from the classification as auxiliary inputs for LSTM-based model training and validation. Compared with the model trained without class labels, the model trained with class labels produced more stable predictions with a lower bias. According to the China offshore ocean climate monitoring bulletin released by the National Marine Environment Forecasting Center, the China offshore SST in July 2016 (during a strong El Niño) was generally higher than in the same period of the previous year [45]. The model trained without class labels produced a wide range of underestimations in July, whereas the model trained with class labels reduced the bias considerably. A possible reason is that classifying the SSTs in the study area with the SOM distinguishes SSTs with different features; the class labels carry this regional information into the model training process, allowing the model to fit the data better and lowering the output bias.
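The use of class labels as auxiliary inputs can be sketched as below. The SOM is reduced to its inference step, a nearest-prototype lookup against a trained codebook; the prototype values and the one-hot feature layout are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: appending a SOM class label as an auxiliary input feature. A trained
# SOM assigns each SST sequence to its best-matching unit (nearest prototype);
# the class is one-hot encoded and concatenated to every time step.
import numpy as np

def assign_class(series, prototypes):
    """series: (T,) SST sequence; prototypes: (K, T) SOM codebook vectors.
    Returns the index of the best-matching unit."""
    d = np.linalg.norm(prototypes - series, axis=1)
    return int(np.argmin(d))

def with_class_label(series, prototypes, n_classes):
    """Concatenate a one-hot class label to every time step, giving the
    (T, 1 + n_classes) input used for model training."""
    k = assign_class(series, prototypes)
    onehot = np.zeros(n_classes)
    onehot[k] = 1.0
    return np.hstack([series[:, None], np.tile(onehot, (len(series), 1))])

# Toy codebook of 2 prototypes over a 4-step sequence.
protos = np.array([[20.0, 21, 22, 21], [27.0, 28, 29, 28]])
seq = np.array([26.5, 27.8, 29.2, 28.1])  # closer to the warm prototype
x = with_class_label(seq, protos, n_classes=2)
print(x.shape, x[0])  # (4, 3) [26.5 0. 1.]
```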
The monthly statistical results at different prediction lead times showed that the prediction error rises and falls in a fluctuating pattern along the prediction time series. The predicted SST had a large standard deviation in August 2016 compared with other months. We checked the error distribution and found that our model severely underestimated the SST in the East China Sea. Referring to Tan and Cai [46], in August 2016 the East China Seas experienced their warmest monthly mean SST, an extreme warming event closely linked to the Indian Ocean Dipole (IOD). Since the SST of the Indian Ocean was not included in the training data set, the model had little capability to capture this extreme warming. Another possible reason is that we used the IMS strategy for multi-step prediction, which propagates errors from earlier time steps into later time steps, although we found no trend of linear error accumulation. Finally, the spatial distribution of the RMSE suggested that the prediction accuracy is strongly related to seasonal ocean current variability: where the currents are pronounced and stable, the error is low; otherwise, the error increases noticeably as the lead time increases.
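The IMS error-propagation mechanism mentioned above can be sketched in a few lines: a one-step model is applied repeatedly, feeding each prediction back as input, so any early error is carried into later steps. The linear "model" and the 3-step input window here are stand-in assumptions.

```python
# Sketch: iterated multi-step (IMS) prediction -- a one-step model is applied
# repeatedly, with each prediction fed back into the input window.
import numpy as np

def ims_forecast(history, one_step_model, horizon):
    """history: (T,) past SST values; returns `horizon` future predictions."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        y = one_step_model(np.array(window[-3:]))  # last 3 steps as input
        preds.append(y)
        window.append(y)                           # feed prediction back in
    return np.array(preds)

# Toy one-step model: persistence with a small warming drift.
model = lambda w: w[-1] + 0.1
out = ims_forecast(np.array([20.0, 20.1, 20.2]), model, horizon=12)
print(round(out[0], 1), round(out[-1], 1))  # → 20.3 21.4
```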
Overall, compared with the model trained without class labels, the model trained with class labels produced more stable predictions and a lower bias in the periods with abnormally high SST. Stable and accurate predictions may help in observing SST trends and providing an outlook on climate change. However, the model still lacked the capability to predict extreme events. Therefore, in subsequent work, the IOD, monsoon, and current characteristics related to the SST changes in the study area will also be considered in model training, and the spatial data around each sequence will be added to improve the accuracy of long-term prediction.

Conclusions
In this study, we completed 12-month lead time SST predictions over the China Seas and their adjacent waters. The results had an RMSE of 0.5 °C at a one-month lead time and 0.66 °C at a 12-month lead time. The prediction accuracy was highest in the southeast of the study area, with an RMSE of less than 0.4 °C at a 12-month prediction lead time. These results suggest that it is feasible to place the data into a single model for spatio-temporal SST prediction, that introducing spatial feature information through classification can improve the prediction accuracy, and that combining multiple models with different parameters and structures can improve the stability of the prediction results.
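The multi-model combination named in the conclusion amounts to averaging the outputs of ensemble members that differ in structure and initialization (nine in this study). A minimal sketch follows; the member "models" are stand-ins, and only the averaging of their outputs, which cancels member-specific variance, is illustrated.

```python
# Sketch: ensemble averaging over models with different structures/initial
# parameters. Averaging reduces the variance of the individual members.
import numpy as np

def ensemble_predict(models, x):
    """Average the predictions of all ensemble members."""
    return np.mean([m(x) for m in models], axis=0)

# Toy members: the same signal with member-specific offsets that average to zero.
offsets = [-0.3, -0.2, -0.1, -0.05, 0.0, 0.05, 0.1, 0.2, 0.3]
members = [lambda x, b=b: x + b for b in offsets]
x = np.array([20.0, 21.0])
print(ensemble_predict(members, x))  # offsets cancel in the mean
```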