Spatiotemporal Model Based on Deep Learning for ENSO Forecasts

Abstract: The El Niño–Southern Oscillation (ENSO) is closely related to a series of regional extreme climate events, so robust long-term forecasting is of great significance for reducing the economic losses caused by natural disasters. Here, we regard ENSO prediction as an unsupervised spatiotemporal prediction problem and design a deep learning model called Dense Convolution-Long Short-Term Memory (DC-LSTM). To train the model more sufficiently, we also add historical simulation data to the training set. The experimental results show that DC-LSTM is well suited to predicting a single factor over a large region. During the 1994–2010 verification period, the all-season correlation skill for the Nino3.4 index of DC-LSTM is higher than that of current dynamical models and of a regression neural network, and it provides effective forecasts for lead times of up to 20 months. DC-LSTM can therefore serve as a powerful tool for predicting ENSO events.


Introduction
ENSO is a combination of El Niño and the Southern Oscillation. The former is an abnormal warming of the sea surface temperature in the tropical Pacific, and the latter describes a seesaw fluctuation in sea-level atmospheric pressure over the southern Pacific and Indian oceans. Both phenomena have an obvious periodicity, occurring once every four years on average. Bjerknes [1] argued that, although El Niño and the Southern Oscillation have different manifestations, the same physical laws lie behind them. El Niño disrupts the normal circulation of ocean currents in the South Pacific and in turn disturbs the original distribution of the global pressure and wind belts. When an ENSO event is strong, severe climate disasters occur in many regions of the world, especially in the countries along the Pacific Ocean. For example, during the ENSO event of 1986–1987, the sea surface temperature in the equatorial Pacific was about 2 °C higher than the average. The northern and central parts of Peru were plagued by rainstorms, while South Asia and northern Africa were dry and rainless, bringing huge economic losses to local agriculture and fisheries [2,3]. The international mainstream standard takes the regional sea surface temperature anomaly as the basic monitoring index. For example, the US Climate Prediction Center defines an El Niño/La Niña event as one in which the absolute value of the three-month moving average of the Niño3.4 index exceeds 0.5 and persists for at least five months. (The Niño3.4 index is the average sea surface temperature anomaly (SSTA) in the area from 170°W to 120°W in longitude and 5°N to 5°S in latitude [4].) In view of the great influence of ENSO, scientists from various countries have conducted extensive research on methods for forecasting it. Traditional ENSO prediction models mainly include statistical and dynamical models.
The statistical models are developed using correlations within historical datasets and are typically designed to minimize the RMSE. The National Centers for Environmental Prediction (NCEP), the Climate Prediction Center (CPC) and the Earth System Research Laboratory (ESRL) of NOAA have developed several statistical tools for sea surface temperature prediction in the tropical Pacific, including the Linear Inverse Model (LIM) [5], Climatology and Persistence (CLIPER) [6], the Markov Model (MKV) [7], Constructed Analogues (CA) [8] and Canonical Correlation Analysis (CCA) [9]. The dynamical models are based primarily on the physical equations of the atmosphere, land and ocean system, ranging from relatively simple physics to comprehensive, fully coupled models. As a fully coupled dynamical model, the Climate Forecast System Version 2 (CFSv2) [10] has been widely used. It was developed at the Environmental Modeling Center at NCEP and became operational in March 2011. The CFSv2 model offers hourly data with a horizontal resolution down to half a degree, greatly improving SST prediction skill.
Statistical and dynamical models have their own advantages and weaknesses in ENSO forecasting. Previous studies [11] suggest that the average skill of dynamical models [12,13] is usually better than that of statistical models [14]. However, due to uncertainty in the initial conditions, it is difficult in practice to simulate the inter-annual variation of sea surface temperature [15]. Statistical models need long histories of predictor data to discover potential relationships, but systematic observations of the tropical Pacific only began in the 1990s. Nevertheless, statistical models still have development potential because, compared with complex dynamical models, they are cheaper and easier to develop. Despite decades of effort, providing effective ENSO forecasts at lead times beyond one year with traditional methods remains an open problem.
With the advent of the big data era, artificial intelligence technology has continuously made breakthrough achievements in various fields. Some scholars [16][17][18] have tried to use machine learning methods to directly predict the Nino3.4 index. Nooteboom [17] combined the classic autoregressive integrated moving average technique with an artificial neural network to predict the ENSO index. Ham [18] utilized SST and heat content (vertically averaged oceanic temperature in the upper 300 m) anomaly maps over 0–360°E, 55°S–60°N for three consecutive months to produce skilful ENSO forecasts for lead times of up to one and a half years through a convolutional neural network and transfer learning techniques [19]. Although the final result is better than that of traditional models, only one indicator is obtained, and it is difficult to further analyze its distribution or cause. In addition, some scholars use multi-model ensemble methods to reduce the variance of the prediction. Zhang [20] used Bayesian model averaging to describe the uncertainty of the models: each model is assigned a weight according to its prediction performance, with better models receiving greater weights. The combination of three statistical and dynamical models provides a more skilful ENSO prediction than any single model. In recent years, some scholars have achieved excellent results in predicting meteorological factors, such as precipitation and temperature, using deep learning methods [21][22][23][24]. Considering that ENSO is mainly reflected in changes in sea surface temperature, forecasting ENSO phenomena is equivalent to forecasting sea surface temperature anomalies. Therefore, SST prediction methods can be used for ENSO prediction. Aguilar [25] used the Bayesian Neural Network (BNN) and Support Vector Regression (SVR), two nonlinear regression methods, to predict the tropical Pacific sea surface temperature anomalies over the next 3 to 15 months.
Zhang [26] formulated the SST forecast problem as a time series regression problem and made short-term and long-term forecasts. Although this kind of approach ultimately yields the regional sea surface temperature distribution, it breaks the relationship between adjacent points and loses the spatial features of the temperature field. In addition, the training sets of most current models contain only reanalysis data, so the amount of data is too small to fully train the models.
In this study, we use simulation data and assimilation data to alleviate the problem of an insufficient training set. We introduce a method from the unsupervised learning domain to predict ENSO and propose DC-LSTM, which predicts the sea surface temperature field itself, not only the Nino3.4 index.
Firstly, sea surface temperature prediction is represented as a spatiotemporal prediction problem rather than a time series regression task. We use the improved model, DC-LSTM, to predict the monthly average sea surface temperature anomaly distribution in the equatorial Pacific and the corresponding Nino3.4 index over the next two years. A dense convolution layer and a transposed convolution layer are used to extract the spatial features of the SSTA and to resample it, and the multi-layer Causal LSTM is used to mine the deep regularities of sea temperature change. We designed six networks with different widths and depths and found the optimal structure of the model by comparing the root mean squared error (RMSE). In addition, we also tried to integrate other meteorological elements, including T300 (vertically averaged oceanic temperature in the upper 300 m), u-wind and v-wind. Global observations of the ocean temperature distribution are available only from 1871 onwards, which means that, for each calendar month, the number of samples is less than 150. In order to obtain the amount of data necessary for model training, we used historical simulation data and reanalysis data; to some extent, they simulate the development of ENSO and provide initial training for the model. After mixing them, we use a sliding window method to randomly select sequences as the training set. Compared with traditional dynamical models and time series deep learning models, our method provides longer, more skilful forecasts.
The rest of the paper is organized as follows: Section 2 describes unsupervised spatiotemporal prediction and the DC-LSTM methodology. Section 3.1 describes the dataset used during training. Section 3.2 describes the training process of DC-LSTM. In Section 3.3, the results are presented and discussed. In Section 4, we summarize the strengths and weaknesses of DC-LSTM and outline the directions in which we will expand this work in the future.

Materials and Methods
Suppose we express the meteorological data as a dynamical system over time. The observation at any time can be represented by a tensor X ∈ R^{P×M×N}, where M × N denotes the spatial region and P is the number of meteorological factors. The prediction of the sea surface temperature anomaly can then be expressed as an unsupervised spatiotemporal prediction problem: given the length-J sequence of past observations, predict the most likely length-K sequence of future states:

X̂_{t+1}, …, X̂_{t+K} = argmax_{X_{t+1}, …, X_{t+K}} p(X_{t+1}, …, X_{t+K} | X_{t−J+1}, …, X_t)

At present, neural networks based on RNN [27] and LSTM [28] have achieved great success in the field of spatiotemporal prediction [29][30][31][32][33][34][35][36]. We chose the leading and most stable model, improved it from the perspective of feature pre-extraction, and propose Dense Convolution-LSTM (DC-LSTM). This section describes the detailed structure of DC-LSTM. As shown in Figure 1, our model contains three components: the module at the bottom is the Dense Block, which extracts the spatial features of the input at each time step; between adjacent Dense Blocks is a max pooling layer, which reduces the size of the feature maps; the middle module is an L-layer Causal LSTM, where the input of each layer is the hidden state of the previous layer. The green arrows show the transition paths of the state M, the blue arrow shows the Gradient Highway Unit, and the red arrows show the transition paths of the hidden state H; the subscript denotes time and the superscript denotes the layer. The top module is the transposed convolution layer, which restores the feature map to its original size. At each iteration, the model forecasts the next sea surface temperature field from the current meteorological factors. Figure 2 shows the prediction mechanism of the whole system. In the training phase, the input is the true value with probability p and the previous prediction with probability 1 − p.
In the initial stage, the prediction ability of the model is weak, and the value of p is 1. As the number of training iterations increases, the value of p decreases gradually. In the verification phase, p is fixed at 0. The specific structure of each module is as follows.
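The scheduled-sampling scheme described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: `rollout`, `sampling_prob` and the linear decay over 10,000 iterations are our own assumptions.

```python
import random

def rollout(model, true_seq, p):
    """Autoregressive rollout with scheduled sampling.

    At each step the input is the ground truth with probability p and the
    model's previous prediction with probability 1 - p. `model` maps one
    frame to the next frame; all names here are illustrative.
    """
    preds = []
    x = true_seq[0]
    for t in range(1, len(true_seq)):
        y = model(x)  # one-step forecast
        preds.append(y)
        # Choose the next input: ground truth (teacher forcing) vs. prediction.
        x = true_seq[t] if random.random() < p else y
    return preds

def sampling_prob(iteration, decay_iters=10_000):
    """Linear decay of p from 1 to 0 over training; p is fixed to 0 at inference."""
    return max(0.0, 1.0 - iteration / decay_iters)
```

With p = 1 the model always sees the truth (early training); with p = 0 it feeds back its own forecasts, matching the verification setting.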

Causal LSTM
In sequence-to-sequence tasks, LSTM [28] has been widely used as a special RNN model. The flow of information in LSTM is controlled by a 'gate' structure, whose values lie between 0 and 1. Among them, the forget gate f_t controls the discarding of old features, the input gate i_t controls the addition of new features, and the output gate o_t controls the output content of the hidden layer. With the help of the gate structure, LSTM has a stable and powerful capability for modeling long-range dependencies. It should be noted that, in the original LSTM, the input, output and states must be 1D vectors. The key equations are shown below, where ∘ denotes the Hadamard product:

i_t = σ(W_i [h_{t−1}, x_t] + b_i)
f_t = σ(W_f [h_{t−1}, x_t] + b_f)
o_t = σ(W_o [h_{t−1}, x_t] + b_o)
g_t = tanh(W_g [h_{t−1}, x_t] + b_g)
c_t = f_t ∘ c_{t−1} + i_t ∘ g_t
h_t = o_t ∘ tanh(c_t)

Causal-LSTM [37] is a spatiotemporal memory module based on LSTM. Each Causal-LSTM cell memorizes the spatiotemporal features of the sequence through the state C, transferred along the horizontal (temporal) direction, and the state M, transferred along the zigzag direction. Like ConvLSTM [38], Causal-LSTM extracts spatiotemporal features by replacing the fully connected operations with convolution operators in the state-to-state and input-to-state transitions. Causal-LSTM adopts a cascaded mechanism: the updated temporal memory C_t is concatenated with x_t and M_t^{k−1} to calculate the input gate i_t′ and the forget gate f_t′ of the spatial memory, and the current hidden state h_t is obtained from the concatenation of C_t and M_t. The equations of Causal LSTM are as follows:

[g_t, i_t, f_t] = [tanh, σ, σ](W_1 ∗ [x_t, h_{t−1}^k, c_{t−1}^k])
c_t^k = f_t ∘ c_{t−1}^k + i_t ∘ g_t
[g_t′, i_t′, f_t′] = [tanh, σ, σ](W_2 ∗ [x_t, c_t^k, M_t^{k−1}])
M_t^k = f_t′ ∘ tanh(W_m ∗ M_t^{k−1}) + i_t′ ∘ g_t′
o_t = tanh(W_3 ∗ [x_t, c_t^k, M_t^k])
h_t^k = o_t ∘ tanh(W_h ∗ [c_t^k, M_t^k])

where ∘ is the Hadamard product, ∗ is convolution, σ is the Sigmoid activation function, and the superscript k denotes the layer in the stacked Causal-LSTM network. The square brackets indicate the concatenation of tensors along the channel dimension. W_m and W_h are 1 × 1 convolutional filters that make the channel number of the concatenated tensor consistent with that of the hidden state h_t. For the sea surface temperature anomaly prediction task, the length of the input and output sequences is usually above 30.
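As a concrete reference for the gate equations, below is a minimal NumPy sketch of one step of the standard (1D) LSTM cell. The stacked weight layout and all names are our own choices for illustration; the convolutional Causal LSTM replaces the matrix product with convolutions on feature maps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard (1D) LSTM cell.

    W has shape (4*H, D+H): stacked weights for the input, forget and
    output gates and the candidate update; b has shape (4*H,).
    A didactic sketch of the gate equations, not the convolutional variant.
    """
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])        # input gate: how much new content to add
    f = sigmoid(z[H:2*H])      # forget gate: how much old state to keep
    o = sigmoid(z[2*H:3*H])    # output gate: how much state to expose
    g = np.tanh(z[3*H:4*H])    # candidate cell update
    c = f * c_prev + i * g     # element-wise (Hadamard) products
    h = o * np.tanh(c)
    return h, c
```

With all weights zero, every gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step, which makes the update easy to verify by hand.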
Due to the zigzag transmission route of the spatial memory M, the path is too long for early information to be transmitted to the current state. However, these features are important for the prediction of interannual SSTA. Inspired by highway layers [39], a new spatiotemporal recurrent structure named the Gradient Highway Unit (GHU) is inserted between the bottom and second layers of the Causal LSTM stack. The GHU helps the model learn to skip frames and makes the propagation of gradients through very deep networks more efficient. The equations of the GHU are shown below, where Z_t is the gradient highway hidden state and S_t is the switch gate:

P_t = tanh(W_{px} ∗ X_t + W_{pz} ∗ Z_{t−1})
S_t = σ(W_{sx} ∗ X_t + W_{sz} ∗ Z_{t−1})
Z_t = S_t ∘ P_t + (1 − S_t) ∘ Z_{t−1}
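The GHU update is simple enough to sketch directly from its equations. In this toy version, plain matrix products stand in for the convolutions used in the real model, and all parameter names are illustrative.

```python
import numpy as np

def ghu_step(x, z_prev, Wpx, Wpz, Wsx, Wsz):
    """One step of the Gradient Highway Unit (per its published equations).

    Convolutions are replaced by matrix products for brevity; in DC-LSTM
    these would be convolution operators on feature maps.
    """
    p = np.tanh(Wpx @ x + Wpz @ z_prev)                   # transformed input P_t
    s = 1.0 / (1.0 + np.exp(-(Wsx @ x + Wsz @ z_prev)))   # switch gate S_t
    # The switch gate blends new content with the previous hidden state,
    # giving gradients a direct (highway) path through time.
    return s * p + (1.0 - s) * z_prev
```

Because Z_{t−1} enters the output through the (1 − S_t) branch without any nonlinearity, the gradient can flow backwards through many time steps largely unattenuated.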

Dense Convolution Layer
LSTM-based algorithms require substantial computation during training, which limits the application of such neural networks to a certain extent. Therefore, the input needs to be compressed before prediction while retaining the key features as much as possible, and the image size needs to be restored after prediction. Inspired by [40], we use an end-to-end trainable method, Dense Convolution and Transposed Convolution, instead of the inflexible patch sampling of the original model. The basic module of the dense convolutional layer is the Dense Block, whose structure is shown in Figure 3.
In this module, each convolutional layer receives the feature maps of all previous layers as input and passes its own output to all subsequent layers. This cascading allows each layer to accept the "collective knowledge" of the layers in front of it:

x_l = H_l([x_0, x_1, …, x_{l−1}])

where H_l denotes the l-th composite convolutional function and the square brackets denote channel concatenation. Since the channel concatenation operation requires feature maps of the same size within a block, a transition layer is inserted between adjacent blocks to reduce the feature size. The transition layer is composed of a 1 × 1 convolution and max pooling. The max pooling layer keeps only the maximum value in each region of the feature maps (size = 2, stride = 2), and the 1 × 1 convolution is used to adjust the channel shape. In the down-sampling stage, the size of the feature map is reduced by a factor of 2^n, where n is the number of transition layers. The addition of the Dense Convolution layer gives DC-LSTM higher computational and storage efficiency.
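Dense connectivity is easiest to see in code. The sketch below is a toy stand-in in which each "layer" is any function producing k new channels from the running concatenation; in the actual Dense Block these would be convolutional compositions, and the channel count grows from C0 to C0 + L·k.

```python
import numpy as np

def dense_block(x, layers):
    """Dense connectivity: every layer sees all preceding feature maps.

    `x` has shape (C0, H, W); each function in `layers` maps the running
    concatenation of feature maps to k new channels, which are appended
    for the layers after it. A toy stand-in for the convolutional block.
    """
    features = [x]
    for layer in layers:
        new = layer(np.concatenate(features, axis=0))  # input: all previous outputs
        features.append(new)
    # Output channels: C0 + (number of layers) * k
    return np.concatenate(features, axis=0)
```

For example, an input with 3 channels passed through 3 layers each producing k = 2 channels yields 3 + 3·2 = 9 output channels.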

Data and Implementation
Our model uses the Simple Ocean Data Assimilation (SODA) reanalysis data [41] from 1871 to 1973 and the outputs of the Coupled Model Intercomparison Project Phase 5 (CMIP5) from 1861 to 2004 as the training set; the latter simulates the development of ENSO to a certain extent [42]. We use only a single ensemble member of each CMIP5 model to train the DC-LSTM model. We use the Global Ocean Data Assimilation System (GODAS) reanalysis data [43] from 1994 to 2010 as the validation set (Table 1). The number of ensemble members and the period of each CMIP5 model integration are given in Table 2. Since a given year in a CMIP5 model depends solely on the prescribed greenhouse gas forcing, the observational information and the CMIP5 historical simulations are independent of each other. In order to eliminate the possible influence of oceanic memory during the training period on ENSO in the verification period, we also leave a twenty-year gap between the last year of the training set and the earliest year of the verification set. The predictors of our model include the monthly average SSTA, T300, UA and VA. The resolution of the meteorological map at each time is 5° × 5°, covering the area over 55°S–60°N, 0°–360°E. In order to further increase the amount of data, we use a sliding window with a length of 20 and a step size of 1 to extract samples from each CMIP5 model output. In order to reduce the strong correlation between adjacent samples caused by the sliding window, all samples are shuffled. Finally, a training set containing 44,698 sequences is obtained.
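The sliding-window extraction and shuffling described above can be sketched as follows. Only the window length (20) and step size (1) come from the paper; the function name, seed handling and array layout are illustrative assumptions.

```python
import numpy as np

def make_windows(series, length=20, step=1, seed=100):
    """Extract fixed-length training sequences with a sliding window,
    then shuffle them to break the strong correlation between neighbours.

    `series` has shape (T, ...); with step = 1 the result has shape
    (T - length + 1, length, ...).
    """
    idx = np.arange(0, len(series) - length + 1, step)
    windows = np.stack([series[i:i + length] for i in idx])
    rng = np.random.default_rng(seed)
    rng.shuffle(windows)  # shuffles along the first (sample) axis only
    return windows
```

For a 30-month series, this yields 30 − 20 + 1 = 11 overlapping sequences; applied across all CMIP5 outputs and SODA, the same scheme produces the 44,698 training sequences reported above.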

Training
The experiments in this article comprise five groups, planned as follows:

1. Use six different hyperparameter settings to preliminarily determine the number of stacked layers and the number of hidden states of the network, finding the network scale best suited to the SSTA prediction problem while avoiding over-fitting;
2. Determine the best predictors by comparing the effect of using SSTA alone with that of additionally integrating T300, U-wind and V-wind, respectively;
3. Compare the ENSO correlation skill of different prediction regions;
4. Verify the improvement in forecasting skill obtained by using historical simulation data;
5. Compare the ENSO prediction skill of the trained DC-LSTM model with that of dynamical models and other deep learning models.

We use the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Structural Similarity (SSIM) to evaluate the model's ability to predict SST changes, and use the Pearson correlation coefficient of the three-month moving average of the Nino3.4 index (current month and the next two months) to evaluate the model's ability to predict ENSO. Low RMSE and MAE together with high correlation skill and SSIM represent good forecast ability. The indicators are defined as follows:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|
SSIM(x, y) = (2 μ_x μ_y + c_1)(2 σ_{xy} + c_2) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))
r = Σ_i (y_i − ȳ)(ŷ_i − ŷ̄) / sqrt( Σ_i (y_i − ȳ)² Σ_i (ŷ_i − ŷ̄)² )

All experiments are implemented in PyTorch, and we use Adam [44] as the optimizer with the MSE loss. The batch size is set to 16, and the initial learning rate is 10⁻³. All models are trained for 2 epochs on a computer with a single NVIDIA GTX 1080ti GPU. The model consists of 6 layers of LSTM, each with 128 hidden states. The zero-padding, stride and convolution kernel sizes in the state-to-state and input-to-state transitions are 2, 1 and 5, respectively. During the whole training process, the random seed is fixed to 100.
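A minimal sketch of the scalar evaluation metrics might look like the following (function names are our own; SSIM is omitted here, since in practice a library implementation such as scikit-image's `structural_similarity` would be used):

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error between observed and predicted fields."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    """Mean absolute error between observed and predicted fields."""
    return float(np.mean(np.abs(y - yhat)))

def corr_skill(obs, pred):
    """Pearson correlation between observed and predicted Nino3.4 series.

    The three-month moving average is assumed to be applied beforehand.
    """
    o = obs - obs.mean()
    p = pred - pred.mean()
    return float(np.sum(o * p) / np.sqrt(np.sum(o ** 2) * np.sum(p ** 2)))
```

The correlation skill is invariant to affine rescaling of the forecast, which is why it is paired with RMSE/MAE: the former measures whether the phase of ENSO is captured, the latter whether the amplitudes are right.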

Results and Discussion
Due to the relatively long training time of models using LSTM, it is difficult to find the optimal parameters through a large number of trial-and-error experiments. We therefore set some insensitive hyperparameters to fixed values (time steps = 10, kernel size = 5, stride = 1), which are widely used in other LSTM-based models [35,37,38], and fine-tune the others. Tables 3 and 4 show the RMSE, MAE and SSIM indicators of different hyperparameters for the training set and the test set. We find that, as the number of layers and hidden states increases, the prediction errors gradually decrease. When the number of layers is six and the number of hidden states is 128, the model achieves the best results on the verification set. It is also worth noting that when the number of layers reaches seven with 128 hidden states, the error continues to decline on the training set but starts to rebound on the verification set. This suggests that the fitting ability of the model can be improved by increasing the number of parameters, but fitting the training set excessively harms the generalization performance of the model; that is, over-fitting occurs. The results of the second experiment are presented in Figure 4 and show that neither the integration of T300 nor the integration of wind field information brought a lower RMSE; that is, the current DC-LSTM model is as yet unable to exploit the connections between the various meteorological factors. By calculating the correlation coefficients between the different factors, we find that the Pearson correlation coefficient between SSTA and T300 is high, reaching 0.741. This indicates a strong linear correlation between SSTA and T300, so the integration of T300 has no obvious effect on accuracy; on the contrary, the extra feature maps increase the burden on the model.
Considering that the single factor currently gives the best results, only SSTA is used as the predictor in subsequent model training. Figure 5 shows the ENSO correlation skills of the models predicting SSTA in different regions. We can see that model A, which focuses only on the Nino3.4 area, is better at short- and mid-term prediction (lead months < 7), and model B, which focuses on the Pacific area, has stronger long-term (lead months > 7) prediction ability. Figure 6 shows an example of predicting SSTA in different regions. Model A cannot react in advance to the temperature drop in the eastern part of the Nino3.4 area because of its small field of view. On the contrary, model B captures the sea temperature change from a more macroscopic perspective, so it successfully predicts the trend of sea temperature anomalies over the next 1.5 years. The results for the all-season correlation skills with different datasets are shown in Figure 7. When the model is trained using only the SODA reanalysis dataset, the correlation skill of the Nino3.4 index is above 0.5 for lead times of up to 13 months and is 0.38 at a lead time of 20 months. Figure 8 shows the all-season correlation skills of the DC-LSTM model and the dynamical models. We use the dynamical prediction systems included in the North American Multi-Model Ensemble (NMME) project as the comparison methods. The NMME dynamical prediction systems forecast the entire climate state (ocean, atmosphere, land and sea ice) rather than a regional sea surface temperature, so ENSO forecasts beyond 11 months are not provided. In the first 11 months, DC-LSTM is one of the two models with the best forecasting skills. The all-season correlation skill of the DC-LSTM model stays higher than 0.5 out to 20 months. Table 5 shows that the prediction error of DC-LSTM at a lead time of 20 months is lower than that of the CNN.
Figure 9 shows the Nino3.4 index of DC-LSTM and of the CNN, a leading regression deep learning model [18], for the December-January-February (DJF) season at a lead time of 20 months. We find that the CNN and DC-LSTM can approximately predict the trend of the Nino3.4 index, but for some extreme cases (such as 1997 and 2009) the predictions are too conservative. Table 6 shows the number of parameters and the time needed to process one sequence. Due to the complex gate mechanism of LSTM, DC-LSTM requires much more computation than the CNN. However, considering that ENSO forecasts are issued relatively infrequently, this cost is still acceptable, and it will decrease as hardware and software continue to advance. Finally, we consider that, owing to its powerful spatiotemporal feature extraction capability and sufficient training on a large amount of simulation data, DC-LSTM can provide effective predictions of ENSO events 1.5 years in advance from the perspective of unsupervised learning and spatiotemporal prediction.

Conclusions
In this article, we treat the ENSO prediction problem as an unsupervised spatiotemporal prediction problem and design the DC-LSTM model. By analyzing the prediction results with different hyperparameters, regions and predictors, we determined the final design. Compared with dynamical methods and regression deep learning methods, DC-LSTM provides longer effective forecasts. Given the versatility of DC-LSTM, in future work we will also try to apply it beyond ENSO, to the prediction of radar echo, precipitation, humidity and other meteorological factors.

Data Availability Statement: Data related to this paper can be downloaded from: SODA: https://climatedataguide.ucar.edu/climate-data/soda-simple-ocean-data-assimilation (accessed on 1 May 2021); GODAS: https://www.esrl.noaa.gov/psd/data/gridded/data.godas.html (accessed on 1 May 2021); NMME: https://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/ (accessed on 1 May 2021); CMIP5: https://esgf-node.llnl.gov/projects/cmip5/ (accessed on 1 May 2021).

Conflicts of Interest:
The authors declare no conflict of interest.