Forecasting the June Ridge Line of the Western Paciﬁc Subtropical High with a Machine Learning Method

: The ridge line of the western Paciﬁc subtropical high (WPSHRL) plays an important role in determining the shift in the summer rain belt in eastern China. In this study, we developed a forecast system for the June WPSHRL index based on the latest autumn and winter sea surface temperature (SST). Considering the adverse condition created by the small observed sample size, a very simple neural network (NN) model was selected to extract the non-linear relationship between input predictors (SST) and target predictands (WPSHRL) in the forecast system. In addition, some techniques were used to deal with the small sample size, enhance the stabilization of the forecast skills, and analyze the interpretability of the forecast system. The forecast experiments showed that the linear correlation coefﬁcient between the predictions from the forecast system and their corresponding observations was around 0.6, and about three-ﬁfths of the observed abnormal years (the years with an obviously high or low WPSHRL index) were successfully predicted. Furthermore, sensitivity experiments showed that the forecast system is relatively stable in terms of forecast skill. The above results suggest that the forecast system would be valuable in real-life applications.


Introduction
The ridge line of the western Pacific subtropical high (WPSHRL) plays an important role in determining the shift in the summer rain belt in eastern China [1][2][3][4]. The prediction of the summer WPSHRL has great application value, so it has been widely studied in the field of climatology [5][6][7][8][9][10][11][12]. Both traditional statistical methods and numerical simulation methods have shown that the monthly (i.e., at the lead time of one month) forecast skill (evaluated by the correlation coefficient) can reach 0.5 or higher [13,14]. However, there is still no study that has reported an acceptable forecast skill (>0.5) at lead times longer than three months (i.e., one season), even though the prediction with a lead time of three months has more application value than the prediction with a lead time of one month [15]. Therefore, the major purpose of this study was to develop an acceptable forecast system for the June WPSHRL index with a lead time of three months. To achieve this goal, we tried several new methods (e.g., machine learning).
In recent years, machine learning methods have been used in climate prediction [16][17][18], and many successes in the prediction of sea surface temperature indexes beyond one year (>12 months) have been reported [19,20]. These successful applications of machine learning methods for predicting a climate phenomenon support this research work. In other words, this is the reason why the forecast system in this study was developed based on machine learning methods.
Machine learning methods with non-linear statistical models can extract the non-linear relationship between input predictors and target predictands, which cannot be handled with traditional linear statistical methods (e.g., multivariate linear regression) [21][22][23][24].
Because non-linear statistical models (e.g., neural network approaches) are usually much more complex (i.e., too many undetermined parameters), a sufficient number of samples is essential to train the model [25,26]. In terms of climate forecast, the observation period is usually too short to achieve proper training [17,19]. For example, there are only 61 samples from the observations for the June WPSHRL from 1961 to 2021. Considering this limitation, we decided to use a simple neural network (NN) model (dozens of parameters) [21,22]. Furthermore, developing a stable and reliable NN-based forecast system is a multifaceted problem that requires a deep understanding of the technical aspects of the training process and details of NN architecture [27][28][29][30][31][32][33]. In this study, several techniques were used to extend the samples of training data, enhance the stabilization of the forecast skills, and analyze the interpretability of the forecast system.
The aim of this study is to design a forecast system for the June WPSHRL index based on the latest autumn and winter SST (i.e., lead time of three months). The paper is organized as follows: the observed data and the NN-based forecast system are described in Section 2, followed by the application introduction and performance analysis in Section 3; finally, the discussion is presented in Section 4, and the conclusions are provided in Section 5.

Data
Global SST data provided by the Physical Sciences Laboratory (PSL) were used in this study. The dataset is stored on a 5 • × 5 • grid and consists of monthly anomalies from 1856 to the present [34]. The monthly WPSHRL indexes (integer degree of latitude) from 1961 to 2016 were downloaded from the National Climate Center of China (NCC). This dataset has been widely used for climatic prediction in China [4,10,14,35,36]. The WPSHRL index (manual statistical data) is defined as the averaged 0 line of ∂gh/∂y in the area of (10 • N~45 • N, 110 • E~150 • E) based on the monthly mean 500 hPa geopotential height (gh) field. Further details about the calculation method can be found in the study by Liu [36]. Because the NCC stopped providing this WPSHRL data after 2016, we artificially figured out the monthly WPSHRL index from 2017 to 2021 based on the ERA5 reanalysis data, following the above definition.
The SST shows an obvious decadal-scale warming trend due to global warming (not shown). This decadal-scale trend is removed by the 21-point (i.e., year) moving average method. The WPSHRL indexes are normalized to the range of −1 to 1. It is noteworthy that only the Pacific and Indian Ocean (60 • S to 60 • N, 30 • E to 90 • W) season-average (only autumn and winter) SST are used as input predictors. In order to increase the number of training data samples, we artificially produced some samples based on the composite analysis. Figure 1 presents the composites in the low (the lowest~1/4) and high (the highest~1/4) WPSHRL index years. This is consistent with previous studies in that the low composite and high composite are almost opposite, and the WPSHRL index is apt to be low/high if the pattern of prophase SST anomalies is similar to the low/high composite [37][38][39][40][41]. Furthermore, the WPSHRL index is assumed to be enhanced/weakened if the composited prophase SST anomalies are enhanced/weakened [35,42,43]. These conclusions were used as climatological background knowledge in this study. Based on this background knowledge, 20 extended samples were produced by magnifying or reducing the low composite data (SST field and corresponding WPSHRL index, upper panel of Figure 1). For example, both the SST field and corresponding WPSHRL index (i.e., −0.75) multiplied by 1.1 (or other values around 1.0) can produce an extended sample. Similarly, another 20 extended samples were produced by magnifying or reducing the high composite data (lower panel of Figure 1). Therefore, a total of 40 extended samples were used in the forecasting system.  Figure 2 shows the architecture of the NN-based forecast system for the June WPSHRL indexes. The 20 leading empirical orthogonal function (EOF) eigenvectors of observed autumn and winter SST data from 1960 to 2021 are first stored as background knowledge. The N (N ≤ 20, tunable parameter) leading EOF principal components of the latest SST data are used as predictors (i.e., the input layer of the NN model). There are two hidden layers with tunable neuron numbers (h1 and h2). The model architecture is constrained by N, h1, and h2, which are called the model's hyperparameters [25]. How to obtain the optimal architecture hyperparameters (i.e., N, h1, and h2) is discussed in the next section. leading EOF eigenvectors of observed SON and DJF SST data from 1960 to 2021 are stored as background knowledge. The N (N ≤ 20, tunable parameter) leading principal components of input SST (a certain year) obtained by calculating the pattern projection coefficients onto the EOF eigenvectors are used as predictors. The number of hidden layers is two, and the number of neurons at each hidden layer (h1, h2) is also a tunable parameter. The activation function is a hyperbolic tangent function.  Figure 2 shows the architecture of the NN-based forecast system for the June WPSHRL indexes. The 20 leading empirical orthogonal function (EOF) eigenvectors of observed autumn and winter SST data from 1960 to 2021 are first stored as background knowledge. The N (N ≤ 20, tunable parameter) leading EOF principal components of the latest SST data are used as predictors (i.e., the input layer of the NN model). There are two hidden layers with tunable neuron numbers (h1 and h2). The model architecture is constrained by N, h1, and h2, which are called the model's hyperparameters [25]. How to obtain the optimal architecture hyperparameters (i.e., N, h1, and h2) is discussed in the next section.  Figure 2 shows the architecture of the NN-based forecast system for the June WPSHRL indexes. The 20 leading empirical orthogonal function (EOF) eigenvectors of observed autumn and winter SST data from 1960 to 2021 are first stored as background knowledge. The N (N ≤ 20, tunable parameter) leading EOF principal components of the latest SST data are used as predictors (i.e., the input layer of the NN model). There are two hidden layers with tunable neuron numbers (h1 and h2). The model architecture is constrained by N, h1, and h2, which are called the model's hyperparameters [25]. How to obtain the optimal architecture hyperparameters (i.e., N, h1, and h2) is discussed in the next section.  The NN model used here is the standard feed-forward multi-layer perceptron NN model. In order to minimize the root-mean-square-error (RMSE) between the predictions (i.e., outputs) and the targets (i.e., the observed WPSHRL indexes), the model's parameters (i.e., the weights and biases associated with each layer) are adjusted through gradient descent [44]. During every update of the model's parameters (i.e., the training phase), one-third of the training set (i.e., the batch size) is used to calculate the gradients. After every update with the training set (only the observed samples), the artificially-produced samples are also used for training the model. The initial biases are random values in the range from −0.01 to 0.01. The initial weights (i.e., the undetermined parameters of the NN model) are normally distributed, and their variance is regularized by the square root of the layer neuron number [45]. The initial learning rate is set to 0.001, and it can be adaptively adjusted during model training. As the training goes on, the model can be adjusted to best fit the given training set, which causes an overfitting problem. Usually, some samples are reserved as the test set to evaluate the overfitting problem. The training is stopped once the error (i.e., RMSE) from the test set begins to increase. Unfortunately, this common approach is not suited for this study because the total sample size is too small. Here, there is no test set. The NN model would stop training once the training error (i.e., the RMSE from the training set) is less than a threshold value (hereafter referred to as the stopping threshold, ST). How to set this hyperparameter (i.e., ST) is discussed in the next section.

Forecast System
The NN model is trained by the training set (i.e., all or part of the already existing samples). However, the NN model is actually estimated by the performance on new unseen examples (i.e., generalization problem). Usually, some samples are reserved as the validation set to evaluate the model's forecast skill. Considering the small sample size in this study, the leave-one-out cross validation approach was used to quantify the forecast skill of the NN model [25,46]. Just as its name implies, one subset is chosen as the validation set and the rest of the subsets are used as the training set, and so on until all subsets have acted as the validation set. More details about the leave-one-out approach can be found in these studies [25,47]. In the forecast system developed in this study, the whole observed dataset (e.g., 1961~2021) is divided into 61 subsets, each containing one sample. Firstly, the model is trained based on the second to the sixty-first samples and the first prediction (i.e., the predicted 1961 WPSHRL index) is calculated with this model using the SST data from the first sample. Subsequently, the model is trained again based on the first and third to the sixty-first samples and the second prediction (i.e., the predicted 1962 WPSHRL index) is calculated with this new model using the SST data from the second sample. This procedure is repeated until the sixty-first prediction (i.e., the predicted 2021 WPSHRL index) has been calculated. Finally, the forecast skill of the NN model is evaluated by the Pearson linear correlation coefficient (Cor) between the predictions from 1961 to 2021 and their corresponding observations. Furthermore, since more attention should be paid to the obvious lower or higher years (i.e., abnormal years), the number of abnormal years, which are successfully predicted (Nyears), is also used to evaluate the model forecast skill. It is noteworthy that the Cor and Nyears actually represent the forecast skill of the model framework (i.e., the model's hyperparameters are fixed, but the model's parameters are undetermined) because the predictions from 1961 to 2021 are calculated by different NN models (i.e., the weights and biases of the model are different although the model framework is fixed).
The performance of a machine learning model might be sensitive to the random seed for weight initialization [48]. Our preliminary analysis shows that some good random seeds can often lead to a relatively higher forecast skill, and the model training often fails to reach the stopping threshold when starting with some bad random seeds. Aside from this random seed, there is another random seed that is used for the order of training data. In contrast to the weight initialization random seed, the impact of the training order random seed is negligible. In other words, the forecast skill of a given model framework (i.e., all hyperparameters have been fixed) is mainly influenced by the weight initialization random seed. Therefore, selecting good seeds is an important step in the model training process.
In this step, a large number (e.g., 200 in this study) of weight initialization random seeds were carried out, and each seed corresponds to a model training forecast skill estimated by the leave-one-out approach. The reason for the name, "training forecast skill" is because the forecast skill is estimated based on the training set (i.e., during the training phase). In order to use the leave-one-out approach, one sample from the training set is chosen as the validation set and the other samples are used to train the model, and so on, until all samples from the training set have acted as the validation set. The best 10 random seeds are selected based on the training forecast skill (only the Cor). After that, the NN model would be trained again using these 10 good seeds (one seed corresponds to one NN model). In this step, the leave-one-out approach is not needed; thus, all samples from the training set are used to train the model. Finally, there are 10 NN models (the model's parameters are fixed) in the forecast system, the actual prediction is the average of the 10 corresponding predictions (i.e., the ensemble-mean prediction). The standard deviation calculated from the difference in each prediction for 10 predictions is used to indicate the uncertainty of the ensemble-mean prediction to some extent. The main goal of ensemble prediction is to enhance the robustness of the forecast skills. In the rest of this article, the forecast skill usually denotes that which is calculated based on the ensemble-mean prediction. It is important to note that more details about the forecast system can be found in the code, which has been archived in a public repository.

Setting Hyperparameters
All the hyperparameters must be set at the start of model training. Most of the time, setting hyperparameters requires an expensive guess-and-check process [25]. Similar to selecting good seeds, setting hyperparameters is also based on the model training forecast skill estimated by the leave-one-out approach. In this step, all observed samples are reserved as a training set, and the training forecast skill used for setting the hyperparameters is calculated based on the ensemble-mean prediction from the best 10 random seeds.
In this study, there is one hyperparameter for the model training (i.e., ST) and three hyperparameters for the model architecture (i.e., N, h1, and h2). The number of possible combinations is huge. Fortunately, the model performance is more sensitive to the hyperparameter of the input layer (i.e., N) as compared to other hyperparameters. Therefore, N should be set first. Table 1 lists the training forecast skills (Cor, Nyears) under different model architectures. It is obvious that the optimal N is 13 at ST = 0.36. We note that 13 is also the optimal value of N at other STs (e.g., 0.33 and 0.40, not shown). It seems that a small N (N < 13) cannot provide enough useful information, and a large N (N > 13) may cause a more serious overfitting problem.  [25]. Based on the above analysis, the Cor = 0.67 and Nyears = 18 at ST = 0.36, h1 = 4 and h2 = 3 was taken as the best forecast skill, although the model framework (ST = 0.36, h1 = 4 and h2 = 4) produces the highest Nyears (Nyears = 19). In this study, N = 13, h1 = 4, h2 = 3 and ST = 0.36 were chosen as the optimal hyperparameters when the size of the training set is around 61. If the size of training set is far from 61, the optimal hyperparameters should be reset. Here, the N is set to 13.
The training forecast skills analyzed above do not indicate the actual forecast skill of the forecast system. The training forecast skills are only used for setting hyperparameters and selecting random seeds. After that, the NN model is trained again using the whole training set. To some extent, the training forecast skill might be considered as the maximum potential forecast skill of the forecast system.

Forecast Skill
Firstly, the forecast skill of the forecast system was analyzed based on all observed samples (i.e., 1961~2021). The leave-one-out approach was used twice; once to calculate the training forecast skills during the model training phase (the number of available samples is 60), and again to calculate the actual forecast skill used for estimating the forecast system (the number of available samples is 61). Three forecast experiments, named Exp1, Exp2, and Exp3, were carried out with different selection ranges of weight initialization random seeds (e.g., 0~199, 400~599, and 1000~1199). The three forecast experiments (Exp1, Exp2, and Exp3) were run twice using the same code, named Round1 and Round2. The experimental results showed that the forecast system has acceptable predictability in most years ( Figure 3). However, the predictability is always bad in a few years. For example, in 2004, the observed WPSHRL index is very high, although the spatial pattern of the prophase SST anomalies (not shown) is more similar to the composites of SST anomalies in low index years (Figure 1). In other words, 2004 is contrary to the background knowledge from the composite analysis (i.e., Figure 1). These years are called special years in this paper. It is very difficult to improve the predictability in these special years. Taken overall, the forecast skill of the forecast system is acceptable. The Cor is higher than 0.6 in all experiments. The number of abnormal years from observation is 30. The Nyears value is 17 or 18, which is around three-fifths of the observed abnormal years. This suggests that the forecast system is applicable to both normal years and abnormal years. Furthermore, the robustness of the forecast system was also analyzed. The prediction uncertainty, which is the standard deviation from the 10 ensemble members, is usually less than 0.2. As compared to the ensemble-mean value (i.e., the final prediction, which is in the range of −1 to 1), the uncertainty range (<0.2) is usually not significant. The difference in Cor and Nyears among the Exp1, Exp2, and Exp3 experiments is less than 0.02 and 2, respectively.
The impact of the range of random seed selection is relatively small and negligible. The forecast skills from these three experiments are almost the same in Round1 ((0.603, 17), (0.619, 17), and (0.604, 18), respectively) and Round2 ((0.606, 17), (0.619, 17), and (0.604, 18), respectively). In terms of forecast skill, the forecast system is relatively stable, although the NN model's parameters are different in these forecast experiments.  The forecast skill was also analyzed by applying the forecast system ( Figure 4). In the forward forecast experiments, all available samples were used as a training set. Assuming it was 1 March 1998, the June 1998 WPSHRL was forecast using the NN models trained by the observed samples from 1961 to 1997. The backward forecast experiments were carried out under the assumption that the order of years is reversed. In these forecast experiments, the optimal hyperparameters of the NN model need to be reset based on the size of the corresponding training set, and the predicted WPSHRL indexes return back to the integer degree of latitude. The Cor is around 0.66 in the backward experiments from 1990 to 1961. In the forward experiments from 1991 to 2021, the Cor is obviously decreased due to a few special years (e.g., 2004, 2014). The Cor from the forward experiments without 2004 and 2014 is around 0.63 (not shown in Figure 4), which is close to that from the backward experiments. The reason that the forecast skill (evaluated by the Cor) for the June WPSHRL is decreased after 2000 might be that the special years occur more frequently. The numbers of observed abnormal years from the backward experiment period and forward experiment period are 12 and 15, respectively. Note that, the year with a WPSHRL higher than 22° N is taken as an abnormal year. In the backward experiment period, there are too many years (7) with the WPSHRL at the threshold value (22° N). This is the reason why the number of abnormal years from the backward experiment period is 3 less (i.e.,  than that from the forward experiment period. The number of Nyears is 10 (twothirds of the observed abnormal years) in the forward experiment period. Taken overall (i.e., 1961~2021), Nyear is 17 or 18, which is around three-fifths of the observed abnormal years. The forecast skill (both Cor and Nyears) estimated by this approach (Figure 4) is similar to that estimated by the leave-one-out approach (Figure 3). The forecast skill was also analyzed by applying the forecast system ( Figure 4). In the forward forecast experiments, all available samples were used as a training set. Assuming it was 1 March 1998, the June 1998 WPSHRL was forecast using the NN models trained by the observed samples from 1961 to 1997. The backward forecast experiments were carried out under the assumption that the order of years is reversed. In these forecast experiments, the optimal hyperparameters of the NN model need to be reset based on the size of the corresponding training set, and the predicted WPSHRL indexes return back to the integer degree of latitude. The Cor is around 0.66 in the backward experiments from 1990 to 1961. In the forward experiments from 1991 to 2021, the Cor is obviously decreased due to a few special years (e.g., 2004, 2014). The Cor from the forward experiments without 2004 and 2014 is around 0.63 (not shown in Figure 4), which is close to that from the backward experiments. The reason that the forecast skill (evaluated by the Cor) for the June WPSHRL is decreased after 2000 might be that the special years occur more frequently. The numbers of observed abnormal years from the backward experiment period and forward experiment period are 12 and 15, respectively. Note that, the year with a WPSHRL higher than 22 • N is taken as an abnormal year. In the backward experiment period, there are too many years (7) with the WPSHRL at the threshold value (22 • N). This is the reason why the number of abnormal years from the backward experiment period is 3 less (i.e.,  than that from the forward experiment period. The number of Nyears is 10 (two-thirds of the observed abnormal years) in the forward experiment period. Taken overall (i.e., 1961~2021), Nyear is 17 or 18, which is around three-fifths of the observed abnormal years. The forecast skill (both Cor and Nyears) estimated by this approach (Figure 4) is similar to that estimated by the leave-one-out approach (Figure 3).

Interpretability of the Forecast System
It is difficult to illustrate the physical interpretability of an NN model with several hidden layers. However, the perturbation-based method can be used to check the relation between input predictors and target predictands from the NN model [33,49]. Here, the interpretability of the forecast system was analyzed by the perturbation-based method. Figure 5 shows the sensitivity of the predicted WPSHRL to the prophase SST (i.e., WPSHRL/ SST). Here, the sensitivities were calculated based on the NN models (with given weights and biases) used in Figure 3 (Exp1 from Round1). The relationship between the SST and WPSHRL from the observation can be shown by the linear correlation ( Figure  5, upper panel). Unlike the linear correlation from the observation, the sensitivity from the non-linear statistical model (i.e., the NN model) in one region (e.g., the North Pacific) is influenced by the SST in another region (e.g., the East Pacific). Therefore, the sensitivity from the NN model is dependent on the SST field of the year. The sensitivities of four years (1962, 1974, 1983 and 2004) are shown in Figure 5. The observed WPSHRL indexes from 1962, 1974, 1983, and 2004 are −0.15 (normal), 0.70 (high), −0.58 (low), and 1.00 (high), respectively. Their corresponding predicted WPSHRL indexes are −0.04 (normal), 0.51 (high), −0.56 (low), and −0.38 (normal), respectively. The years 1962, 1974, and 1983 have good predictions, whereas 2004 has a bad prediction. The spatial pattern of these sensitivities from the NN model is generally in agreement with the linear correlation from observation, especially for the years with good predictions. This is consistent with previous studies in that the SST anomalies of prophase autumn and winter might continue into summer [38,50], and increasing summer SST in the western Pacific warm pool would lead to a higher (more northerly) WPSHRL index [51][52][53]. The differences in sensitivity among these good years are not pronounced although the input SST fields (not shown) and predicted WPSHRL indexes (normal, high, and low) are obviously different. As compared to the sensitivities from these good years, the sensitivities from the bad year (i.e., 2004, the year with bad prediction) are relatively weak in most regions and the spatial pattern is very different from the linear correlation. These characteristics can also be found in most bad years (not shown). It seems necessary to check the sensitivity from the NN model, which is calculated based on the input SST data. The weak sensitivity and a bad spatial pattern usually indicate that the corresponding prediction is of low reliability.

Interpretability of the Forecast System
It is difficult to illustrate the physical interpretability of an NN model with several hidden layers. However, the perturbation-based method can be used to check the relation between input predictors and target predictands from the NN model [33,49]. Here, the interpretability of the forecast system was analyzed by the perturbation-based method. Figure 5 shows the sensitivity of the predicted WPSHRL to the prophase SST (i.e., ∂WPSHRL/∂SST). Here, the sensitivities were calculated based on the NN models (with given weights and biases) used in Figure 3 (Exp1 from Round1). The relationship between the SST and WPSHRL from the observation can be shown by the linear correlation ( Figure 5, upper panel). Unlike the linear correlation from the observation, the sensitivity from the non-linear statistical model (i.e., the NN model) in one region (e.g., the North Pacific) is influenced by the SST in another region (e.g., the East Pacific). Therefore, the sensitivity from the NN model is dependent on the SST field of the year. The sensitivities of four years (1962, 1974, 1983 and 2004) are shown in Figure 5. The observed WPSHRL indexes from 1962, 1974, 1983, and 2004 are −0.15 (normal), 0.70 (high), −0.58 (low), and 1.00 (high), respectively. Their corresponding predicted WPSHRL indexes are −0.04 (normal), 0.51 (high), −0.56 (low), and −0.38 (normal), respectively. The years 1962, 1974, and 1983 have good predictions, whereas 2004 has a bad prediction. The spatial pattern of these sensitivities from the NN model is generally in agreement with the linear correlation from observation, especially for the years with good predictions. This is consistent with previous studies in that the SST anomalies of prophase autumn and winter might continue into summer [38,50], and increasing summer SST in the western Pacific warm pool would lead to a higher (more northerly) WPSHRL index [51][52][53]. The differences in sensitivity among these good years are not pronounced although the input SST fields (not shown) and predicted WPSHRL indexes (normal, high, and low) are obviously different. As compared to the sensitivities from these good years, the sensitivities from the bad year (i.e., 2004, the year with bad prediction) are relatively weak in most regions and the spatial pattern is very different from the linear correlation. These characteristics can also be found in most bad years (not shown). It seems necessary to check the sensitivity from the NN model, which is calculated based on the input SST data. The weak sensitivity and a bad spatial pattern usually indicate that the corresponding prediction is of low reliability.

Discussion
It is useful to discuss the experiences of climate prediction using machine learning methods from this study and previous researches. Unlike other application areas, there are two unique aspects in regard to climate prediction. One is that the size of the observed samples is very small (adverse condition), the other is that previous studies have provided a great deal of useful background knowledge (favorable condition).
Previous studies have provided some techniques to deal with the adverse condition [19,[21][22][23]53]. For instance, Ham et al. (2019) used transfer learning to train a convolutional neural network (CNN), first on historical climate model simulations, and subsequently on reanalysis of SST data [19]. In addition, Wu et al. (2006) use a simple NN model to forecast the Pacific sea surface temperatures [23]. The NN model with dozens of parameters has fewer demands in regard to sample size as compared to the CNN model with thousands of parameters [25]. Furthermore, the transfer learning mothed could not be used in this study because climate models generally cannot forecast WPSHRL at a lead time of three months [14]. Therefore, the simple NN model was chosen in this study. It is also necessary to point out that one novelty of this study is the 40 extended samples, which were produced based on the composite analysis, following the idea from sample-extending technology used in other areas [54]. The 40 extended samples were very useful, and this improved the forecast skill by about 5%. This suggests that sample-extending technology has great potential in climate prediction.
Previous studies have also provided information on how to use the favorable condition (i.e., climatological background knowledge). For instance, the interpretability of the forecast model is checked by climatological background knowledge [19,53]. In this study, the interpretability of the NN model was also checked by climatological background knowledge ( Figure 5). Unlike these previous studies, the recently reported perturbationbased method [33,49] was used in this study. This method has the advantages of simple coding and easy understanding because the interpretability only focuses on the relation between input predictors and target predictands. Furthermore, the climatological background knowledge was also used in other ways, such as the composite analysis ( Figure  1), the selection of the regions of SST data as predictors (Figure 2), and the principal component extraction (Figure 2). Using a simple machine learning model, the forecast skill with the Pacific and Indian Ocean SST is obviously better than that with the global SST.

Discussion
It is useful to discuss the experiences of climate prediction using machine learning methods from this study and previous researches. Unlike other application areas, there are two unique aspects in regard to climate prediction. One is that the size of the observed samples is very small (adverse condition), the other is that previous studies have provided a great deal of useful background knowledge (favorable condition).
Previous studies have provided some techniques to deal with the adverse condition [19,[21][22][23]53]. For instance, Ham et al. (2019) used transfer learning to train a convolutional neural network (CNN), first on historical climate model simulations, and subsequently on reanalysis of SST data [19]. In addition, Wu et al. (2006) use a simple NN model to forecast the Pacific sea surface temperatures [23]. The NN model with dozens of parameters has fewer demands in regard to sample size as compared to the CNN model with thousands of parameters [25]. Furthermore, the transfer learning mothed could not be used in this study because climate models generally cannot forecast WPSHRL at a lead time of three months [14]. Therefore, the simple NN model was chosen in this study. It is also necessary to point out that one novelty of this study is the 40 extended samples, which were produced based on the composite analysis, following the idea from sample-extending technology used in other areas [54]. The 40 extended samples were very useful, and this improved the forecast skill by about 5%. This suggests that sample-extending technology has great potential in climate prediction.
Previous studies have also provided information on how to use the favorable condition (i.e., climatological background knowledge). For instance, the interpretability of the forecast model is checked by climatological background knowledge [19,53]. In this study, the interpretability of the NN model was also checked by climatological background knowledge ( Figure 5). Unlike these previous studies, the recently reported perturbationbased method [33,49] was used in this study. This method has the advantages of simple coding and easy understanding because the interpretability only focuses on the relation between input predictors and target predictands. Furthermore, the climatological background knowledge was also used in other ways, such as the composite analysis (Figure 1), the selection of the regions of SST data as predictors (Figure 2), and the principal component extraction (Figure 2). Using a simple machine learning model, the forecast skill with the Pacific and Indian Ocean SST is obviously better than that with the global SST.

Conclusions
In this study, we developed a forecast system for the June WPSHRL index based on the latest autumn and winter SST. Considering the limited number of observed samples, a simple NN model was used to extract the non-linear relationship between input predictors and target predictands in the forecast system. The linear correlation coefficient between the predictions from the forecast system and their corresponding observations was around 0.6, and about three-fifths of the observed abnormal years were successfully predicted. Furthermore, sensitivity experiments (i.e., Exp1-3, Round1-2) show that the forecast system is relatively stable in terms of forecast skill. Therefore, we think that the forecast system would be valuable in real-life applications. The success of this forecast system suggests that the prediction of the atmospheric circulation indexes beyond one season might be improved by non-linear statistical methods (e.g., the NN model).
The practicability of the July and August WPSHRL index was also tested using this forecast system. The input predictors were still the latest autumn and winter SST. Although the forecast skill for the July WPSHRL was significantly reduced as compared to that for the June WPSHRL, the forecast skill was still acceptable (the Cor was around 0.40). However, the forecast skill for the August WPSHRL was very low (the Cor did not reach 0.3). It is well known that the formation mechanism for the WPSHRL index is different in different months [55,56]. The forecast system introduced in this study is designed only for the prediction of the June WPSHRL with a lead time of three months. For instance, the predictors (i.e., the regional and seasonal SST data) are first selected according to background knowledge (i.e., the strong linear correlation between the prophase SST and the June WPSHRL index). Generally speaking, the background knowledge from previous studies is useful for improving the performance of the forecast system. In order to make rational use of the background knowledge, the forecast system should be designed according to the month and lead time.