K-Means and C4.5 Decision Tree Based Prediction of Long-Term Precipitation Variability in the Poyang Lake Basin, China

Applying machine learning algorithms alongside Earth System Models in the atmospheric sciences has the potential to improve prediction, forecasting, and the reconstruction of missing data. In the current study, two machine learning techniques, K-Means clustering and the C4.5 decision tree algorithm, are combined to separate observed precipitation into clusters and to classify the associated large-scale circulation indices. Observed precipitation from the Chinese Meteorological Agency (CMA) during 1961–2016 for 83 stations in the Poyang Lake basin (PLB) is used. K-Means yields two precipitation clusters that split the PLB into a northern and a southern cluster, with a silhouette coefficient of ~0.5. The leading cluster (C1) contains 48 stations, accounting for 58% of the stations, while Cluster 2 (C2) contains 35 stations, accounting for 42%. The interannual variability in precipitation differs significantly between the two clusters. The C4.5 decision tree is employed to identify which large-scale atmospheric indices from the National Climate Center (NCC) during the preceding spring season act as predictors for each cluster. C1 precipitation was linked with the location and intensity of the subtropical ridgeline over Northern Africa, whereas C2 precipitation was associated with the Atlantic-European Polar Vortex Area Index. The precipitation anomalies further validated the results of both algorithms. The findings are in accordance with previous studies conducted globally and hence support the application of machine learning techniques in atmospheric science on sub-regional and sub-seasonal scales. Future studies should explore the dynamics of the indicators derived from K-Means and C4.5 for a better assessment on a regional scale. This machine-learning-based research may offer a new approach to climate forecasting.


Introduction
China's largest freshwater lake, Poyang Lake, is located in the middle-lower Yangtze River Basin in a humid subtropical monsoon climate. Changes in the East Asian Monsoon have led to heterogeneous spatiotemporal changes in precipitation, with obvious seasonal and regional differences in the Poyang Lake basin [1]. Changes in precipitation have a profound impact on the conservation of the ecological environment, the restoration and conservation of biodiversity [2], water resource management and monitoring [3], and flood control and mitigation [4]. Thus, more accurate prediction of the precipitation pattern in Poyang Lake has become a research hotspot in recent decades. multiscale precipitation variation in different climate, land, and ocean masses [31,32]. The decision tree algorithm is an important technique in data mining and includes several methods: the C4.5 algorithm, the CART algorithm, the PUBLIC algorithm, the SLIQ algorithm, and so on. Among them, the C4.5 algorithm has been widely applied in the meteorological field due to its simple calculation, high data processing efficiency, and easy model interpretation [33].
In this study, we first investigated the degree of similarity in summer precipitation over the Poyang Lake basin using K-Means [34] and evaluated the clustering results with the silhouette coefficient [35]; in this way we objectively split the whole Poyang Lake basin into sub-regions using a classical clustering algorithm [36]. With an emphasis on the sub-regional precipitation patterns and their attribution to large-scale climate drivers, we then introduced a decision tree prediction model (based on the C4.5 algorithm) for predicting the summer precipitation patterns in the Poyang Lake basin. The findings of the study will help provide a practical, convenient, and effective model for precipitation prediction in the Poyang Lake basin, and improved precipitation prediction can provide more accurate climate information for policymakers. The rest of the paper contains a description of the study area, data, and methods in Section 2, followed by results in Section 3, discussion in Section 4, and conclusions in Section 5.

Study Area
The Poyang Lake basin is located in the middle and lower reaches of the Yangtze River in China (Figure 1a) and is fed by five major tributaries: the Ganjiang, Fuhe, Xinjiang, Raohe, and Xiuhe. The total drainage area of the Poyang Lake basin is ~162,200 km², covering ~97% of Jiangxi province. The altitude of the basin ranges from ~191 m to >2127 m above mean sea level, as shown in Figure 1b, shaping the basin into a typical valley. The dominant land cover classes derived from MODIS [37] include water bodies, croplands, and forests, with moderate grasslands and urban areas (Figure 1c). The climate of Poyang Lake can be characterized as warm and humid subtropical, with the East Asian monsoon precipitation system as the primary driver of the water cycle. The annual average temperature and rainfall are 17.6 °C and 1639.42 mm, respectively. Precipitation is unevenly distributed across the seasons, with April to June accounting for about 42-53% of the annual total. Extreme precipitation frequently occurs from June to July, while drought is frequent in August. The interannual variability of precipitation is also very large: precipitation in rainy years is almost double that in dry years. This is one of the major reasons for the frequent occurrence of droughts and floods in the PLB.
In recent years, with accelerated urbanization and a growing number of land development projects, several challenges such as soil erosion, changes in the inflow pattern into the lake, sedimentation, and increased pollution of the lake have threatened the regional flora and fauna [38]. Because global temperature is rising and precipitation patterns are changing, with more extreme events projected [39], the Poyang Lake basin needs more detailed studies exploring changes in the water and energy cycle.

Data
The daily precipitation observations used in the study were collected from the Chinese Meteorological Agency (CMA) for a total of 87 meteorological stations, whose distribution is shown in Figure 2a. We adopted the standard normal homogeneity test (SNHT) to ensure the consistency of the data [40]. SNHT is most often used to assess the homogeneity of climate data records; here its primary purpose was to detect outliers or spikes in the dataset that could be attributed to non-climatic factors. The analysis period was 1961 to 2016. To analyze precipitation variations at different time scales, the daily data were also converted into monthly and annual precipitation totals. After quality screening, 83 weather stations were retained. Missing data at some stations amounted to less than ~3%, and such a small fraction is not expected to influence the results appreciably [41]. Figure 2b shows the summer total precipitation climatology over the 1961-2016 study period. The northern part of the PLB received more summer precipitation than the southern part. Most stations recorded summer totals above 550 mm, and the northeastern region had the highest precipitation in the whole PLB, with values above 600 mm, while the central and southern regions received between 450 mm and 550 mm.
The climate index dataset from the National Climate Centre (NCC) (http://cmdp.ncc-cma.net/Monitoring/cn_index_130.php, accessed on 28 June 2021) was used for a potential attribution of the large-scale climate drivers of the Poyang Lake basin. The climate signal during spring (March, April, and May) was used as a predictor to forecast summer rainfall in the sub-regions of the Poyang Lake basin. In the land-atmosphere coupling domain, similar approaches that use a time lag as a forecast indicator have been widely applied [42]. The 130-item climate index dataset was obtained by averaging the March, April, and May values of each index in the database, providing a wide range of candidate forecast indicators of sub-regional precipitation variation in the Poyang Lake basin.
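The spring predictor described above is a plain March-May average per index and per year. A minimal sketch of that averaging follows, assuming a hypothetical in-memory layout in which one index is stored as a mapping from (year, month) to value; this layout, the function name `spring_mean`, and the toy numbers are illustrative assumptions and do not reflect the NCC file format.

```python
from statistics import mean

def spring_mean(monthly_index):
    """Average the March-May values of one climate index for each year.

    monthly_index maps (year, month) -> index value. This layout is an
    assumption for illustration, not the NCC file format.
    """
    years = sorted({year for (year, _month) in monthly_index})
    result = {}
    for year in years:
        values = [monthly_index[(year, m)] for m in (3, 4, 5)
                  if (year, m) in monthly_index]
        if len(values) == 3:  # skip years with an incomplete spring
            result[year] = mean(values)
    return result

# Made-up numbers, for illustration only.
toy = {(1961, 3): 1.0, (1961, 4): 2.0, (1961, 5): 3.0,
       (1962, 3): 4.0, (1962, 4): 4.0}
spring = spring_mean(toy)
```

Years with an incomplete spring are dropped here rather than averaged over fewer months; how the study treated such cases is not stated, so this is only one possible choice.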

K-Means
K-Means clustering is an unsupervised clustering algorithm [43][44][45][46][47] that automatically partitions N samples in G-dimensional space into k predefined, non-overlapping clusters according to their descriptive characteristics. Similarity is high among data within a cluster and low among objects in different clusters. The algorithm selects k initial cluster centers at random, assigns each point to its nearest center based on distance, and recalculates the cluster centroids. Iteration continues until the center positions no longer change, at which point classification stops. The classification process is in effect an error-minimization process: K-Means minimizes the sum of squared distances between all points and their associated cluster centers. This is measured by the SSE (Sum of Squared Errors), calculated as follows:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^{2}$$

where $C_i$ is the $i$th cluster; $p$ is a sample point of $C_i$; $m_i$ is the cluster center of $C_i$; $k$ is the number of clusters, chosen from prior knowledge; and SSE is the sum of squared clustering errors over all points, representing the quality of the clustering. In each K-Means iteration, assigning points to cluster centers minimizes the error with respect to $p$, and relocating the cluster centers minimizes the error with respect to $m_i$.

Atmosphere 2021, 12, 834
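The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard assign-and-update loop and the SSE, not the implementation used in the study; the function names, the seed, and the toy points are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100, seed=0):
    """Basic K-Means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda i: sq_dist(p, centers[i]))
                      for p in points]
        if new_labels == labels:  # assignments unchanged, so centers are stable
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse

# Toy data: two well-separated groups (made-up values).
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, sse = kmeans(toy_points, k=2)
```

In the study each station would be one sample and its summer precipitation series the feature vector; that mapping is an inference from the text rather than something the code above encodes.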

C4.5
The decision tree C4.5 algorithm is a non-parametric supervised machine learning technique that generates tree-like classification rules by induction from data features, usually discrete-valued [48][49][50][51][52]. In the current work, the C4.5 algorithm was used as the classifier, summarizing the classification rules from a set of random instance cases. The model is built as follows. Construction starts from the question "which feature in the feature attribute set U should be tested at the root node of the tree?" The feature attribute with the best classification ability is selected as the root node; a branch is then generated for each possible value of that feature, and the training sample set D is distributed among the appropriate branches. The whole process is repeated at each branch node, using the training samples associated with that node to select the best feature to test there. There were 4 feature parameters (h, w, s, and p) in the feature parameter set U. We used the gain rate of C4.5 to select the best partition feature attributes. The specific steps [53] were as follows:

Step 1: Calculating information entropy. Information entropy is the most commonly used index for measuring the purity of a sample set. The attribute with the highest information gain is chosen as the split attribute of node N; this attribute minimizes the amount of information required to classify the tuples in the resulting partitions. The expected information required to classify the tuples in D is given by:

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

where $m$ is the number of distinct classes in the sample set, and $p_i$ is the ratio of the number of elements of the $i$-th class to the total number of samples.
Step 2: Calculating the information entropy of each attribute. Assume the tuples in D are partitioned according to attribute A, which divides D into n different subsets. The information still required to obtain an exact classification after this partition is measured by:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{n} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)$$

where A is the attribute used to partition D, $|D_j|$ is the number of samples in the $j$-th subset, $|D|$ is the total number of samples in the sample set, and $\mathrm{Info}(D_j)$ is the entropy of the $j$-th subset.
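Step 3: Calculating information gain. The information gain is defined as the difference between the original information requirement (based only on the class proportions) and the new requirement (obtained after partitioning on A):

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$

where $\mathrm{Info}(D)$ is the entropy of D from Step 1 and $\mathrm{Info}_A(D)$ is the entropy after partitioning on A from Step 2.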
Step 4: Calculating the attribute split information. This value represents the information generated by dividing the training data set D into n partitions, corresponding to the n outcomes of a test on attribute A:

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{n} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$$

where $|D_j|$ is the number of samples in the $j$-th partition and $|D|$ is the total number of samples.

Step 5: Information gain rate. The information gain rate equals the information gain divided by the split (intrinsic) information, so the importance of an attribute decreases as its intrinsic information increases. This can be regarded as compensation for the bias of using information gain alone. The information gain rate is defined as follows:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$$
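The C4.5 steps listed above (entropy, attribute entropy, information gain, split information, and information gain rate) can be followed in a short Python sketch. It assumes the attribute has already been discretized into categories; C4.5 itself also handles continuous attributes by searching for thresholds, which is not shown. The function names and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D): entropy of the class labels in a sample set."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(attribute_values, labels):
    """Information gain rate of one categorical attribute."""
    n = len(labels)
    # Partition the labels by attribute value (the subsets D_j).
    partitions = {}
    for a, y in zip(attribute_values, labels):
        partitions.setdefault(a, []).append(y)
    info_d = entropy(labels)
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = info_d - info_a
    split_info = -sum(len(p) / n * log2(len(p) / n)
                      for p in partitions.values())
    # An attribute with a single value has zero split information.
    return gain / split_info if split_info > 0 else 0.0

# Made-up example: four years labeled rainy or normal.
years_class = ["rainy", "rainy", "normal", "normal"]
informative = gain_ratio(["high", "high", "low", "low"], years_class)
uninformative = gain_ratio(["high", "low", "high", "low"], years_class)
```

An attribute that separates the classes perfectly reaches a gain rate of 1 in this toy case, while one that carries no class information scores 0; the tree would test the higher-scoring attribute first.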

Precipitation Clusters
In this work, precipitation observations from 83 meteorological stations were used to identify sub-regions whose precipitation shows similar temporal variation. The summer season, which contributes 54% of the annual precipitation [54], was selected as the primary study period. The K-Means algorithm was first used to separate the regional precipitation into clusters, followed by the decision tree application, in which the climate indices during the spring season were used to predict changes in the precipitation clusters of the basin with a time lag [55,56]. Figure 3 shows the summer precipitation division derived by applying the K-Means clustering algorithm in the Poyang Lake basin. Cluster numbers k = 2, 3, 4, and 5 were tested, balancing a scientific division with convenience of practical application [57,58]. The closer the silhouette coefficient [59,60] is to 1, the better the clustering. Of the values tested, k = 2, with a silhouette coefficient of ~0.5 or slightly higher, gave the best clustering (Figure 3); we therefore set k = 2, yielding two precipitation clusters in the Poyang Lake basin. Figure 4 shows a north-south precipitation distribution divided into two clusters. Based on this spatial pattern, the two clusters are denoted cluster 1 (C1) and cluster 2 (C2), representing the northern and southern basin, respectively, with a distinct boundary between them.
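To make the selection criterion concrete, the sketch below computes the mean silhouette coefficient of a given labeling in plain Python. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The function name and the toy points are assumptions for illustration, and Euclidean distance is assumed because the text does not state the metric.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient of a labeling (needs at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Made-up points: a sensible labeling versus a scrambled one.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(toy_points, [0, 0, 0, 1, 1, 1])
bad = silhouette(toy_points, [0, 1, 0, 1, 0, 1])
```

Repeating this calculation for the labelings obtained with k = 2, 3, 4, and 5 and keeping the k with the highest score mirrors the comparison reported in Figure 3.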
C1 covers 48 of the 83 stations in the PLB (58%), while C2 covers 35 (42%). The regions where C1 and C2 are located are referred to as region I and region II, respectively.
Figure 5 shows the interannual variability of summer precipitation averaged over the whole basin and over each cluster (C1 and C2). The basin-average series is shown in black, C1 in green, and C2 in brown. The results reveal both similarities and differences between the two clusters and the basin-scale mean in the interannual variation of summer precipitation. Statistical significance was assessed with a t-test and passed at the 0.01 level. A striking feature is that the cluster means depart more clearly from the basin mean in recent decades, suggesting that C1 precipitation has become relatively higher than C2 precipitation. Furthermore, significant differences between the two clusters were obvious during the extreme events of 1969, 1983, 1998, and 2011, implying a distinct response of each cluster relative to the mean. Further studies are suggested to explore these aspects in detail, with emphasis on climate-change-induced increases in extreme events [61].
In conclusion, the K-Means approach can provide meaningful output in the atmospheric domain for better identification of precipitation patterns at the sub-regional scale.
Figure 6 shows the mean precipitation of C1 and C2 for each month of the monsoon season, calculated from the 1961-2016 interannual mean. Overall precipitation in C1 was higher than in C2, with differences evident at the monthly scale (Figure 6a). The mean for C1 (C2) was 306 (248) mm in June, 160 (130) mm in July, and 126 (146) mm in August, implying 23.48% more precipitation in C1 during June and 23.07% more during July, whereas during August C1 had a 14% deficit relative to C2. In conclusion, apart from interannual differences, the K-Means clustering technique can further classify and reveal differences in sub-seasonal precipitation, implying its potential for identifying clusters and differences in sub-seasonal atmospheric variables. Such regional-scale deviations are driven by several factors, ranging from local topography and convective activity to large-scale circulation and complex atmospheric modes, which need further exploration [62]. Exploring these drivers in relation to the techniques used here would add complexity to the interpretation of the results, so they are left for a separate study. In the next section, the decision tree is used to predict changes in precipitation attributed to large-scale climate drivers.
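The monthly percentages quoted above are relative differences of C1 with respect to C2. As a quick check, the sketch below recomputes them from the rounded monthly means given in the text; the helper name is an assumption, and because the inputs are rounded to whole millimetres the results (about 23.4%, 23.1%, and -13.7%) agree with the reported values only approximately.

```python
def percent_difference(c1_mm, c2_mm):
    """Relative difference of C1 with respect to C2, in percent."""
    return 100.0 * (c1_mm - c2_mm) / c2_mm

# Rounded monthly means (C1, C2) in mm, as quoted in the text.
monthly_mean_mm = {"June": (306, 248), "July": (160, 130), "August": (126, 146)}
diffs = {month: percent_difference(c1, c2)
         for month, (c1, c2) in monthly_mean_mm.items()}
```

The small gap between these values and the reported 23.48% for June is presumably due to rounding of the published means, although the text does not say so.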

Precipitation Features

Experimental Data Pretreatment
In this section, the decision tree algorithm (C4.5) was used to predict changes in summer precipitation from the long-term precipitation observations. The training set used to build the C4.5 model accounted for roughly 70% of the samples, and the remaining 30% formed the test set, which was used to assess the generalization and prediction capability of the model. Specifically, the data for 1961-2000 (40 years, about 71% of the whole period) were selected as the training set, and the data for 2001-2016 (16 years, about 29%) were used as the test set. We defined a year as rainy when the standardized anomaly of summer precipitation exceeded 0.5 standard deviations, and as normal otherwise. Summer precipitation in the PLB was thus reduced to a binary classification of whether it was above normal or not. Under this criterion, region I and region II each have 16 rainy summers, as shown in Table 1.

Table 1. Distribution of rainy summers by year in the Poyang Lake basin.

| Separation of Poyang Lake | The Year with Heavy Rain in Summer |
| --- | --- |
| Region I | 1969, 1970, 1973, 1977, 1980, 1983, 1993, 1994, 1995, 1997, 1998, 1999, 2010, 2011 |
| Region II | 1961, 1962, 1968, 1973, 1976, 1977, 1982, 1993, 1994, 1995, 1996, 1997, 1999, 2002, 2006 |

Furthermore, the spring climate indices from the NCC were used as predictors of whether summer precipitation in the PLB would be above normal. We obtained 130 spring climate indices by averaging the March, April, and May values of each index in the NCC climate system index dataset, providing the data for building the summer precipitation prediction model.
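The pretreatment described in this section amounts to two operations: labeling each year from its standardized summer anomaly and splitting the years chronologically. A minimal sketch follows, assuming the anomaly is standardized with the population standard deviation over the full record (the text does not say whether the population or sample form was used); the function names and toy totals are illustrative.

```python
from statistics import mean, pstdev

def label_rainy_years(summer_total_mm, threshold=0.5):
    """Label a year rainy (True) if its standardized anomaly exceeds threshold."""
    years = sorted(summer_total_mm)
    values = [summer_total_mm[y] for y in years]
    mu, sigma = mean(values), pstdev(values)  # population std: an assumption
    return {y: (summer_total_mm[y] - mu) / sigma > threshold for y in years}

def chronological_split(years, last_training_year=2000):
    """Split years into a training period and a later test period."""
    train = [y for y in years if y <= last_training_year]
    test = [y for y in years if y > last_training_year]
    return train, test

# Made-up summer totals (mm) for four hypothetical years.
toy_labels = label_rainy_years({1: 100.0, 2: 100.0, 3: 100.0, 4: 200.0})

# The study period: 1961-2000 for training, 2001-2016 for testing.
train_years, test_years = chronological_split(list(range(1961, 2017)))
```

Splitting by time rather than at random keeps every test year later than every training year, which matches how a forecast would actually be used.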

Construction of the Model and Its Verification
Taking "whether the summer precipitation is above normal or not" as the target variable, the input variables of the model were the 130 climate indices of the spring preceding each summer. After preprocessing, the training data were fed into the C4.5 algorithm to obtain the decision trees (Figures 7 and 8). The main predictor of whether the summer precipitation was above normal in region I (C1) was the position of the subtropical ridgeline over North Africa. The dominant predictor for region II (C2) was the Atlantic-European Polar Vortex Area Index. The closer a node is to the leaf nodes, the less significant its predictor variable is. For region I, the learning (training) accuracy of the model was 90%, and verification on the test set gave a test accuracy of 87.5%. For region II, the learning accuracy was 85.0% and the test accuracy was 93.8%.
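To illustrate how a C4.5-style tree chooses which index to place at a node, the sketch below computes the gain ratio, the split criterion associated with C4.5, for a threshold on one continuous predictor. It is a simplified illustration under stated assumptions: candidate thresholds are taken as midpoints between sorted distinct values and ranked directly by gain ratio, the function names are hypothetical, and the sample values are invented. It does not reproduce the full C4.5 procedure (pruning, missing values, and so on).

```python
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split value <= threshold vs. value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * log2(w_left) + w_right * log2(w_right))
    return info_gain / split_info


def best_threshold(values, labels):
    """Pick the midpoint threshold with the highest gain ratio (simplification)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t))


# Invented spring index values and rainy-year labels (T/F), for illustration.
index_values = [10, 12, 15, 20, 22, 25]
rainy = ["F", "F", "F", "T", "T", "T"]
print(best_threshold(index_values, rainy))  # 17.5
```

In a full tree, the same score would be computed for every candidate index, and the index with the best score would become the node; indices chosen near the root therefore separate the classes most strongly.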
These results indicate that the rainy-year forecast model based on the C4.5 decision tree has a certain generalization ability as well as robustness in predicting precipitation deviations from large-scale atmospheric drivers. The model provides a comprehensible, concise, and valuable reference for predicting whether the summer precipitation will be above normal or not. A decision tree has the advantage of being concise and of following the user's logical judgment. From each branch of the tree, a rule of the form "If ... then ..." can be extracted along the path from the root node to a leaf node (T/F). As shown in Tables 2 and 3, all such rules can be collected into a decision rule set. These rules make it convenient to issue predictions with a lead time, supporting early warning of extreme events and better awareness and preparedness. The learning accuracy under each rule is also given for reference.
Table 2. Rule sets of summer rainfall forecast model of "Whether less (lower than 0.5 times standard deviation)" in cluster 1 of the Poyang Lake basin.

Rules
Attributes

To better describe the spatial distribution of summer precipitation in the PLB under different climate conditions of the preceding spring, the spatial distribution of the precipitation anomaly (Figure 9) was derived using the decision-tree rules that separate the conditions associated with above-normal precipitation from those associated with below-normal precipitation. Figure 9 shows the regional precipitation differences under the rules associated with above-normal summer precipitation (rule B and rule D in region I, and rule B and rule D in region II). When the early spring climate indices satisfied rule B of the region I decision tree, precipitation was above normal in region I, especially in its southern and eastern reaches (by more than 300 mm), whereas in some northern parts of region I precipitation was about 50 mm below normal (Figure 9a). When the early spring climate indices satisfied rule D of the region I decision tree, precipitation across the whole of region I was above normal by more than 500 mm (Figure 9b), especially in the mid-east parts of region I (higher than 300 mm).
Table 3. Rule sets of summer rainfall forecast model of "Whether less (lower than 0.5 times standard deviation)" in cluster 2 of the Poyang Lake basin.

Rules
Attributes
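The way a rule set is read off a decision tree, one "If ... then ..." rule per root-to-leaf path, can be sketched as below. The tree here is purely hypothetical: the index names, thresholds, and outcomes are invented placeholders and do not reproduce the rules of Tables 2 and 3.

```python
def extract_rules(node, conditions=()):
    """Return one (conditions, outcome) pair for every root-to-leaf path."""
    if "leaf" in node:
        return [(list(conditions), node["leaf"])]
    name, thr = node["index"], node["threshold"]
    rules = extract_rules(node["low"], conditions + (f"{name} <= {thr}",))
    rules += extract_rules(node["high"], conditions + (f"{name} > {thr}",))
    return rules


def format_rule(conditions, outcome):
    """Render a rule in If ... then ... form."""
    return "IF " + " AND ".join(conditions) + f" THEN rainy_year = {outcome}"


# Hypothetical tree: names, thresholds, and leaves are invented.
toy_tree = {
    "index": "index_1",
    "threshold": 0.0,
    "low": {"leaf": "F"},
    "high": {
        "index": "index_2",
        "threshold": 1.5,
        "low": {"leaf": "T"},
        "high": {"leaf": "F"},
    },
}

rules = extract_rules(toy_tree)
for conds, outcome in rules:
    print(format_rule(conds, outcome))
# IF index_1 <= 0.0 THEN rainy_year = F
# IF index_1 > 0.0 AND index_2 <= 1.5 THEN rainy_year = T
# IF index_1 > 0.0 AND index_2 > 1.5 THEN rainy_year = F
```

Each leaf yields exactly one rule, so a tree with four leaves gives four rules; a per-rule learning accuracy could be attached by counting the training years that reach each leaf.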

When the early spring climate indices satisfied rule B of the region II decision tree, most of region II experienced above-normal precipitation (by more than 50 mm), especially in the eastern parts, where the increase reached up to 100 mm (Figure 9c). When the early spring climate indices satisfied rule D of the region II decision tree, precipitation across the whole of region II was about 100 mm above normal (Figure 9d), and in some western parts of region II it was about 250 mm above normal.

Discussion
The summer precipitation of the Poyang Lake basin is influenced by several factors and patterns. Even under the same weather system, the climate within the basin shows large differences in mean precipitation magnitude and deviations in seasonality. Thus, an accurate prediction of the summer precipitation in this lake basin has important practical implications at the local scale. In this study, we used the K-Means clustering algorithm to separate the summer precipitation in the Poyang Lake basin objectively into two clusters. We then built a C4.5-based decision tree model to predict whether the summer precipitation in each cluster was above normal or not, with insight into the possible large-scale climate drivers.
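For reference, the clustering step summarized above can be sketched with a plain implementation of K-Means (Lloyd's algorithm) and the mean silhouette coefficient used to judge cluster separation. This is a minimal sketch under stated assumptions: the station feature vectors are invented two-dimensional points, the initial centroids are supplied by hand so that the result is deterministic, and the function names are hypothetical. It is not the configuration used in the study.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignment, centroids


def mean_silhouette(points, assignment):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if assignment[j] == assignment[i] and j != i]
        if not own:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for j, q in enumerate(points) if assignment[j] == k)
            / sum(1 for lab in assignment if lab == k)
            for k in set(assignment) if k != assignment[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Invented station feature vectors forming two well-separated groups.
stations = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
clusters, centers = kmeans(stations, [stations[0], stations[3]])
print(clusters)  # [0, 0, 0, 1, 1, 1]
print(round(mean_silhouette(stations, clusters), 2))
```

A silhouette value near 1 indicates compact, well-separated clusters, while values near 0 indicate overlapping ones; real station data would normally give lower values than this deliberately clean example.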
K-Means is one of the most commonly used partition clustering algorithms, valued for its simplicity and efficiency. C4.5 can quickly and effectively identify potential drivers and key information from a large number of complex climate indices, and combine them to select reasonable factors for constructing a prediction model of summer precipitation in the Poyang Lake basin. The results show a significant difference in summer precipitation between the southern and northern parts of the PLB. The K-Means clustering and C4.5 attribution generally agree with previous studies of regional precipitation variation driven by large-scale atmospheric drivers embedded within the complex Earth system. Such forcing ranges from the atmospheric response to oceanic forcing to solar radiation, among others. The accuracy of model forecasting can reach a considerable level [63,64]. Min et al. [65] found that the spring SST in the South China Sea, the Bay of Bengal, and the Arabian Sea was positively correlated with the summer precipitation in the Yangtze River Basin. Lu et al. [66] proposed that the western Pacific subtropical high and the subtropical monsoon had a strong (weak) linkage, whereas the West Indian Ocean circulation was negatively (positively) related to the summer rainfall in the Yangtze-Huai River. Gong et al. [67] pointed out that the Arctic Oscillation (AO) index was negatively correlated with Meiyu. Wang et al. [68] found that the North Atlantic Oscillation (NAO) in the previous winter had little effect on summer precipitation in China, whereas changes in the NAO in the previous spring were significantly correlated with summer precipitation.
All these studies reported linkages between large-scale circulations, acting individually or forced by changes in sea surface temperature, and regional precipitation changes. The attribution in the current study is an initial appraisal using machine learning algorithms, and a more detailed study is planned to examine the dynamics of the indices as forecast indicators of above- or below-normal precipitation in the region.
With the advance of the big data era, computing hardware and computational intelligence have been continuously strengthened, and data mining techniques have been widely used to predict short-term changes in weather and climate, including summer precipitation regimes. In this study, we used machine learning techniques to separate the Poyang Lake basin precipitation into clusters and to find the associated sets of rules linked to changes in precipitation magnitude. We built an effective prediction model of whether the summer precipitation is above normal or not in different regions, which offers a useful reference for the short-term climate forecast of summer precipitation in the Poyang Lake basin. However, compared with traditional statistical methods, machine learning requires more data samples, faster computing hardware, and more complex training strategies. We suggest that there is still room to improve the prediction accuracy as data samples accumulate and as training strategies and parameters are further optimized. In the next step, we plan to use sensitivity experiments to quantify the effects of the dominant climate factors identified in this study on the summer precipitation in regions I (C1) and II (C2) of the Poyang Lake basin.

Conclusions
The study used long-term precipitation observations from a dense network of rain gauges in a diverse climate during 1961-2016 to classify regional precipitation variability and attribute it to large-scale atmospheric drivers. Using machine learning techniques, the study found significant variation in regional precipitation across the Poyang Lake basin (PLB), which is nevertheless commonly treated as a homogeneous precipitation region. The techniques chosen to identify the sub-regional clusters, based on precipitation variability on interannual and monthly scales, were supported by attribution to large-scale indices as predictors of changes in mean precipitation. The specific conclusions are as follows:
(1) Based on the K-Means algorithm, we investigated the level of similarity in the summer precipitation within the Poyang Lake basin. Following the principle of "similarity within classes, dissimilarity between classes," the Poyang Lake basin precipitation is separated into two clusters, a northern and a southern region, namely region I and region II. These two regions are each spatially continuous and mutually independent, with a distinct boundary, and thus meet the criterion of an objective and reasonable separation.
(2) Comparing region I and region II as Cluster 1 (C1) and Cluster 2 (C2), the changes in the summer precipitation of these two regions have their own characteristics on interannual and monthly scales. On the interannual scale, the summer precipitation always exhibited significant differences between the two clusters. On the monthly scale, precipitation in June and July is higher in region I than in region II, while precipitation in August is lower in region I than in region II.
(3) The C4.5-based decision tree prediction model is built to investigate whether the summer precipitation in C1 and C2 is above normal in a given year, defined as exceeding the mean by 0.5 standard deviations. The learning accuracy of the model in C1 and C2 is 90.0% and 85.0%, respectively, and the test-set accuracy is 87.5% and 93.8%, respectively. For region I and region II, the root-to-leaf paths of the decision trees yield four and five decision rules, respectively, forming concise rule sets. Each rule has its own learning accuracy, which is convenient for practical application.
(4) According to the decision tree models, the main climate factors of the summer precipitation in region I are changes in the subtropical ridgeline over North Africa and the 850 hPa Western Pacific trade wind. The dominant factors of whether the summer precipitation is above normal or not in region II are the Atlantic-European Polar Vortex Area Index and changes in the number of sunspots.
The study thus concludes that machine learning techniques have valuable potential in atmospheric sciences for better decision making and feature attribution. Further studies should include a large-scale dynamics-based verification of the techniques