Chinese Sunspot Drawings and Their Digitizations-(VI) Extreme Value Theory Applied to the Sunspot Number Series from the Purple Mountain Observatory

: Based on the daily sunspot number (SN) data (1954–2011) from the Purple Mountain Observatory, the extreme value theory (EVT) is employed for the research of the long-term solar activity. It is the ﬁrst time that the EVT is applied on the Chinese SN. Two methods are used for the research of the extreme events with EVT. One method is the block maxima (BM) approach, which picks the maximum SN value of each block. Another one is the peaks-over-threshold (POT) approach. After a declustering process, a threshold value (here it is 300) is set to pick the extreme values. The negative shape parameters are obtained by the two methods, respectively, indicating that there is an upper bound for the extreme SN value. Only one value of the NNN -year return level (RL) is estimated: NNN = 19 years. For NNN = 19 years, the RL values of SN obtained by two methods are similar with each other. The RL values are found to be 420 for the POT method and the BM method. Here, the trend of 25th solar cycle is predicted to be stronger, indicating that the length of meridional forms of atmospheric circulation will be increased.


Introduction
It is well known that the Sun is a dynamic and complex body, releasing energy on all wavelength intervals of the electromagnetic spectrum [1]. Solar activity has an important impact on space climate, space navigation, and high-frequency radio communications [2]. Research based on long-term solar activity helps reveal the physical connections in the solarterrestrial space environment and the dynamic process operating in the solar interior [3,4]. Sunspots are the earliest observed solar activity, which are the most significant observational features on the photosphere [5]. The greatest observation of solar activity is represented by the time series of sunspot number (SN) [6,7]. A high probability of a strong solar storm is related to high values of the SN index [8].
Extreme value theory (EVT) is a branch of statistics, which offers techniques for the research and estimation of probabilities of predicting rare events [9]. In recent years, as a special and important tool, the extreme value theory (EVT) has been applied in various science fields and got some marvel findings. EVT has been used on climate change [10][11][12][13], economics [14], engineering [15], and astrophysics [16][17][18]. For example, Ref. Siscoe (1976) [19] used the EVT to analyze the largest SN per solar cycle (SC) and the result supported the argument for a change in SC statistics during the Maunder minimum. Ref. Asensio (2007) [20] applied the extreme events of the SC to investigate the laws of extreme change and the results indicated that the SC 19 was the most probable time of maximum value. Ref. Acero et al. (2017) [21] applied the EVT on the new version of the SN index at daily, monthly, and yearly timescales, and the results showed that the return level (RL) was reaching a plateau. Furthermore, they found that the daily SN values would hardly be greater than 550 in the following 1100 years. Ref. Elvidge et al. (2018) [22] used EVT to determine the probability of solar flares of Carrington-like, and the results of the EVT were consistent with flare data from the Kepler space telescope mission. The millennial years 10Be and 14C time series were used to study the solar activity [23] and the solar radio of flux at the daily scale were also used for the research of the extreme events law [24]. Ref. Love (2020) [25] did some experiments on statistical models of extreme values of magnetic superstorm intensities, and the discrete samples of the disturbance storm time (Dst) data were found to be insufficient to provide usefully accurate estimates of a statistically theoretical upper bound for magnetic storm intensity.
In these above works, two traditional methods were usually used to study long-term solar indicators. The first one is the block maxima (BM) approach [9,19], which has been used to focus on the generalized extreme value (GEV) distribution. Another one is the peaks-over-threshold (POT) approach [9], which has been used to focus on the generalized Pareto (GP) distribution.
In this paper, the daily SN data, which are from the Purple Mountain Observatory during the period of January 1954 to December 2011, are used to study the extreme events of the long-term solar activity and the RLs of the extreme events. The structure of this paper is as follows. In Section 2, the data and methods are introduced. In Section 3, we shows the experimental analysis results. Finally, Section 4 is about discussion and conclusions.

Data
In the papers in this series, Ref. Lin (2019) [26] used the Chinese historical handdrawing sunspot records to make a compilation of the data. Ref. Sun et al. (2019) [27] analyzed the accuracy of the data from the Purple Mountain Observatory. Ref. Wan et al. (2020) [28] studied the quasi-biennial oscillations of sunspot activity based on this dataset. Therefore, the data are reliable and can be used for scientific research. In recent years, much work of solar extremes are based on the SN from World Data Center, Sunspot Index and Long-term Solar Observations (WDC-SILSO). More details about the SN can be found in the article [29]. However, in this work, the daily SN data during the period of January 1954 to December 2011 are from the Purple Mountain Observatory. Here, the SN is the number of the sunspots observed in one day, and what we analyzed in our work is not images. The data for the entire 24 cycle are not processed because the data from the Purple Mountain Observatory are just from 1954 to 2011. The dataset is composed of a period of 58 years. Figure 1 shows the daily values of the SN during this period.

Methods
EVT is a tool to handle with the excesses for the distribution, and to characterise whether the levels of the behavior of a process is at small or large [9]. EVT is applied for the research of the extreme values in the timeseries, and for the purpose of predicting the probability of occurrence of rare events. Generally, there are two main methods for the research of extreme events. The first method is the BM approach focusing on the GEV distribution [30,31]. Another one is the POT approach focusing on the GP distribution.
Both methods are used to analyze the extreme values, there are also differences between them, such as the defined way of the extreme events [20]. There are many advantages for BM approach. For example, its theory is simple [20], and it is a better choice when data are not completely independent and identically distributed [32]. However, it is a weakness for it to use the data inefficiently. So, POT is the fundamental method for the analysis of extreme values because of efficient utilization. In our work, two methods are both applied to data for better observing the results and comparing the results with two methods. Considering the BM approach, the first step is that we cut the data into blocks with equal length and the maximum value of each block is obtained. The second step is that the GEV distribution is fitted to the BM. GEV is a type of "law of large numbers", with a probability distribution function, please reference to [9,21].
As for POT, the first step is to set the value of u u u, which means the upper threshold. The second step is that we consider all sample values exceeding the u u u. Then the GP distribution is fitted to the POT. The threshod is necessary but difficult, which can not be too high or too low [9]. There are two criteria for the selection of u u u, the assessment of the stability of the parameter estimates and the mean residual life plot. For the definition of the GP distribution, please reference to [9,21].

Block Maxima Approach
First, we cut the data into blocks of equal length and the block length is set to 365, which means that every block consists of data from one year. Figure 2 shows the SN divided into 58 blocks with vertical lines, indicating the blocks we got and the maximum value of each block. After that, the GEV distribution is fitted to the BM, and the method of estimating the parameters is the maximum likelihood estimation. In Figure 3, the precision of the GEV model evaluated by standard diagnostic plots is shown. The quantile-quantile (QQ) plot confirms the validity of the fitted model, and compares the empirical data quantiles and the GEV fit quantiles (a). The consequence of plotted points falls onto an straight line. The QQ-plot shows the randomly generated data, which is from the fitted GEV, against the empirical data quantiles (b), and the 95% confidence bands of the QQ-plot get close to be linear. Then, the empirical density of the observed maximum seems to be consistent with the fitted density (c), stating that a good fit exists between the two. In Figure 2 Figure 4, the yearly maximum of SN value is about 420 in the future 2030. Comparing with the yearly maximums of SN value during from 2012 to 2029, the trend of RL is upward. Table 1 shows the values of three parameters of GEV distribution and their 95% CIs. µ µ µ is the location parameter, σ σ σ is the scale parameter, and ξ ξ ξ is the shape parameter [9]. The shape parameter and its 95% CI are always negative, indicating that the distribution has an upper bound and there is the maximum extreme value. Table 2 lists the estimates of the RL for the daily time series and for 19 years. We also try to set the block length to 730, which means that every block consists of data from two years, and effective sample data are less. The results of diagnostic plots are not as good as the results of Figure 3.

Peaks-Over-Threshold Approach
Choosing u u u is the most important step. Shown in Figure 5, an approximately linear relationship exists in the interval [250, 300] of the threshold u u u in the mean residual life plot, when the CI becomes larger. There is the stability of the parameter estimation, which is fitted by the GP distribution over a range of thresholds. Shown in Figure 6, it can be found that the shape parameter and reparameterized scale parameter change slowly when u are set to 260, 280, 300. The effects of three standard diagnostic plots with u u u = 300 are best, so 300 is chosen. An extreme value over a threshold is usually accompanied by exceedances of adjacent thresholds, leading to a cluster of consecutive exceedances [9]. However, the EVT requires the maxima to be independent so that there will not be any problems with short-range dependencies. Declustering procedure, which consists of locating clusters in the exceedance over the threshold and representing them by the maximum value inside each cluster, are applied to the sample data with r r r = 13 to make them independent, and here r r r is the run length. Because the solar global rotation is 27 and half of it is 13.5 [33], we need an integer, so 13 is chosen. In the declustering procedure, if the exceedances are separated by less than a number of r r r, exceedances are marked as the same cluster. Figure 7 shows the threshold exceedances histogram, which indicates the distribution over the last six SCs included 182 exceedances in this timescale. The numbers of threshold exceedances are 68 of the SC 19, 3 of the SC 20, 33 of the SC 21, 55 of the SC 22, 23 of the SC 23, and 0 of the SC 24, respectively. Figure 8 shows the decluster data above a threshold. Then we can get the new independent threshold exceedances. The next step is to apply the GP distribution to the independent threshold exceedances. The method of estimating the parameters is also the maximum likelihood estimation and the estimations of the two parameters ξ ξ ξ, σ σ σ of the GP distribution are −0.08 and 47.98. In Figure 9, the precision of the GP model evaluated by standard diagnostic plots is shown. The QQ-plot confirms the validity of the fitted model, and compares the empirical data quantiles and the GP fit quantiles (a). The consequence of plotted points falls onto an straight line. The QQ-plot shows the randomly generated data, which is from the fitted GP, against the empirical data quantiles (b), and the 95% confidence bands of the QQ-plot get close to be linear. Then, the empirical density of the observed maximum seems to be consistent with the fitted density (c), stating that a good fit exists between the two. Therefore, the diagnostic plots support the model's accuracy.  Besides, r r r = 27 is set for the sample data and the value is about the solar global rotation. r r r = 54 is also set for our data and it is about two solar global rotations. Both their results of diagnostic plots are not as good as the results of Figure 9, indicating that 13 is the best choice.   Then in Figure 10, the estimate and its 95% CI of the RL for N N N = 19 years are obtained by bootstrapping. Table 3 shows the data in which we use the bootstrapping to obtain for the two different parameters of estimated GP distribution and their 95% CIs. The shape parameter is negative and the 95% CI is also negative, implying that there is an upper bound of the extreme SN distribution. Combining the results of three standard diagnostic plots in Figure 9 and these two parameters in Table 3, it could be seen that there exists a good GP distribution. Table 4 lists the estimates of the RL for the daily time series and for N N N = 19 years, indicating that the yearly maximum of SN value is about 420 in the future 2030. Comparing with the yearly maximums of SN value during from 2012 to 2029, the trend of RL is upward.

Discussion and Conclusions
In this work, the SN data from the Purple Mountain Observatory at the daily scale are used for the research of the extreme SN based on the EVT. Two methods, the BM approach and the POT approach, are used. Considering the extreme values in the future, we estimate the RLs for N N N = 19 years. The shape parameters in Tables 1 and 3 are negative and the 95% CIs are also negative, indicating that there will be an upper bound of the extreme SN distribution for two distributions, so the RLs for two distributions can be estimated. In the Tables 2 and 4, the results of the RL for N N N = 19 are both 420, implying that the yearly maximum of SN value is about 420 in the future 2030. The values obtained by these two different methods are same. The trends of RL are upward in Figures 4 and 10, showing that the solar activity of 25th is obviously stronger than SC 24, and there will be an increase in the length of meridional forms of circulation [34][35][36].
There are many differences between other work and our work. For example, comparing with [21], it is the first time that we use Chinese data to study the extreme values. There are data of 58 years in our work and there are about data of 300 years in [21]. For the accuracy of the data, the more data, the better the fit. The results of three diagnostic plots in our work would be as good as results in [21] if we have more sample data. POT is usually used for the analysis of the extreme values in astrophysics [20,[22][23][24], and we use both methods to analyze the results. Table 5 lists a selection of prediction for the trend of 25th solar cycle. The solar activity cycle is predictable in nature, but the high-accuracy prediction should only be done for short-to mid-term due to its intrinsically dynamical complexity [37]. In the past few decades, many researchers have predicted SN by different methods. Ref. Wu et al. (2021) [38] used the two-parameter modified logistic prediction-extension (TMLP-E) models to predict SN of the SC 25 and SC 26, and the amplitudes of these two SCs were predicted to be at the same level as that of SC 24. Ref. Kakad et al. (2020) [39] suggested that the trend of peak SN was stronger. Ref. Sarp et al. (2018) [40] predicted that the solar maximum of SC 25 was greater than that of SC 24. Ref. Li et al. (2018) [41] forecasted the SC 25 by applying the bimodal distribution and found that the trend of solar activity was stronger. In this work, we apply the EVT to the Chinese SN, and find that the distribution of the daily SN data has an upper bound, and in Table 5, we can find that the trend of our prediction is consistent with these previous predictions [39][40][41]. Comparing with previous research, we study the daily SN data from the Purple Mountain Observatory at the daily scale by the EVT. Our results are constant with, and further support the previous works.  Informed Consent Statement: Not applicable.

Data Availability Statement:
The datasets generated for this study are available from the Purple Mountain Observatory.