Statistical Learning of the Worst Regional Smog Extremes with Dynamic Conditional Modeling

: This paper is concerned with the statistical learning of the extreme smog (PM 2.5 ) dynamics of a vast region in China. Differently from classical extreme value modeling approaches, this paper develops a dynamic model of conditional, exponentiated Weibull distribution modeling and analysis of regional smog extremes, particularly for the worst scenarios observed in each day. To gain higher modeling efﬁciency, weather factors will be introduced in an enhanced model. The proposed model and the enhanced model are illustrated with temporal/spatial maxima of hourly PM 2.5 observations each day from smog monitoring stations located in the Beijing–Tianjin–Hebei geographical region between 2014 and 2019. The proposed model performs more precisely on ﬁttings compared with other previous models dealing with maxima with autoregressive parameter dynamics, and provides relatively accurate prediction as well. The ﬁndings enhance the understanding of how severe extreme smog scenarios can be and provide useful information for the central/local government to conduct coordinated PM 2.5 control and treatment. For completeness, probabilistic properties of the proposed model were investigated. Statistical estimation based on the conditional maximum likelihood principle is established. To demonstrate the estimation and inference efﬁciency of studies, extensive simulations were also implemented.


Introduction
Modeling extreme climatic conditions [1,2], extreme weather [3][4][5][6][7][8], and rather harmful air pollutants, together with their social, economic, political, and human impacts, is a contemporary research topic. It was concluded by [9] that extreme weather is the new normal. References [10,11] studied the air quality in the USA, and air pollution and health effects in a pyramid figure of effects. Smog as a more serious type of harmful air pollutant has been drawing more and more attention recently. Studies on the masses and chemical compositions, as well as the concentrations, formation, and source of smog have been done [12][13][14][15][16][17][18][19]. Considering the significant impact that some meteorological conditions may have on the PM 2.5 concentrations [20][21][22], statistical approaches, as well as physical and chemical ones, were also conducted [23][24][25]. Except for the studies based on annual averaged PM 2.5 data [26][27][28], the cases when extreme smog happens are especially concerned [29][30][31][32][33]. For example, public attitudes and responses to the first two red warnings for air pollution in Beijing in 2015 were examined [34].
China's smog problem stands out in its extremely high frequency, long duration, and high concentration. There are eleven long-lasting rounds of severe smog in 2015, which occurred mainly during the last two months of that year. On 30th November 2015, the PM 2.5 concentration in Beijing and the south of Hebei exceeded 900 µg/m 3 and even reached 976 µg/m 3 at Liuli River station, Beijing. The worst smog in 2016 started from 16th December and ended on 21st, covering 17 provinces; i.e., over one seventh of the national territory area. Three fourths of the cities are located in the Beijing-Tianjin-Hebei region and its surrounding regions. The PM 2.5 concentration in the downtown of Shijiazhuang city even broke 1000 µg/m 3 . These observations clearly direct us to be more concerned with the statistical learning of extreme smog (PM 2.5 ) dynamics of a vast region in China. For example, classical extreme value analysis to hourly PM 2.5 data from 2014 to 2016 in China have been obtained [35]. It is also worth noting that the smog problem is not a unique phenomenon in China. It occurs elsewhere in the world, especially in developing countries; e.g., Dehli, India [36].
Understanding the extreme features of smog problems in China is very important since severe smog levels are more dangerous than ordinary levels and will do greater harm to China's public health [37][38][39][40][41][42]. For example, in 2013, 83% of the whole population in China was exposed to the air pollution with PM 2.5 level exceeding 35 µg/m 3 , which might cause 1.3 million premature mortalities [43]. The smog also causes a significant economic loss for China [44,45] and even for the whole world [46,47].
Using classical extreme value theory, one could fit the generalized extreme value distribution to extreme observations recorded from each of those hundreds of smog monitoring stations. Recently, Dombry [48] studied properties of the maximum likelihood estimators for the extreme value index within the block maxima framework. Studies [49,50] are the most recent ones dealing with spatial extremes and processes. Study [51] proposed a dynamic modeling approach, the autoregressive conditional Fréchet (AcF) model, for maxima of daily negative log returns of 100 stocks in S&P100. These new models and modeling approaches can certainly be applied to the extreme observations in smog extremes in this study. However, static extreme value models do not offer a dynamic view and may be of less interest to administrators; in addition, the AcF model does not have good performance in climate extremes either. Different from published work in the literature, a new study approach is applied to smog extremes in our work, which intends to integrate a new type of extreme value modeling and dynamic modeling into a dynamic conditional distribution modeling and analysis of regional smog extremes, particularly the worst scenarios observed in each day. The results show a significant improvement compared with using existing extreme value modeling approaches. In addition, weather factors will be introduced in the model to gain higher modeling efficiency. The proposed model and the enhanced model are illustrated with real data of hourly PM 2.5 observations during 2014-2019 from smog monitoring stations located in the Beijing-Tianjin-Hebei geographical region. In particular, we studied the worst smog dynamic scenarios in the vast region of Beijing-Tianjin-Hebei in China. For joint regional extreme event monitoring and control, knowing first what can be the worst scenario in a day clearly will be very useful for administrators to decide whether or not to make warnings and some necessary control treatments. This paper enhances the understanding of how severe extreme smog scenarios can be and provide useful information for the central/local government to conduct coordinated PM 2.5 control and treatment. In the literature, researchers have paid much attention to extreme values in environmental studies; e.g., extreme temperature [2], precipitation [2,[4][5][6], snowfall [52], and biomedical physics in brain image analysis [53], among many others. Our new model can certainly find applications in these areas. On the other hand, our model can also be applicable to studying systematic risk in financial systems [54], which is a contemporary research topic. Moreover, in terms of systematic risk study in air-quality control and treatment in regional risk and hazard management, our model can be very useful.

Roadmap
The rest of the paper is organized as follows. Section 2 conducts a preliminary analysis of smog in a vast region of Beijing-Tianjin-Hebei. In Section 3, we introduce our dynamic model by specifying a latent independent standard random process of standard Weibull random variables. Probabilistic properties of stationarity and ergodicity of the proposed model are investigated. Statistical estimation based on the conditional maximum likelihood principle is established. To demonstrate the estimation and inference efficiency of studies, extensive simulations are implemented in Section 4. Section 5 contains advanced modeling of smog extremes together with weather data. Section 6 offers conclusions of the paper regarding our applied study of smog extremes. Technical arguments are deferred to Appendix A.

Preliminary Analysis of Smog in the Vast Region of Beijing-Tianjin-Hebei
2.1. Which Time Scale of PM 2.5 Data Is to Be Analyzed?
When considering extreme smog as an extremely harmful air pollutant which may last a few hours in a day or last a couple of days, it becomes rather meaningful to study the hourly characteristics of radical smog movements, instead of daily traits. One can consider an analysis of daily data as well. We will not consider this time scale in this applied project.
To adequately demonstrate the variation of smog in different years, data from 2014 to 2019 are used to fit our proposed model. The first three months' data of 2020 are used to check the predictability of our model. All the data are from the China National Environmental Monitoring Center. Due to technological testing, transition problems, hardware failure, possible delayed updates, etc., some of the data are missing and taken as non-observable smaller values to the observed daily maxima in this study. As shown in Figure 1a, in January 2014, most of the monitored areas spanning the north, east, and middle of China (especially those red areas) had quantiles exceeding 150.4 µg/m 3 , which is the starting point of the very unhealthy level of air quality according to the US standard [55]. There are five intervals of PM 2.5 levels according to the US (EPA) standard: (1) 35 µg/m 3 is the highest PM 2.5 level for the good and moderate category; (2) the range from 35 to 55.4 µg/m 3 is unhealthy for sensitive groups;

The Geographical Region to Be Focused on
(3) the range from 55.5 to 150.4 µg/m 3 is unhealthy; (4) the range above 150.4 µg/m 3 is widely viewed as very unhealthy; and (5) it is hazardous when the smog level is above 250.5 µg/m 3 . We see then that south Hebei, Beijing, and Tianjin showed the highest smog levels with the 90% quantile values being close to an outstanding 500 µg/m 3 , which is the maximum value given in the US standard. A similar conclusion can be drawn from Figure 1b is that in December 2014, south Hebei, Beijing, and Tianjin were still the worst air quality areas, although the quantiles of the smog values in most areas of China were much lower than their counterpart values in January. In January and December 2019 (as shown in Figure 1c,d respectively), the thresholds among all of China decreased significantly; that shows great improvement in extreme smog problems; south Hebei, Beijing, and Tianjin are still some of the key regions with the severest smog conditions (relatively higher threshold). Monthly 90% quantiles of 2015 and 2018 confirm this conclusion as well.
This study focuses on the smog problem in the Beijing-Tianjin-Hebei region; i.e., not all of mainland China, whose smog issues may be too diversified to draw consensus conclusions about. The importance of focusing the study on the smog problem in the Beijing-Tianjin-Hebei region is due to the fact that the areas are not only geographically connected but also economically significant in their immense contribution to China's GDP (nearly 9% in 2019 with 8% of population). With the implementation of the collaborative development of the Beijing-Tianjin-Hebei region as a national strategy, the coordination among these three areas has been becoming more and more intimate. It has been taken as an indivisible whole, especially in the issues of smog prevention and control.
The smog problem in this region has been recognized on account of health concerns as early as in January 2013, when long-lasting severe smog prevailed and caused a significant increase in the number of patients with respiratory tract infections and allergic symptoms in hospitals and clinics in Beijing, Tianjin, and Shijiazhuang (the capital city of Hebei province). Research also indicated that bad air quality in northern China causes 5.5 years reductions of people's life expectancy [56].
In the Beijing-Tianjin-Hebei region, besides two metropolitan cities, Beijing (the capital of China) and Tianjin (a major port city in northeastern China), there are 11 cities in Hebei province. Among these 11 cities, Zhangjiakou and Chengde are located in north Hebei; Qinghuangdao, Tangshan, and Cangzhou are in east Hebei; Baoding and Langfang are located in the middle of Hebei; Shijiazhuang, Hengshui, Xingtai, and Handan combine south Hebei. Their locations along with Beijing and Tianjin are shown in Figure 2. The number of smog monitoring stations in the region nearly remained the same during 2014-2019 (after deleting stations still under test, the numbers of valid stations are around 80 for all years). Worth noticing is that the station type is the key parameter to explain diurnal variability of atmospheric pollutants due to sources proximity. In this paper, however, since only the daily maxima of PM 2.5 across the whole region rather than a spatial model is considered, the station type and geographical information which describe the spatial relations are not used in the study.

Why Model the Extremes Rather than the Average Levels?
This paper focuses on the extreme smog instead of average levels for two reasons. Firstly, the extreme smog rather than the ordinary smog affects the public health and causes more loss to the welfare of the whole society [36,38,41]. Secondly, when investigating the severity of smog, the results can be very different from the use of the annual averages or the extreme values of PM 2.5 data. Taking stations of Zhangjiakou and Chengde during 2014-2016 as an example, their annual average PM 2.5 levels rank as the best 30% among the whole country, showing relatively good air quality on average. However, as far as the means of extreme values (here extreme values refers to those PM 2.5 levels above the annual 90% quantile) are concerned, those stations belong to the worst third of the whole country. This difference in ranks gives two implications. On the one hand, the poorer grades using more extreme values than those using the annual averages show that the air quality of Zhangjiakou and Chengde may not be as good as one has thought; i.e., a city that performs well on average does not guarantee an absence of extreme smog. On the other hand, for Zhangjiakou and Chengde, while their extreme PM 2.5 levels were relatively lower most of the time, there were still a few occurrences of very high PM 2.5 levels to which considerable attention should be paid. The histograms of hourly PM 2.5 extreme values for those stations also confirm such arguments. One of those histograms at one representative station in Zhangjiakou in 2014 is given in Figure 3 as example. For studies on the average PM 2.5 level rather than the extreme PM 2.5 value, we refer readers to [26][27][28].

The Study Approach and the Inclusion of Meteorological Variables
With the established arguments in the prior section, this applied study focuses on the extreme values rather than the average values of PM 2.5 data. Ideally, one should build a model to describe extreme smog level dependencies among monitoring stations in the region. Then, all analyses and inferences, even policy recommendations, are to be based on the constructed model. To achieve this goal, a model builder has to consider many location/spatial parameters and extreme spatial dependencies, which can be very complicated and hardly implementable in a multivariate/spatial extreme value and time series context. We note that in the literature, workable and meaningful air quality models, including a comprehensive air quality model with extensions (CAMx), the community multiscale air quality modeling system (CMAQ), etc., have been developed for various applications and for developing public policies. In this study, we adopt an alternative approach in modeling regional extremes in a time series modeling framework. This paper tries to model the maximum PM 2.5 values of 24 h a day among all stations in the Beijing-Tianjin-Hebei region. Here, the extreme value represents the highest daily PM 2.5 concentration of the area as a whole. It is particularly useful when planning smog prevention and control measurements, since such measurements as well as the smog pre-warning systems are integrated and unified in the Beijing-Tianjin-Hebei region. The pre-warnings should be activated and corresponding measurements should also be taken so long as the predicted PM 2.5 level at one single station inside the region exceeds a specific breakpoint wherever the station is. In this circumstance, only the maximum value of the whole part rather than the particular concentrations of all individual stations is concerned.
The line graphs of the regional daily maximum (based on hourly observations) PM 2.5 levels from 2014 to 2019 are plotted in Figure 4a. It can be seen that extreme smog levels were much worse in 2014 since its extreme values were mostly much higher compared to the same period of the following five years. It is further demonstrated in Table 1 that the extreme values in 2014 have a much higher sample mean, median, maximum and standard deviation; they improved significantly in recent years with these statistics decreasing in general. More importantly, the maximum PM 2.5 levels over the years are persistent with an exceptional high value in 2014. Meanwhile, the skewness and kurtosis are persistently larger than those from a Gaussian distribution. The extreme values also show significant seasonality for all six years in that their sample means and standard deviations are much higher in the first and fourth seasons than the other two seasons, as shown in Figure 4b. We shall consider these dynamic characteristics in our model building.   In the literature, extensive research shows that climate factors may cause a significant impact on the PM 2.5 levels [20,21] at one location. These meteorological conditions include temperature, humidity (the humidity mentioned in this paper is relative humidity (%)), wind speed, wind direction, etc.; see [23,24]. Taking Beijing as an example, Figure 5 is the daily maximum PM 2.5 level of Beijing in 2018 that is computed using the hourly PM 2.5 levels among all monitoring stations in Beijing. The selected meteorological conditions are Beijing's daily maximum and minimum temperature, daily maximum and minimum humidity, daily maximum wind levels, and its related wind direction in 2018, as shown in Figure 6. All the data are from the National Meteorological Information Center.  The annual maximum PM 2.5 level in Beijing is around 350 µg/m 3 in 2018, which is much higher than the hazardous breakpoints (250 µg/m 3 ) according to the US standard. Although the annual maximum of the PM 2.5 values in Beijing is lower than that of the regional extremes given in Table 1 (850 µg/m 3 ), around one tenth of the daily regional extremes are observed from stations in Beijing. Actually, the locations of the highest daily regional extremes are well spread among cities in the Beijing-Tianjin-Hebei region, which exactly demonstrates that modeling the regional extremes instead of local PM 2.5 dynamics is very important. Besides, the smog in the first and last seasons is much more sever than that in the other two seasons; meanwhile, the daily minimum and maximum temperatures in the first and last seasons are much lower, showing potential negative correlations between the PM 2.5 level and temperature. As for wind and humidity, they affect the concentration of smog in different ways: it is known that humidity affects the atmospheric chemistry and the formation of secondary pollutants such as particulate matter and ozone. However, wind influences the concentrations of the trace gases, which react at rates determined by their concentrations. The fourth level wind coming from the north or northeast most often occurs in the first season, which tends to decrease the PM 2.5 level. However, the smog is still severe when the wind becomes weak. In this season, the differences between the daily maximum and minimum humidity are relatively larger (mainly caused by the lower minimum humidity), which is an adverse diffusion condition for the smog [57]. In the second and third seasons, winds with the third level coming from the south and south-west dominate. In these two seasons, the smog problem is much less sever and both the maximum and minimum humidity are relatively high (resulting in lower humidity difference). In the last season, winds seem to be weaker with the second level coming from the northeast, and both the daily maximum and minimum humidity are barely low (with a relatively smaller minimum humidity), which could be reasons leading to the severe smog. Generally speaking, it is observed that the lower the wind level and the higher the humidity difference (or the lower the minimum humidity), the higher the daily PM 2.5 extremes. These observations coincide with the general results from other studies [57,58]. More details about the specification of covariates (climate factors) when modeling the regional daily maximum PM 2.5 levels will be addressed in Section 5.2. The data source is also the National Meteorological Information Center.

Model Specification
Suppose X tij is the PM 2.5 level on day t at time i (hour) of the jth station in a group containing m stations. Define Q t as An illustration of daily regional smog extremes' time series {Q t } is given in Figure 4.
In the extreme value theory, under suitable conditions ( [59]), the normalized maximum of a sequence (a group, a block, or an array) of random variables is distributed as a generalized extreme value (GEV) distribution in its limit when the size of the sequence tends to infinity. In view of the definition in Q t , random variables X tij 's are potentially showing time dependence and spatial dependence, and the distributions can be different. As a result, Q t can be hardly distributed as a GEV random variable, and the dependence between Q t and Q t−1 cannot be modeled using the models (e.g., [51,60]) developed for GEV distributed random variables.

The Proposed General Model
Inspired by the AcF model (autoregressive conditional Fréchet model) proposed by [51], we propose the following model for smog extremes modeling: ( where µ t is the lowest level of PM 2.5 in the region on day t, σ t > 0 is the scale parameter, α t > 0 is the shape parameter, and {Y t } t≥0 is a sequence of i.i.d. exponentiated Weibull (unit exponential) random variables. One notices that the standardized Q t , (Q t − µ t )/σ t follows an exponentiated Weibull distribution with the shape parameter α t . By introducing the additional parameter α t , the distributions of Q t possess great variability and flexibility. In the literature, the exponentiated Weibull distribution has been applied to model climate extremes; e.g., in flood modeling in ( [61]). Unlike the AcF model, wherein unit Fréchet distributions are assumed to capture the heavy-tailedness of financial data, Weilbull-distributed {Y t }s are considered here to fit the characteristics of smog data which are not necessarily being heavy-tailed. It has been observed in [35] that the station-wise extremes in the Beijing-Tianjin-Hebei region belong to the maximum domain of attraction of Weibull type, which leads to a reasonable assumption of Y t being standard Weibull distributed. To be more rigorous, we also fit the AcF model in Section 5 for comparison.
The next stage is to build a time series model for the {Q t } series. Following the arguments above and those in [51], to reduce the model complexity, we treat µ t as constant µ and interpret µ as the lowest level of PM 2.5 fpr all the time across the Beijing-Tianjin-Hebei region, and treat σ t and α t as dynamic.
Noticing that the smog levels can be affected by the past smog levels and weather conditions, we assume the following dynamic equations for the time varying parameters: in which T p t , Hu t , Wds t , and Wdd t denote the maximum temperature, minimum humidity, mode of wind speed, and mode of wind direction respectively. These four weather condition variables have been commonly discussed in the smog study literature; e.g., [62,63], and the references therein.
Other weather factors can also be considered. However, adding too many variables increases the model complexity and weakens the effects of important variables. For this reason, we only focus on the dynamics of variables given in (3) and (4). We call model (2)-(4) the dynamic conditional Weibull (DCW) model. In this paper, we consider functions η σ (.) and η α (.) defined in (3) and (4) as exponential functions which are widely adopted in the literature, e.g., [51], due to their flexibilities in functional properties of boundedness, differentiability, and monotonicity. It is worth noting that the functions of η σ (.) and η α (.) given in (3) and (4) can also take other forms, as long as they meet some conditions that guarantee the dynamic processes being stationary and ergodic. For clarity and simplicity in the proofs of stationarity and ergodicity, we first omit terms related to weather factors and express model (2)-(4) as: Moreover, we will discuss the model related to weather factors in Sections 4 and 5.

Theorem 1. (Stationarity and ergodicity.) If we have
In Section 3.2, we study our parameter estimation procedures and characterize asymptotic behaviors of the estimators.

Parameter Estimation and Asymptotic Properties
We begin by introducing some notation. We denote as the parameter space for the estimation problem in (5)- (7) and set the true parameter as . After letting (σ 1 ,α 1 ) be an arbitrary initial value, we then denote (σ t (θ),α t (θ)) as the t-th iterate generated from model with initializer (σ 1 ,α 1 ) and an arbitrary parameter θ in Θ s . In addition, (σ 0 t , α 0 t ) is denoted as the t-th iterate generated from the model with true (σ 1 , α 1 ) and θ 0 . Moreover, we also denote (σ t (θ), α t (θ)) as the values generated from the true initializer (σ 1 , α 1 ) and an arbitrary θ in Θ s . With the known {µ,σ t (θ),α t (θ)}, the conditional p.d.f. of Q t is given by for any fixed t. By leveraging the conditional independence property of {Q t } t≥0 , we further write the log-likelihood function with respect to parameter θ as where˜ t (θ) = log( f t (θ)).
Next, we impose two assumptions of the model that we are investigating in.

Assumption 1.
Assume the parameter space Θ is a compact set of Θ s . Suppose the observations {Q t } n t=1 are generated from a stationary and ergodic DCW process with true parameter θ 0 being an interior point of Θ.
Due to the compactness of Θ, there exists a uniform upper and lower bound of the sequence (σ t (θ), α t (θ)) and (σ t (θ),α t (θ)) with θ ∈ Θ, which are denoted as (σ U , α U ) and (σ L , α L ) respectively. We next make an assumption on the lower bound of our sequence α t .

Assumption 2. The uniform lower bound α L is larger than 2.
It is noted that [64] studied the consistency and asymptotic normality of irregular MLEs from a group of distributions, including the three-parameter Weibull distribution, using i.i.d. observations. We extend the existing results to a dynamic model with dependent observations. In addition, Assumption 2 coincides with the results given by [64] that the classical asymptotic properties holds only if α > 2 under their static settings. We next theoretically characterize the consistency property of the local maximizer of likelihood functionL n (θ) in the following Theorem 2.

Theorem 3. (Asymptotic normality) Under the same assumptions in Theorem
where θ n is given in Theorem 2, and M 0 is the Fisher information matrix with the value estimated at θ 0 . Furthermore, the variance of the plugged-in estimated score Although the existence of the θ n and their asymptotic distributions are shown in Theorem 2 and Theorem 3 respectively, the uniqueness of MLE remains open. Proposition 1 provides a segmentary answer to the uniqueness of MLE. Proposition 1. (Asymptotic uniqueness) Define the set V n = {θ ∈ Θ|µ ≤ µ 0 + n } with Θ given in Theorem 2 and n = O p (n −α ), 1/α L < α < 1/2. Under the the same assumptions in Theorem 2, there exists a sequence of θ n = arg max θ∈V nL n (θ) such that we have θ n − θ 0 ≤ τ n , and P( θ n is the unique global maximizer of Note that, given observations {Q t } t≥1 , the parameter space of θ is defined as Θ n = {θ ∈ Θ|µ < Q n,1 } after ranking {Q t } t≥0 . One is able to see that V n ⊆ Θ n since we have Q n,1 − µ 0 ≥ O p (n −1/α L ). In addition, Proposition 1 states that with the probability tending to 1, θ n = arg max θ∈ V nL n (θ) is a unique consistent estimator of θ 0 over V n . The formal proof of this Proposition 1 can be found in Appendix A.5.
In Section 4, we use simulation examples to illustrate the numerical evidence of the established theory, and then in Section 5, we study smog dynamics in the Beijing-Tianjin-Hebei region.

Numerical Studies Using Simulations
In this section, we present two simulation examples. Example 1 does not involve weather variables. Example 2 contains weather variables. In each simulation, we generate the process using parameter values from real applications and i.i.d. standard Weibull random variates with lengths of 2000 and 5000 respectively. Then we fit the parameters by conditional MLE, starting with the paired values which we used to get the true MLE. The codes for simulation studies in this Section 4 and for model estimation and prediction in Section 5, which were developed in R software, have been put in GitHub for free assessments. (URL is https://github.com/MaxineYu/DynamicConditionalWeibullModel-DCWcodes.) Table 2, we list all parameter values (taken from the estimated values for a real application based on model (10)-(12) using data from 2014-2016, and report the estimated mean values and standard deviations from 500 repetitions of time series with lengths of 2000 (scenario SC1) and 5000 (scenario SC2) respectively.

Example 1. Simulations without weather factors: In
The following Table 2 presents our estimation results for simulations without considering weather factors. One can see that the mean values of 500 estimates are very close to their corresponding true values. The corresponding standard deviations also illustrate the significance of our estimation. Table 2 also presents the ratios between the standard errors with n = 2000 and those corresponding ones with n = 5000. It is noticed that the ratios are close to 0.632 = √ 2000/5000, which is consistent with our theoretical justification in Theorem 3 with √ n( θ n − θ 0 ) → d N(0, M −1 0 ). Next, we study simulations involving weather factors.

Example 2.
Simulations with weather factors. In Table 3, we list all parameter values (taken from the estimated values for a real application based on model (16)-(18) using data from 2014-2016), and report the estimated mean values and standard deviations from 500 repetitions of time series with lengths of 2000 and 5000 respectively.
We note that in the simulations, we used the observed weather values of the maximum temperature, minimum humidity, mode of wind-level, and modes of wind-direction (see Section 5 for more details) of a day (24 h).
Comparing the results related to parameter estimation and standard errors in Tables 2 and 3, we can see that the common parameters like θ = (µ, β 0 , β 1 , β 2 , β 3 , γ 0 ) in Examples 1 and 2 have comparable estimation accuracy. Moreover, the estimated parameter values associated with the weather variables are close to their corresponding true values, while the Monte Carlo standard deviations of the parameters associated with modes of wind-directions are relatively large with n = 2000, which suggests low estimation accuracy, though they are improved when n = 5000. This phenomenon is understandable, as the corresponding observations are zero inflated.
In summary, two simulation examples confirm that our estimation procedure (the conditional MLE) works for our proposed model parameter estimation, which provides an empirical support in our theoretical results and real data applications. The computational times of Examples 1 and 2 along with the computing environment are listed in Appendix B.

Inference without Weather Factors
With the established model and notation in Section 3, we first fit model (5)-(7) using the data from 2014-2019. Given the fitted parameters in the above model, we generate and plot the fitted σ t and α t dynamics in Figure 7. In this process, for every t ≥ 1, we apply real values of smog data Q t−1 in Equations (5) and (6) to generate fitted (σ t , α t ). In addition, initial values of (σ 1 , α 1 ) can be arbitrarily chosen (greater than zero), as their effects on the generated sequences are negligible by our theoretical justification given in Appendix A.  Figure 7a depicts dynamics of σ t which represents the scale of the maximum extreme smog. It is shown that σ t enjoys smaller values during the second and third season and enhances greatly during the first and fourth seasons. This seasonality matches one of the extreme smog values, which means that when the extreme smog goes severe, the maximum value among the whole region fluctuates more. The observed phenomenon is similar to the volatility sequence in GARCH (generalized autoregressive conditional heteroscedasticity) model ( [65]) which reflects the local volatilities. The σ t in DCW plays the same role as the volatility in GARCH model. From Figure 7b, one can see that values of α t converge to a constant very quickly once a fluctuation occurs. Moreover, considering the fact that the parameter γ 2 is neither significant nor stable in Monte Carlo simulations, the parameter γ 2 is set as zero here. As a result, the equation of α t , which only contains γ 0 and γ 1 this time, also converges to a constant. Based on the above arguments, we constrain γ 1 = γ 2 = 0 in Equation (6), which simplified to the following model (11), and our final model without the weather factor becomes (10)- (12): The estimated results are presented in Table 4. It is worth noting that the simplified model (10)-(12) and the model (5)-(7) have very closed maximum likelihood values (−5.84223 and −5.84348 respectively). The likelihood ratio statistic testing the two constraints (γ 1 = γ 2 = 0) equals 0.0025, which is much smaller than the critical value χ 2 (2) = 5.991. This confirms the validity of the two constraints; i.e., the simplified model should be used. Using the estimated parameters given in Table 4, we generate a sequence of fitted Q t (notated asQ t thereafter) based on Equation (12), in which σ t and α t are obtained by Equations (10) and (11) with the same procedure we used to plot Figure 7. QQ-plot(quantile-quantile plot) of the real data Q t against the generated oneQ t is presented in Figure 8a and the line graphs of these two series are given in Figure 8b. It is observed that points in Figure 8a distribute around the line of 45-degree, implying that the distribution of fitted dataQ t is close to the one of real Q t . Tendencies ofQ t also fits well with Q t in Figure 8b.
An alternative approach is also considered here. As discussed in Section 3, one may apply the AcF model proposed in [51] to capture the heavy-tailedness of financial data. To illustrate the validity of the AcF model in its application on smog data, it is also fitted here then the estimated parameters are used to generate a fitted sequenceQ t . The density curves of the generated valuesQ t from both the AcF model and our DCW model combined with the histogram of real data Q t are depicted in Figure 9a. The QQ-plot based on the AcF model are also given in Figure 9b. As shown in Figure 9a, the density curve of generated valuesQ t from the DCW model approximate that of real values better than that from the AcF model in both the tail length and location of the peak. Taking the tail as an example, simulated values from the AcF model which exceed 1500 or even 2000 µg/m 3 show much more frequency than that of real data. The same argument can also be put forward according to QQ-plot in Figure 9b, in which the points are obviously lower than the 45-degree line. This bias in fitting extreme high values does not happen in our DCW model in Figure 8a.
It is also worth noting that in Figure 9a, the histogram shows a salient at value 500 µg/m 3 , which is caused by the recording mechanism. Under some unknown circumstances, during the period from the end of 2015 to the start of 2016, the PM 2.5 values larger than 500 are truncated to 500; that causes an unusually high frequency at 500 in the histogram. Even so, the overall fitting efficiency of our DCW model looks reasonably acceptable using those truncated values 500 µg/m 3 by seeing the generated valuesQ t displayed by the red density curve in Figure 9a.

Inference with Weather Factors
As mentioned before, the smog extremes Q t often relate to weather factors. We extend Equation (5) to the following (13); then model (5)-(7) becomes (13)- (15): where and Wdd i,t represent the maximum temperature, the minimum humidity, the mode of wind-level (taking values: 1, 2, 3, 4, . . . ), and the mode of wind-directions of a day (24 h) of the related city respectively. Seven dummy variables are used to represent eight different wind-directions. Considering the lagged effects and non-negligible impacts that weather factors have on the scale parameter of smog data, the values of weather factors from the last day are used to generate current σ t . For simplicity, the linear function of weather variables is used here. It can also be extended to non-linear functions to gain higher modeling efficiency. Adding weather factors in model Equation (6) can also be studied, but may lead to a much more difficult parameter estimation process due to the increase of model complexity and dimension of the parameter space. The advanced modeling and analysis will be deferred to our future projects. After fitting model (13)-(15), the fitted parameter values are used to generate (σ t , α t ) by plugging in real values of the (t − 1)th day in model (13) and (14), where initial values of (σ 1 , α 1 ) are chosen to be greater than zero. In Figure 10a, the estimated σ t shows similar seasonality as mentioned in Section 5.1. Besides, from the tendency of estimated σ t from 2014-2019, it can be seen that both the value and fluctuation of σ t decrease after 2017 (except that the values larger than 500 during the period from the end of 2015 to the start of 2016 are truncated), showing improvements in air quality. Figure 10b shows that the sequence of α t shrinks to a constant value quickly. Based on this observation, we set both γ 1 and γ 2 to be zero in Equation (14), which is simplified to the following Equation (17) (the same to Equation (11)). The final model with weather factors is specified as follows:  Table 5 (The computational times of Tables 4 and 5 along with the computing environment are listed in Appendix B). The last term in Equation (16) is an increasing function of Q t−1 and a decreasing function of (Tp t−1 , Hu t−1 , Wds t−1 ), which coincides with the fact (see Section 2.4) that the scale of the smog tends to be higher when the prior day has higher extreme PM 2.5 values, together with lower temperature, lower minimum humidity, and lower wind-speed than a normal day. The estimated parameter values are also used to generate a sequence of fitted valuesQ t using the same procedure as is the case without weather factors; thenQ t and real Q t are plotted in Figure 11 with the left panel being the QQ-plot and the right panel being the line graph. The QQ-plot in Figure 11a shows that the simulated sequence and its true values almost distribute in a line of 45-degrees. Moreover, the scale ofQ t in Figure 11b fits the real values better than it does in Figure 8b, showing more precise description on the overall pattern of the real scenarios. The root mean squared errors betweenQ t and real values are also computed in order to compare the two models. They are 155 for the model without weather factors, and 153 for the model with weather factors. From the above, it can be concluded that the model, including weather factors, performs better than the model without weather factors. Considering the truncated extremes during the period from the end of 2015 to the start of 2016, the fitted values from the models with weather factors may be good choices for those missing values. It is also interesting to notice that the fitted σ t in Figure 10a andQ t in Figure 11b have very similar variation tendencies.
Finally, the predictability of our introduced models is explored. Two estimated models (10)-(12) and (16)- (18) are used respectively to forecast the regional daily PM 2.5 extreme values from January to March in 2020.
Similar results are obtained from the two models. The results from the second model are used to illustrate the prediction process. For the given t-th day, the predicted σ t (see Figure 12a) is generated via Equation (16) with the one-step-ahead prediction method (the (t − 1)-th day's real Q t−1 and weather factors are used). Then we predict the t-th day's smog value Q t based on our obtained scale σ t and fitted parameters µ, γ 0 according to Equation (18). The mean values of the 500 repetitions based on the simulated standard Weibull random variables of Y t are taken as our final predicted values, shown as the red line in Figure 12b.
Compared Figure 12a with the fitted σ t from 2014-2019 (Figures 7a and 10a); both the value and fluctuation of predicted σ t in 2020 decrease to some extent, showing that the fluctuations of extreme PM 2.5 become smaller. Figure 12b shows that our results give a relatively good prediction of the future variation of extreme smog in that the real values almost lie in the 95% prediction intervals. Specifically, our predictions capture the characteristics of extreme smog lying in 200 µg/m 3 -500 µg/m 3 well, although under certain circumstances, the real values exceed this bound due to irregular random factors. These two figures show a similar tendency; i.e., the predicted σ t and the regional extremes co-move to some extent. Moreover, the values and fluctuations of both series decline more in February and March than in January 2020.

Conclusions and Discussion
The Beijing-Tianjin-Hebei region is one of the key regions suffering from the severest extreme smog in China. However, it is difficult to model all the PM 2.5 monitoring stations at the same time; more importantly, the regional extremes are what really matters for this region in order to conduct a joint control strategy. To describe the potential dynamic variation of the regional extremes, this paper integrated classical extreme value modeling and dynamic modeling into a dynamic conditional Weibull distribution modeling and analysis framework, in which the worst scenarios observed among multiple locations in each day during 2014-2019 are described. In addition, weather factors were introduced in the model to gain higher modeling efficiency. The proposed model performs more precisely on fittings compared with other previous models dealing with maxima with autoregressive parameter dynamics (taking the AcF model as an example).
Using the proposed model, the fitted scale parameter and fitted regional smog extremes are obtained and given to show the variation tendency during 2014-2019. It indicates that the extreme smog in the Beijing-Tianjin-Hebei region shows strong seasonality that both its extreme values and fluctuation are higher in the first and fourth season than the second and the third season. It can also be seen that the extreme values and fluctuations during 2014-2017 are almost the same; i.e., the regional extremes did not improve much during this period, although other works show that a lot cities in the Beijing-Tianjin-Hebei region have experienced lower PM 2.5 levels since 2014 [35]. However, the extreme values and fluctuations decrease after 2017, showing some improvements in air quality. These findings imply that if the central/local government wants to conduct coordinated joint PM 2.5 control, the strict treatment strategy must be maintained as long as the regional extremes remain at a high level. It is worth noting that although the regional extremes larger than 500 µg/m 3 during the period from the end of 2015 to the start of 2016 are truncated in the original data, they can be fitted by our model, which might be a promising choice for estimating the missing data.
The proposed models can be used to predict the maximum PM 2.5 level among all monitoring stations in the Beijing-Tianjin-Hebei region. The results of one-step-ahead prediction show that the real values almost lie in the 95% prediction intervals and our predictions capture the variation tendency of extreme smog lying in 200 µg/m 3 -500 µg/m 3 well. Considering that the widely-used meteorologic models for forecasting often require more computation complexity, the one-step-ahead prediction of our model is especially suitable for the short-term forecast and quick response due to its simplicity, operability, and accuracy. More importantly, the predictions of regional extremes rather than single stations are greatly useful to the regional joint early warning system. Under this mechanism, once the future PM 2.5 level of a single station rather than the average level of the whole region exceeds the breakpoints of a certain grade, the early warning will be activated and the corresponding treatment measures will be taken. In fact, in practical applications, when the current PM 2.5 values rise rapidly, it will attract public attention. We suggest that most strict measures should be taken when our prediction is close to 500 µg/m 3 . In this relatively stricter way, the regional coordinate PM 2.5 prevention and control can be better performed.
This paper gives some theoretical implications as well. Our DCW model can be extended to contain log σ t−q 1 and log α t−q 2 , which is very similar to the GARCH(q 1 , q 2 ) model. In our application, we have already fit the data very well; there is no need to increase the number of the parameters in our model, which may increase the instability. The advanced method may be more useful in some other situations.
There also exist some needs to investigate dynamics of the sequence of µ, and to add some smooth penalty functions onto the equation in order to guarantee that µ t is always below the value of Q t .
It is worth mentioning that existing non-parametric techniques based on GAMs (generalized additive models) or MARS (multivariate adaptive regression splines) discussed in [66][67][68][69][70][71] are also widely used for fitting nonlinear time series data. In addition, a discussion about joint modeling of the scale and shape parameters of the Weibull distribution with GAMs methodology can be found in [68,71]. Certainly, some comparisons of our model with the existing non-parametric techniques based on GAMs or MARS can give the readers more information about the model suitability in applications. We will implement such comparisons in future projects. In our simplified model, we did not consider the interactions between the weather factors. However, in reality, interaction effects can exist. Sometimes, they also play an important part. There still remains more work to be done to study the interaction effects. We only added the weather factors onto the equation of scale parameter because they always play a key role. It would be interesting to add the weather factors onto the dynamic tail parameter equation. We shall explore this idea in a different project.  In this subsection, we would like to give our proof for Theorem 1 and our demonstration is built upon some conclusions in [72].
Proof. Without loss of generality, the parameter µ is set as 0, and our model (DCW) defined in (5)-(7) (up to the sign filippings of parameters which does not affect our conclusion) becomes: Under assumptions given in Theorem 1, without loss of generality, we further assume parameters in Equations (A1) and (A2) satisfy β 1 , γ 1 , β 2 , γ 2 > 0. We then have that which holds for any z 1 , z 2 that satisfies 0 < z 1 < β 2 and 0 < z 2 < γ 2 . Next, after denoting X t = (log σ t , log α t ), we then define: Hence, we obtain where {Y t } t≥0 is a sequence of i.i.d. unit Weibull random variables. Following the terminologies of [72], we obtain that T(·) has a compact attractor Λ=( , γ 0 −z 2 1+γ 1 ). In other words, for any x ∈ R 2 , we have T n (x) → Λ as n → ∞. Further, we set G 0 as the area G 0 = ( β 0 −β 2 which is an open area in R 2 . Then we are able to prove that the process {X t } meets five conditions given in Theorem 1 of [72]. Condition (a: Λ has a dense orbit) is proved, since for any x in R 2 , we have T n (x) → Λ as n → ∞ by our argument above. Conditions (c: Lipschitz continuous over G 0 ) and (e: E[S(X n , Y n )|X n = x] is uniformly bounded on G 0 ) are satisfied because the area G 0 is bounded and S(X n , Y n ) is continuous in Y n . Next, we will verify the condition (b: exponentially attracting) by leveraging our conclusion from the following Lemma A1. 2 Lemma A1. The area G 0 we have constructed above is an absorbing area for X t .
Proof. The proof of Lemma A1 is deferred to Appendix A.1.1. 2 For the remaining part, we need to check the condition (d), which is demonstrated by our Lemma A2 below.

Lemma A2.
For all x ∈ G 0 , 0 is in the support set of S(x, Y t−1 ) . Then for all x ∈ G 0 , there exists a positive constant r, s.t. the second step transition probability of X t , P 2 (x, dy) has an absolutely continuous component, of which the probability density function is positive over B(T 2 (x), r) with B(x, r) being the open ball in G 0 with center x and radius r.
Proof. The detailed proof of Lemma A2 is given in Appendix A.1.2.
2 Now we have proved those five conditions given in [72]. Therefore, the Markov chain {X t = (log σ t , log α t )} t≥0 defined in (A1) and (A2) possesses the characteristic of geometrically ergodic. Moreover, {X t } t≥0 defined in G 0 is irreducible; the Markov chain {X t } t≥0 is also stationary in G 0 . Thus, we claim our conclusion of Theorem 1.
Next, we prove the following Lemmas A1 and A2, which are two major ingredients in proving our Theorem 1.
Proof. It is easy to verify this conclusion, if we let log α t−1 < We have for any fixed X t , that there always exists a 0 Considering the value of z 1 and z 2 we have set above, we obtain that given X t , Y t is the only solution to S(X t , Y t ) = 0. Therefore, for any x ∈ G 0 , 0 is always in the support of |S(x, Y t−1 )|. Thus, we have completed the first part of Lemma A2.
By the inverse function theorem, one knows that there exists an open neighborhood at X t+1 = F X t−1 (Q , Q ) = T 2 (X t−1 ), denoted by B(T 2 (X t−1 ), r(X t−1 )), and another open neighborhood at (Q , Q ) s.t. there is a bijection between them. It is worth noting that the value of X t−1 does not affect the radius of that open neighborhood, since it is determined by the landscape around point (Q , Q ). Thus, for all X t−1 , we can write B(T 2 (X t−1 ), r(X t−1 )) as B(T 2 (X t−1 ), r) for some constant r.
Then given X t−1 ∈ G 0 , we prove that there exist one-to-one maps between every value in B(X t+1 , r) with values in some open neighborhood of (Q , Q ), so that it can also form a bijection with the neighborhood of (Y t−1 , Y t ). According to the formulation of the density function of Weibull distribution, we see that the function P 2 has a positive density over the neighborhood of (Y t−1 , Y t ) given X t−1 , and due to the existence of the bijection, P 2 (x, dy) has an absolutely continuous component whose probability density function is positive over B(T 2 (x), r). Then we complete our proof of the second part of Lemma A2.
2 Before proceeding our proof for Theorems 2 and 3 and Proposition 1, we would like to remind the readers of some assumptions and notation given in Section 3.2. We assume the true parameter θ 0 is the interior point in Θ, a compact subset of Θ s ; and the observations are generated from a stationary and ergodic DCW with the true parameter θ 0 . In addition, we denote (σ t (θ), α t (θ)) as the sequence which is generated from the true initial value (σ 0 1 , α 0 1 ) and (σ t (θ),α t (θ)) as the one generated from an arbitrary initial value (σ 1 ,α 1 ) and θ ∈ Θ. Moreover, due to the compactness of Θ, there exist uniform upper and lower bounds of the sequence (σ t , α t ) which are denoted as (σ U , α U ) and (σ L , α L ) respectively. Next, in the following part of our proof, we use notation L n (θ) to denote the conditional likelihood of θ and notationL(θ) defined in Equation (9). In the next Appendix A.2, we will first prove several technical lemmas which build blocks for proving Theorems 2 and 3 and Proposition 1.

Appendix A.2. Technical Lemmas
First, we will illustrate the identifiability of our model by the following Lemma A3.
s. for all t, then θ = θ 0 . Here a.s. is for the infinite product space generated by {. . . , Proof. The proof of Lemma A3 is the same as the proof of Lemma 3 in [51]. 2 In the following Lemma A4, we will discuss some characteristics of the score function as well as the Fisher information matrix, based on the true initial value (σ 0 1 , α 0 1 ), at the true parameter θ 0 .

Lemma A4. Under the conditions in Theorem 2, we get E
∂θ∂θ T l t (θ 0 )], in which M 0 is the Fisher information matrix at θ 0 . M 0 is also well defined and positive definite.
Proof. For the first part: E θ 0 [ ∂ ∂θ l t (θ 0 )] = 0, after interchanging the integration operator with differential operator we obtain Note that for any x ∈ (0, 1) and α L > 2 (we will assume this in the next lemma) there exists a c > 0 s.t. | log(x)| ≤ c x ; then it is easy to find a g(q t ) s.t. | ∂ f t (q t ,θ|σ t ,α t ) ∂θ | ≤ g(q t ) and g(q t )dq t < ∞ for all θ ∈ (θ 0 − , θ 0 + ) and some > 0. Then we get ∂ f t (q t ,θ 0 |σ t ,α t ) ∂θ dq t = ∂ ∂θ f t (q t , θ 0 |σ t , α t )dq t = 0 by dominate convergence theorem, which gives: E θ 0 [ ∂ ∂θ l t (θ 0 )] = 0. Next, after assuming those regular conditions (we can change the integration with the second derivative for the p.d.f) are satisfied by our model, we get: in which we have As the sequence ∂ 2 ∂θ∂θ T l t (θ 0 ) is strictly stationary when t tends to infinity, M 0 is independent with t and is also well defined (M 0 < ∞).
In order to prove that M 0 is positive definite, we observe that there does not exist a c ∈ R 9 s.t. c T ∂ ∂θ l t (θ 0 ) = 0 a.s. by following our Lemma A3. 2 According to our expression of the first and second order derivatives of L n (θ 0 ) given in Appendix A.6, the following Lemma A5 shows that their expectations exist.
Lemma A5. Under the assumptions given in Theorem 2, we have (a) for any α > 0, 1 Proof. Here we leverage the fact that the scale sequence {σ t } and the tail sequence {α t } enjoy the boundedness condition stated in Section 3.2. We obtain ). Then (a) is established by the pointwise ergodicity Theorem [73]. As for any positive integer k. Then (b) can also be proved by using pointwise ergodicity Theorem [73].
Next, we will prove (c). We know that follows the unit Fréchet distribution. After restricting 2 < α L ≤ α U , we obtain our conclusion by utilizing the fact that E[Y −r t ] exists when 0 < r < 1 . 2 The following Lemma A6 states that the convergence rate of min(Q t ) − µ 0 is larger than n −r , r > 1/α L . Lemma A6. Under the conditions given in Theorem 2, we have Q n,1 − µ 0 ≥ O p ((n) −1/α L ).
Proof. For the proof of (a), we obtain |S α α max{(Q n,k − µ n ) α−1 , (Q n,k − µ 0 ) α−1 }, from the remark given above. Thus, we conclude that (a) |S α n (µ n ) − S α n (µ 0 )| ≤ O p (τ n ) holds with probability going to 1. The Proof of (b) is similar with the corresponding part in Lemma 7 given in [51]. For (c), when m = 1, first, we separate the inequality into two parts, |S α Then for the first part, we get 1 As for the second part, we obtain 1 . This is the case when µ n ≥ µ 0 . For µ n < µ 0 , the process of our proof is similar so we just omit the details here.
for the first part, we obtain 1 For the second part, we next use some conclusions that we got in the process of proving the case of m = 1. So the second part is proved as follows: in which P j (x) denotes a polynomial of order j.
It is easy to see that the first part can be regarded as O p (τ n ). As for the second part, we have . Then the proof for Lemma A7 is completed after applying the Holder's inequality. 2 In the next Lemmas A8 and A9 we will prove that the supremum of the difference between the first n values of σ t and σ 0 t (so as α t and α 0 t ), which are generated by using arbitrary parameter θ in the neighborhood of the true value and the true one θ 0 respectively, converges at the rate of τ n . This convergence rate also holds for their partial derivatives. Lemma A8. Denote Φ = (γ 0 , γ 1 , γ 2 , γ 3 ) and Φ 0 = (γ 0 0 , γ 0 1 , γ 0 2 , γ 0 3 ), if Φ − Φ 0 < τ n and τ n 0. under the conditions in Theorem 2, we have: Proof. Here we briefly illustrate our proof of (a), the proofs of (b), (c) are almost the same. The domain of α t is bounded so the function exp(·) defined on a compact set is Lipschitz continuous. Then it is equivalent to prove: sup 1≤t≤n | log α t − log α 0 t | = O(τ n ). As we can express log α t as The rest parts are similar with the corresponding parts in Lemma 8 [51]. 2 Lemma A9. We denote Ψ = (β 0 , β 1 , β 2 , β 3 ) and Ψ 0 = (β 0 0 , β 0 1 , β 0 2 , β 0 3 ). Under the conditions in Theorem 2, if we have Ψ − Ψ 0 < τ n and τ n 0, we obtain Proof. The proof of this Lemma is almost the same of the proof of Lemma A8. 2 The next Lemma A10 will build blocks for the proof of Lemma A11.
Lemma A10. Suppose we have τ n ∼ n −r and sup 1≤t≤n |α t − α 0 t | = O(τ n ), where {α t } and {α 0 t } represent two different sequences of tail index that are generated basing on different parameters (Φ, Φ 0 and Φ − Φ 0 < τ n ) with the true initial value. Under the conditions in Theorem 2, we have uniformly over |µ n − µ 0 | < τ n . The same result also holds for Proof. We will only give our proof for the case of 1 , the proofs of other two cases are similar. Without loss of generality, we assume α 0 t > α t , then we obtain in which α * ∈ (α t , α 0 t ). 2 Together with our conclusions from Lemma A7-A10, we prove L n ( θ n ) − L n (θ 0 ) → p 0 when θ n − θ 0 = O p (τ n ) in the following Lemma A11.
Under the conditions in Theorem 2, for all second order derivatives of L n ( θ n ) we have ∂ 2

Proof.
Here we just give our proof for the case of ∂ ∂µ 2 L n ( θ n ); the proofs of remaining cases are similar. Note that the first and second order of the partial derivatives of L n (·) are measurable functions of the stationary and ergodic series {Q t } t≥0 , so they are also ergodic and strictly stationary. By the pointwise ergodicity Theorem [73], we have ∂ ∂θ i ∂θ j L n (θ 0 ) → p −m θ i θ j (θ 0 ). Thus, we next need to prove that ∂ ∂µ 2 L n ( θ n ) − ∂ ∂µ 2 L n (θ 0 ) → p 0 holds. By definition, we obtain If I is greater than zero, we have When µ 0 ≤ µ n , we get From Lemma A6, we have (nY n,1 ) 1/α L → p 1 and τ n ∼ n −r , 1/α L < r < 1/2, so it holds that Lemma A13. Under the conditions of Theorem 2, we have 1 n ∑ n t=1 |(Q t − µ n ) α t − (Q t − µ n )˜α t | → p 0, uniformly over |µ n − µ 0 | < τ n , where τ n ∼ n −r , r > 0. The same result holds for Proof. By mean value theorem, we have Thus, we claim our conclusion for Lemma A13. 2 Next, we would like to discuss Lemma A14 which can be utilized to prove Theorems 2 and 3.

Lemma A14.
Under the conditions in Theorem 2, we have (a) : for all second order derivatives ofL n ( θ n ), we Proof. For the proof of part (a), first, we see that ∂ 2 ∂θ i ∂θ jL n (θ) − ∂ 2 ∂θ i ∂θ j L n (θ) → p 0 holds for any θ by using our conclusions from Lemmas A12 and A13. In addition, we also know that ∂ 2 For the proof of part (b), here we just prove the case of ∂ ∂µL n (θ 0 ) and the proof of remaining parts are similar. For convenience, we set g(σ t , α t ) = α t σ α t t , then we are able to verify that |g(σ t , α t ) − g(σ t ,α t )| ≤ C · C t−1 b holds by utilizing Lemma A12 after assuming σ L is greater than zero. Next, we obtain The rest parts of our proof of Lemma A14 are similar with the corresponding proof (Lemma 14) in [51], so we omit the details.
2 We will leverage on the martingale difference to prove the following Lemma A15 which displays the asymptotic distribution of the score function. Proof. We use CLT for martingale difference [74], then we have: So for any λ ∈ R 9 , {λ ∂l t (θ 0 ) ∂θ , F t } t is a square-integrable stationary martingale difference. Note that the sequences σ t , α t and Q t are both stationary and ergodic, so the sequences ∂α t ∂Φ , ∂σ t ∂Ψ are also strictly stationary and ergodic. We also know that ∂l t (θ 0 ) ∂θ i (for i = 1, · · · ,9) are generated from α t , σ t , Q t , ∂α t ∂Φ , ∂σ t ∂Ψ so they also follow the properties of strict stationarity and ergodicity. Then by CLT and Wold-Cramer device in [74], we can finally get the conclusion that Lemma A15 holds. 2 Given these technical Lemmas, we next give our proof of our Theorems 2 and 3 and Proposition 1.
According to the Lemma 5 given in [64], we obtain that there is a local maximum over the open area t 2 + y 2 < 1 with probability going to 1. Thus, there exists a sequence of local maximizer θ n of L n (θ) s.t. θ n → p θ 0 and θ n − θ 0 ≤ τ n , where τ n ∼ n −r and 1/α L < r < 1/2. . The last inequality follows from the fact n 0, then with probability going to 1, we have {φ|φ ∈ Θ µ n } ⊆ {φ|φ ∈ Θ δ/2 }. Following similar proof procedures given in Lemmas A13 and A14, we are able to see thatL n (θ 0 ) = L n (θ 0 ) + o p (1) → p E θ 0 [l 1 (θ 0 )] holds. Then the rest proof of the first part follows from the proof of Proposition 2 in [48].
Utilizing our conclusion from (I), we obtain that the global maximizer ofL n (θ) over V n is located in Θ δ * c n . It is known from Theorem 2 that with probability going to 1, there exists a sequence of θ n of local maximizer ofL n (θ) s.t. θ n − θ 0 ≤ τ n , where τ n = O p (n −r ), and 1/α L < α < r < 1/2. So we have P( θ n ∈ Θ δ * c n ) → 1. Combining with our proof of (II) and using the conclusion of Theorem 2.6 in [75], we finally claim the conclusion of this proposition.
2 In the next subsection we will illustrate expressions of the first order as well as the second order derivatives of our likelihood function.
Appendix A.6. First and the Second Order Partial Derivatives of l t (θ) In this section we denote Φ = (γ 0 , γ 1 , γ 2 , γ 3 ) and similarly, we set Ψ = (β 0 , β 1 , β 2 , β 3 ) and the likelihood function as follows: The first order partial derivatives of l t (θ) are give by: The second order partial derivatives of l t (θ) are given by:

Appendix B. Algorithms Computation Details
The computational time for representative algorithms are listed in Table A1.  Table 4 Algorithms for Table 5 Computational All the algorithms are running on a server with Xeon 2.4GHz and 16GB of memory. Especially, algorithms for Tables 4 and 5 are running using parallel computing of 16 processes.