Precipitation Modeling for Extreme Weather Based on Sparse Hybrid Machine Learning and Markov Chain Random Field in a Multi-Scale Subspace

: This paper proposes to apply a Markov chain random ﬁeld conditioning method with a hybrid machine learning method to provide long-range precipitation predictions under increasingly extreme weather conditions. Existing precipitation models are limited in time-span, and long-range simulations cannot predict rainfall distribution for a speciﬁc year. This paper proposes a hybrid (ensemble) learning method to perform forecasting on a multi-scaled, conditioned functional time series over a sparse l 1 space. Therefore, on the basis of this method, a long-range prediction algorithm is developed for applications, such as agriculture or construction works. Our ﬁndings show that the conditioning method and multi-scale decomposition in the parse space l 1 are proved useful in resisting statistical variation due to increasingly extreme weather conditions. Because the predictions are year-speciﬁc, we verify our prediction accuracy for the year we are interested in, but not for other years. Author Contributions: Conceptualization, M.-H.L. and Y.J.C.; methodology, Y.J.C.; software, Y.J.C.; validation, M.-H.L. and Y.J.C.; formal analysis, M.-H.L. and Y.J.C.; investigation, M.-H.L. and Y.J.C.; resources, M.-H.L.; data curation, writing—original and Y.J.C.;


Introduction
Precipitation modeling is important for a wide range of applications. Knowing future water conditions is key to many fields. For example, in agriculture, high kinetic rainfall during the flowering stage will devastate subsequent crop yields. Predicting the precipitation distribution and foreseeing such events can facilitate cultivation planning to prevent yield loss. In this case, accurate rainfall profiling is more important than just estimating the recurrence period. However, most precipitation models fail for the occurrence of extremity in weather time series. Therefore, a reliable year-round prediction algorithm is required in the emerging extreme weather for applications that rely on precipitation prediction.
Statistical prediction is possible only when certain repetitive patterns or invariant properties are observed in specific subspaces. Therefore, the occurrence of precipitation may exhibit their similarity in certain subspaces but dissimilarity in other subspaces. Such versatile similarities in subspaces should be treated differently.
Two types of precipitation modeling are often envisioned: a history-dependent time series and a time-independent probability distribution. The time series model aims to reconstruct precipitation patterns for a period, while the probability model concentrates on the occurrence of precipitation events in a period. Classical Neyman-Scott clustering theory incorporates the two precipitation models into a stochastic point process [1,2]. Future rainfall patterns can be predicted through simulation based on the established mathematical structure and estimated parameters of the prediction model in a stochastic point process [3][4][5]. Stochastic time-series are frequently modeled using Markov processes in the Neyman-Scott model. Rainfall simulation is often built on variations of hidden Markov models [6][7][8]. Therefore, a research gap exists in the prediction method under extreme weather conditions. The process of modeling the precipitation under extreme rainfalls, time-independent probability distribution approaches are often used because the assumptions of stationary and time-invariant distribution between years for ordinary time-series analysis are difficult to be maintained. Precipitation analysis under extreme conditions often apply depthduration-frequency curves [9,10], extremograms [11], internal climate variability [12], and selected covariates [13] for the probabilities of rainy events. For agricultural and hydrological applications, inter-monthly variability needs to be handled in time series prediction. The synthetic weather generator can be assisted by a Markov chain for precipitation occurrence in the auto-regressive model for daily and monthly generators [14]. However, such methods still cannot make a specific prediction on a date basis. Without specific prediction on a particular date, a water resource manager is difficult to make a working schedule.
Our method of learning ensembles takes inputs from the vectorized yearly precipitation. When the value of a time series element is a vector, then the time series is a functional time series. A functional can be deemed a "function of a function"; that is, it is a mathematical structure for which an entire function is a value in a Hilbert space [15]. Therefore, predictions obtained from a functional time series perfectly fits our purpose of year-round prediction [16]. Prediction in functional time series possesses the advantage of exploiting the coherency within the subgroups of a time series. That is, the Markov behavior depends not only on the immediate successors but also on the repetitive patterns of the previous years. Although the applications of functional time series have been developed for a long time [17][18][19], the exploitation of coherency property is recent. Exploiting the subgroup covariance structure of the functional auto-regression, drought intervals can be suitably predicted [20].
As the escalated variations in the ratio of dry to wet days, the advantages of conventional functional time series in coherency become insufficient to accommodate extreme weather conditions. The yearly functional time-series can be decomposed into smooth seasonal components and non-smooth stochastic components, which are no-where differentiable manifolds or non-stationary time series. By introducing a self-similarity predictor in the functional time series, the prediction can overcome certain non-stationary time series [21,22] The non-stationary part of time series has been addressed mathematically. The nonsmooth manifolds in Lipschitz and Riemannian domains have been developed [23] and applied to algorithm levels. For example, Tsaig and Donoho [24] optimize wavelet coefficients in a multi-resolution non-smooth space. Ledyaev and Zhu [25] and Chen [26] develop a computation method to approximate a non-smooth manifold on a smooth manifold. With the mathematical foundation, low dimensional smooth manifolds can be projected to high dimensional non-smooth manifold [27].
In addressing the non-stationary part of the time series in extreme weather conditions, simulation in Markov chain random field (MCRF) and transiograms can be used to condition the manifold and improve the consistency of subsequent predictions. The method of MCRF has been used widely in geostatistics for modeling categorical/discrete fields in two or three spatiotemporal dimensions. This study treats the time-frequency domain as a spatiotemporal domain. That is, when a time series is transformed into a series of wavelet coefficients, the time series then has spatial neighbors in time and scale. The non-smooth manifold can be imputed in this time-frequency domain [24]. MCRF is an ideal tool to make the imputation. However, because of the property of Markov chain, the wavelet coefficients must be quantized to categorical values before application of MCRF, and the neighbors are grids of in the time-frequency coordinate. By such conversion, the transitional probabilities can be estimated empirically as a standard process of a continuous Markov chain Li and Zhang [28]. For increasing the predictability, the neighborhood can be further defined on an abstract mathematical structure, called multi-point statistics [29]. In estimating the transiograms, multi-points and auxiliary variables can significantly increase the kriging performance [30,31]. Similar to our Ising dipole structure, Ding et al. [32] develop a generic dipole lattice on a graph and significantly accelerate the learning speed of the random field.
At the same time, our two-stage method involves machine learning models for precipitation modeling. The ensemble method is a hybrid method with multiple learners on subgroups of training datasets. Ensemble learning with diversified datasets and learners is considered useful [33]. Hydrological applications have widely adopted this machine learning method [34][35][36]. The long-term prediction under the situation of climate changes [37][38][39] and rainfall downscaling under extreme weather conditions, as in Pham et al. [40], Diez-Sierra and del Jesus [41], has also been investigated. The neural network method in convolutional and long short-term memory is also popular [42,43]. However, due to the special characteristics of neural network, our MCRF cannot fuse with this method. A technique in ensemble learning is called bootstrap aggregating (bagging), which is effective for establishing diversity, but rare events still cause significant bias in the dataset [44].
To make prediction stable, we dynamically split the time series from all weather stations into disjoint bands. The ensemble learning method is used to forecast on a sparse multi-scale functional time series, for which the learning goal is toward an optimally weighted combination of goodness-of-fit (GOF) values for future precipitation patterns.
Our findings suggest that the proposed MCRF conditioning approach resists statistical variation under increasingly extreme weather conditions. Based on a conditioned decomposition in the sparse l 1 space, the machine learning model correctly predicts the precipitation distribution at the target year. The results reveal that our year-round prediction is accurate and specific. That is, differing from simulation methods, our prediction is generated only for the target year, not on any-other years. This year-specific accuracy continuously remains high with increasingly extreme weather conditions.

Two-Stage Prediction and Conditioning
Given a sequence of observations {y 1,1 , . . . , y 1,365 , . . . , y τ,1 , . . . , y τ,365 }, in the dual index set by (year, day), the observations can be expressed as the following functional time series: The functional Y(t) = {y 1 (t), . . . , y 365 (t)} at epoch t takes values on V, where t represents years, and V represents the vector values taken on a fixed period of 365 days for years {1, 2, . . . , T}. The functional Y(·) is considered an element in the l 2 Hilbert space over V [45,46]. Multiple methods are applied in the ensemble method to maximize the accuracy of prediction and validated on multiple repetition bands for each subspace. This study develops an ensemble of methods in a series of operations, as illustrated in Figure 1.

Prediction Models in the First Stage
The first stage prediction model f is established to model the functional Y(t) in (1). Therefore, the functional at time t + 1 is a function of Y at time t plus a fine-scale, uncorrelated modeling error (t): Model f in (2) is an l 2 −operator that needs to be estimated from the sequence of observation data {Y(t)|t = 1, . . . , T}. The model f can be express by a kernel function β and Y i , the i-th component in the functional Y, and Let where ψ k (t) and c k are the k-th basis function and coefficient vector in V, respectively. In addition, let β ij = T s=0 ψ i (s)b tij ψ j (s)ds. The basis function can be bind to a matrix Ψ(t) = (ψ 1 , . . . , ψ K ) and the component of kernel function can be bind to a matrix B = {b ij |i, j ∈ {1, . . . , 365}}. Therefore, we write (3) to where C = (c 1 , . . . , c 365 ) and Z = ( ..,365 . Estimation of these coefficients in B is straightforward in an equivalent Yule-Walker equation [47] or a Nadaraya-Watson kernel estimator [48]. The method of partial least squares can reduce computation time [49]. This study adopts the standard method of partial least squares.

Learning Ensemble in the Second Stage
Multiple methods are hybridized into an ensemble for each subspace band to ensure that the transformation results are stable. Learners are designed to take either the point day-valued or the functional year-valued precipitation time series from the first stage. To avoid over-fitting during the training phase, the combined regression at the second stage includes all the prediction results from the first stage.
Ensemble prediction modeling is also multi-scaled in the form of discrete wavelet bands. Therefore, individual learners in each band can adapt to the specific data characteristics. Each learner is trained to fit the model matrix Ψ i , based on the pairs of records (x i , y i ) in training and testing results of the first stage model (5). The challenge encountered by individual learners during estimation involves the near-singular moment matrix X X; this problem is often resolved through regularization by adding a positive perturbation constant or a Lagrange multiplier λ to the diagonals.
To avoid over-fitting, we further generalized the second stage optimization problem as follows (5) [50].
The optimization (5) is regulated by the Lasso (Least Absolute Shrinkage and Selection Operator) penalty (α = 1) or the ridge penalty (α = 0). The optimization in (5) takes advantage of the l 1 norm in evaluating solutions for an ill-posed problem.
Due to the properties of the l 1 space, an optimization (e.g., min x ||x|| 1 , s.t. (4)) will minimal non-zero solutions, and it will yield strong reconstruction performance if several sparsity properties, such as restricted isometry and incoherence properties, are satisfied [51][52][53]. In our multivariate transformation matrix Ψ i , the lasso penalty tends to reduce the coefficients of less important covariates to zero, thus generating more zero solutions and fitting our assumption about the ratio of wet to dry days.

Markov Chain Random Field Conditioning
In the problem of extreme weather conditions, the coherency among periods of the time series must be established before applying the learning algorithm. Conventional prediction methods rely on the assumption that weather conditions repeat themselves with a certain pattern. Due to the long-memory and fractal properties of precipitation, its occurrence may exhibit similarity in certain subspaces. However, because this repetitive pattern may vary even within the proximal region, the model must be carefully fine-tuned to ensure that the particular realizations of a stochastic process will correctly express the estimation. Increased variation hiders the prediction task under extreme weather conditions. Thus, this study proposes a special method to restore coherency and mitigate the influence of extreme precipitation conditions.
Most statistical approaches establish the model as sampling is performed. Because the process of modeling itself is also an unknown process, some invariant properties could be misguided by a particular pattern of sampling or a realization of precipitation. This sampling variation is mostly observed in high-frequency subspaces or in non-smooth manifolds. A yearly functional time-series can be decomposed into smooth seasonal components and non-smooth stochastic components, which are no-where differentiable manifolds. If the data in these manifolds can be "resample" in certain sense, the prediction consistency can be improved. That is, the validation accuracy will close to the training accuracy. This study utilizes Markov chain random fields (MCRFs) and transiograms to condition the sample values. The transition probabilities of the MCRF are trained based on the entire data history, and the distance lag is defined in the spatio-temporal space. Therefore, MCRF conditioning stabilizes the coherency property, based on the distribution of dry and wet days in the source years on a multi-scale time-frequency decomposition space.
The conditioning method employs the method of MCRF. An MCRF models the spatial dependencies between the vicinity nodes as a Markov chain and applies Bayesian updating to the neighbourhood of a lattice structure. The purpose of MCRF conditioning is to impute missing information over the realizations of the stochastic yearly precipitation. In the subsequent regression steps, the coherence between years may be destroyed by the realizations of random events in the yearly time series. Without conditioning, the predictor may have insufficient information to capture the coherency for the prediction. In this study, the Markov transition probabilities in the MCRF are trained to enhance the accuracy of regression predictors in the functional time series analysis. This conditioning method improves the effectiveness of ensemble learning and accommodates extreme weather conditions. By using MCRF conditioning, the distribution of dry and wet days in the source years is modeled on a multi-scale time-frequency decomposition space. Let {Z(u) : u ∈ D} be a finite random field on a fixed subset D of a d-dimensional space with state space S = (1, . . . , n), in which M nodes reside on the observable manifold and the remaining N unobserved nodes are to be estimated in the unobservable manifold, where |D| = K = M + N. The quantized states over the random field are visualized as shown in Figure 2. Let y i (t) = ∑ m j=1 α ij κ(x ij (t), x j (t)) + φ i (t) + ξ i (t), for i = 1, · · · , L, and let Λ = {j ∈ S : |j| ≤ K} be a symmetric finite hypercube on S. Then, a probability measure can be defined on the set of Borel σ-fields as B(Ω Λ ), where Ω Λ is the configuration space of all sequences ω = {ω j } j∈Λ ; that is, Ω Λ = {1, . . . , n} Λ .  For each ω ∈ Ω Λ , the product measure in S is given by µ Λ {ω} = 2 −|Λ| , where |Λ| = 2N + 1. The Hamiltonian in ω ∈ Ω Λ is defined as H Λ,h (ω) = − 1 2 ∑ i,j∈Λ Φ({i, j})ω i ω j − h ∑ j∈Λ ω j , where Φ is a non-negative potential function on S, and h is a real number that describes gives the strength of an external influence acting at each site in Λ. A control parameter is defined as β = 1/T > 0. The probability measure or finite-volume Gibbs state µ Λ,β,h on B(Ω Λ ) is defined as where Z(Λ, β, h) is the partition function defined as follows: From the invariance requirement, it follows that the conditional probabilities can be determined by K + 2 parameters. Therefore, the transition matrix can be identified using a conventional machine learning method based on the maximal likelihood algorithm. The random field D is designed as a space of multi-scale time-frequency representation, which is obtained from a continuous wavelet transformation (CWT) X(a, t) = 1 √ a ξ τ−t a x(τ)dτ, with mother wavelets ξ(·, ·) performing dilation in scaling a and shifting t.
The wavelet time-frequency representation reflects the local properties of the time series. For example, the CWT of yearly precipitation clearly exhibits self-similarity in the subspace (Figure 3). The vertical axis in Figure 3 represents the period (frequency) of the CWT in a log scale, starting from the 1-year period at the top to the 1-day period at the bottom.   Figure 2a indicates that the transitional probabilities or transiograms are learned based on wavelet coefficients, which are quantized into discrete intervals and served as the state spaces of each Markov chain of all nodes. We take 60 levels for the quantization. In the control of the balance between a stochastic realization and the MCRF resampling, an imputation mask is provisioned on the ridge curves and the areas of high frequency components, as shown in Figure 2b,c. Due to the semi-periodic nature of storm activities, the ridge amplitudes after imputation are stronger for components in the vicinity of storm seasons, as shown in Figure 2d. Wavelet ridge analysis is used to compute ridge points over the time-frequency representation of a time series x(t). Then, an amplitude ridge point and phase ridge point form a time-scale pair (a, t), with a strict local maximum of {ln |W φ (a, t)|} on the a axis and ∂ ∂t ln |W φ (a, t)| = ω φ /a the t axis. The sets of ridge points collectively form wavelet ridge curves. Notably, Figure 2 shows only the amplitude of the complex numbers for ease of explanation, and the modulus of the complex wavelet coefficients is not shown in the graph.
Power spectra are compared in Figure 4, and subgraph (a) shows the original power spectrum in a particular year. Figure 4b shows the imputed result after adding coherence components in the spectrum according to the simulation of the trained transiograms and the sampling masks obtained through ridge analysis. Figure 4c demonstrates and compares the rainfall distribution after MCRF conditioning. The conditioned distribution differs from the original one because the constant trend has been suppressed, and repetitive information from other years has been imputed. Therefore, as shown in the right-hand-side of the Equation (2), Y(t) has additionally inserted rich information for parameter estimation.

Data Provisioning
The data source for this study was collected from the Central Weather Bureau (CWB) of Taiwan. From 1994 to 2020, structured data have been recorded at 54 stations across the island (area of 36,193 km 2 ). Because a large portion of the territory has high mountains, data collection is difficult in remote areas. Therefore, the recorded data may not be continuous or consistent because stations and equipment may fail or switch in each location. Data cleaning and screening must be performed to provision and populate the data cubes.
Original precipitation data records hourly. The values are aggregated into a daily format. Data from stations with a sufficient number of years of successive data and similar climate conditions are retained in this study. Finally, 26 years of precipitation data from 12 stations are selected for model training. As required for machine learning processes, the data are divided into training and validation sets. Validation is performed based on a hold-out year. Notably, in the validation, high GOF of prediction in the target year and low GOF in non-target years is expected. The functional bandwidth is naturally selected to be 365 days, and each functional is split into three discrete wavelet transform bands for the application of suitable learners for each band.

Results and Discussion
The proposed algorithms are tested using CWB data covering 26 years and 12 stations. In the quantification of the prediction stability for various levels of weather extremity, experiments are designed to verify whether the prediction accuracy drops significantly as the weather extremity is controlled to be increased. For providing a compatible level of comparing experiment, the prediction algorithms are performed for three levels of weather extremity, which are established based on controlled logarithmic and exponential transformations of the raw data. As shown in Figure 5, the original precipitation data are labeled original, and the smooth and extreme precipitation data are logarithmic and exponential transformed, respectively, while maintaining the same level of yearly mean.  Applying MCRF conditioning, as described in (6), significantly improved the prediction accuracy. As shown in Figure 6, MCRF conditioning improved prediction accuracy for not only the training years, but also the validation years. In the ensemble method in the second stage of precipitation prediction modeling, multiple learners are used to combine the intermediate predictions from the first stage, including ordinary generalized matrix inverse, ridge, lasso, and elastic net multivariate regression. For comparison, a deep convolutional neural networks (VGG19 [54]) is also included in the learner ensemble. The experiments demonstrated that the combination of lasso and ridge estimation effectively reconstructed the precipitation time series in the training and validation phases. Kling-Gupta efficiency (KGE) values are applied to assess the GOF for the developed hydrological models [55,56].
In Equation (9), the statistics between simulated and observed time series are compared and exhibit a tendency of similarity. A positive KGE value is firmly considered acceptable. To meet the goal of year-specific prediction, the notion of KGE is extended to assess GOF over all testing years. We define a difference KGED i = KGE i − max{KGE j |∀j = i} for a metric in the target year i. This quantity reflects the fact that an accurately targeted prediction must be of high GOF at KGE i but of low GOF for all other KGE j . Therefore, a high KGED i indicates a good prediction for the year i.
For a water resource manager or a farmer, without our analysis, she only can roughly guess a wide range of precipitation for a large period of time by averaging past several years. For example, in Figure 7, we illustrated March's daily precipitation from 1996 to 2019. If a water resource manager lives in the beginning of year 2018 and wants to plan for March, she can only ponder the decisions from past experiences without a suitable tool. Looking up the past years, we found that most years had high precipitation at the beginning of March. However, it is challenging to know that when the peak rainfall arrives. Without a tool, it is not easy to make a plan. The manager needs a specific day for planting the crops because a large amount of human resources and machinery need to be booked in advance. Our prediction offers the manager a handy tool to schedule her jobs through a clear prediction of raining amount at each specific date in March.
The last row in Figure 7 showed our predictions. The 2016 and 2017 belonged to the training set, and we list them here only for reference. The prediction performance should look at the accuracy of new data from 2018 and 2019. The data of 2018 and 2019 were pre-excluded from the training set for validation. Therefore, the data of 2018 and 2019 were entirely new to the predictor.
After all testing iterations, the accuracies converged to a stable result for the validation years. Based on the prediction results in Table 1, MCRF conditioning significantly improved the prediction accuracy (KGE) and specificity (KGED) in the proposed two-stage prediction ensemble learning model. The improvement is particularly prominent in the case of extreme weather conditions. Figure 8 illustrated a prediction result for the proposed ensemble approach. Based on the GOF articulated in Table 1, three predictions are demonstrated in the yearly functional form. The trained model can predict the precipitation of the entire next year. Even in the case of extreme weather conditions, the prediction still maintained robustness without losing statistical details.
Our algorithm is superior to other algorithms because we avoid over-fitting, and the prediction is made specifically for a target year. For example, despite small residuals for the unseen 2018 data, this prediction was still specific for 2018, not for any other years. Our goal of prediction satisfies two criteria: statistical error should be small for the unseen years, and should not close to other years (the error difference between target and non-target years should be large). For one more future year, 2019, consistent with our expectations, the prediction became inaccurate for a longer future. Therefore, the prediction in 2019 is less trustful than expected.  Note also that our algorithm remains consistent even when weather conditions become increasingly extreme. Therefore, in Figure 8c, when weather conditions become extreme, the manager still can rely on the tool to schedule the work without being interfering by weather extremity.    Table 1, the graphs demonstrate that the prediction is accurate for 2019 (the third is similar to the second) but not for 2020 (the third is not similar to the fourth). Even the weather condition becomes extreme, the prediction and KGE in (c) still remains accurate for 2019.

Conclusions
This study proposes conditioning and ensemble learning methods to perform forecasting on multi-scaled functional time series over a sparse l 1 space. In contrast to other long-range simulations that can yield only a generic rainfall distribution, the proposed long-range prediction method has high yearly-specificity; that is, the performance metric in the present study includes the difference of the GOF of the target year to the GOFs of other years.
The present finding suggests that the MCRF conditioning approach resisted statistical variation due to increasingly extreme weather conditions. The machine learning model accurately predicts the precipitation distribution for the conditioned decomposition in the sparse l 1 space. In the results, year-specific accuracy remained high despite increasingly extreme weather conditions. This long-range prediction algorithm may be helpful in certain applications, such as maximizing water usage and increasing yields in agriculture.
Farmers can foresee the precipitation on the planned planting dates. Our tool is even more helpful when the weather conditions become extreme. The difference between with and without the tool reflects on the stability of prediction. The past experiences easily fail at extreme weather. However, we demonstrated that our prediction remains stable when weather conditions become extreme. Therefore, farmers can rely on our method to make their production schedule without being interfering by extreme weather conditions. Because this study uses conditioning method in conventional kernel model to address the prediction problem of functional time series, the modeling suitability is limited by the Gaussian assumption in the kernel model. We will relax the Gaussian assumption in the future.