A Machine-Learning Approach Combining Wavelet Packet Denoising with Catboost for Weather Forecasting

Abstract: Accurate forecasting of future meteorological elements is critical and profoundly affects many aspects of human life, from rainstorm warning to flight safety. Conventional numerical weather prediction (NWP) sometimes performs unsatisfactorily because of inappropriate initial-state settings. In this paper, a short-term weather forecasting model based on wavelet packet denoising and Catboost is proposed, which takes advantage of fused information combining historical observation data with prior knowledge from NWP. Feature selection and spatiotemporal feature addition are also explored to further improve performance. The proposed method is evaluated on datasets provided by Beijing weather stations. Experimental results demonstrate that, compared with deep-learning and machine-learning methods such as LSTM, Seq2Seq, and random forest, the proposed Catboost model incorporating wavelet packet denoising achieves shorter convergence time and higher prediction accuracy.


Introduction
Weather prediction is of great importance and affects many aspects of daily life, such as air quality, travel plans, and energy supply [1][2][3][4]. A conventional prediction method is the numerical weather prediction (NWP) method, which solves the numerical solutions of atmospheric hydro-thermodynamic equations to predict meteorological dynamics [5][6][7]. However, unsatisfactory prediction results are obtained if inappropriate initial states are set [7,8]. Moreover, conventional NWP-based approaches do not take full advantage of the vast amount of existing historical observation data [9]. As observation techniques develop and historical meteorological data keep growing, purely data-driven approaches are expected to be introduced into weather forecasting [10]. They do not solve complex differential equations but quickly model the meteorological dynamics by learning from large historical datasets.
It is well known that several machine-learning methods, including artificial intelligence (AI) methods, have been demonstrated to be powerful for weather forecasting [9][10][11][12][13]. For one-dimensional timeseries meteorological data, several data-driven machine-learning approaches have been presented. In [14], autoregressive integrated moving average artificial neural networks (ARIMA-ANN) combined with ARIMA-Kalman are proposed to predict wind speed, showing the effectiveness of the hybrid model. In [15], global radiation is effectively predicted by a hybrid ARIMA/ANN model. In [16],

Problem Statement
In this work, historical meteorological observations from 10 weather stations for more than three years and RMAPS-based preliminary weather forecast data from NWP [22] are supplied. They can be defined as follows.
Historical meteorological observation datasets are denoted as O(t) = [o_1(t), ..., o_{N1}(t)], where the variable o_i(t) is one of N1 meteorological features, for t = 1, ..., T_o. Here, N1 = 9. Preliminary weather forecast timeseries obtained by NWP are denoted as F(t) = [f_1(t), ..., f_{N2}(t)], where the variable f_i(t) is one of N2 NWP features, for t = T_o + 1, ..., T_o + T_f, and T_f is the number of forecasting steps. In this work, N2 = 29. Additionally, a forecast time period from 3:00 to 15:00 (UTC) of the next day is required, so T_f = 37. The ground truth of the target meteorological variables Y(t) and their predictions Y^(t) are defined analogously, where the variable y_i(t) is the true value of one of N3 meteorological variables, for t = T_o + 1, ..., T_o + T_f. In this work, relative humidity at 2 m (rh2m), wind at 10 m (w10m), and temperature at 2 m (t2m) are the target forecasting variables; therefore, N3 = 3.

Data Processing

Missing Values
In this work, two kinds of missing values occur: local missing (local, non-continuous gaps) and block missing. For local missing, linear interpolation is employed. For block missing, the data of the affected days are deleted directly; 40 days of data are therefore removed from the 1188-day dataset.
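The missing-value handling above can be sketched with NumPy, assuming missing samples are marked as NaN; the `max_gap` cutoff separating local gaps from block gaps is an illustrative choice, not a value from the paper.

```python
import numpy as np

def fill_local_missing(series, max_gap=3):
    """Linearly interpolate short (local) NaN gaps; gaps longer than
    max_gap are left as NaN so the whole day can be dropped later.
    `max_gap` is an illustrative cutoff, not a value from the paper."""
    x = np.asarray(series, dtype=float)
    isnan = np.isnan(x)
    if not isnan.any():
        return x
    idx = np.arange(len(x))
    filled = x.copy()
    # interpolate every NaN from the surrounding valid samples
    filled[isnan] = np.interp(idx[isnan], idx[~isnan], x[~isnan])
    # undo the interpolation for blocks longer than max_gap (block missing)
    start = None
    for i in range(len(x) + 1):
        if i < len(x) and isnan[i]:
            if start is None:
                start = i
        elif start is not None:
            if i - start > max_gap:
                filled[start:i] = np.nan
            start = None
    return filled
```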

Additional Spatiotemporal Features

Figure 1 shows the historical statistical means of the meteorological variables t2m and w10m. Compared with the other weather stations, the t2m mean at station ID 7 shows obvious differences, and the w10m means at station IDs 6 and 7 follow different trends. Thus, the station ID should be added as a categorical variable; in this work, one-hot encoding is employed to depict this spatial feature.
As for the temporal features, the historical variation of the target variable t2m is presented in Figure 2, where strong periodicity and seasonality can be observed; the temporal features should therefore be considered. If time features such as month and hour were also one-hot encoded, a large feature space would be occupied. Moreover, temporal continuity, such as between December and January, would be destroyed if the month feature (1-12) were directly mapped into the interval of 0-1. In this work, clock projection is utilized to extract the temporal features. Specifically, the month feature and the hour feature are transformed to the positions 1-12 on a clock, as shown in Figure 3. The projected horizontal and vertical coordinate values (within 0-1) are deemed new temporal features. Test results demonstrate that the prediction accuracy can be significantly improved by the additional spatiotemporal features.
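As a sketch, the clock projection can be implemented by placing the cyclic value on a unit clock face and rescaling the two coordinates into [0, 1]; the exact rescaling used in the paper is not specified, so this form is an assumption.

```python
import math

def clock_projection(value, period=12):
    """Map a cyclic feature (month 1-12, or hour on a 12-position clock)
    to its horizontal/vertical clock coordinates, rescaled into [0, 1].
    The rescaling (sin/cos shifted into the unit interval) is an
    assumption; the paper only states the coordinates lie within 0-1."""
    angle = 2.0 * math.pi * (value % period) / period
    x = (math.sin(angle) + 1.0) / 2.0  # horizontal clock coordinate
    y = (math.cos(angle) + 1.0) / 2.0  # vertical clock coordinate
    return x, y
```

With this encoding, December (12) and January (1) land on adjacent positions of the circle, preserving the temporal continuity that a direct 1-12 scaling would break.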


Feature Selection
In this work, 38 meteorological features are provided for weather forecasting by the historical observation dataset and the NWP forecasting dataset. Considering both prediction accuracy and training computational cost, an ensemble selection method integrating three techniques is employed for feature selection: RFE, the correlation matrix, and a tree model. Here, SVM-based recursive feature elimination (RFE) is adopted [37]. The correlation matrix, also called the correlation coefficient matrix, is presented as a map in Figure 4. Moreover, the Catboost model is also used to rank feature importance. With this ensemble selection model, the less important features are screened out.
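The correlation-matrix component of the ensemble can be sketched as follows; RFE and Catboost feature importance would be combined with it in the full ensemble, and the function name here is illustrative.

```python
import numpy as np

def correlation_ranking(X, y, names):
    """Rank features by absolute Pearson correlation with the target;
    the least correlated features become candidates for removal."""
    scores = {}
    for j, name in enumerate(names):
        scores[name] = abs(np.corrcoef(X[:, j], y)[0, 1])
    return sorted(scores, key=scores.get, reverse=True)
```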


Model Architecture
The proposed short-term weather forecasting model is based on wavelet packet denoising and Catboost. The block diagram of the proposed forecasting model is presented in Figure 5. First, the input data are prepared with the above-mentioned data preprocessing, including data cleaning, spatiotemporal feature addition, and feature selection. Then, the wavelet packet denoising method is applied as a "pre-learning task", which improves the accuracy and shortens the convergence time. At last, the Catboost model is utilized for training and weather forecasting.

Wavelet Packet Denoising Principle
Signal denoising, especially for sensor noise in the historical meteorological observation datasets, is of great importance. This "pre-learning task" will be shown to effectively shorten the subsequent training convergence time and enhance forecast accuracy. For noise reduction in the non-stationary signals of meteorological datasets, wavelet-based denoising methods are effective owing to their multi-resolution and multi-scale analysis property [38]. Wavelet transforms have been successfully used in many scientific fields, such as image compression, image denoising, and signal processing, to name only a few. The window size in the wavelet transform is fixed, while its shape is adjustable; both the time window and the frequency window can be adjusted. Hence, it has a strong ability to extract local features of signals and detect singularities [39]. The denoised signals are obtained by applying a threshold function to the wavelet coefficients and then performing wavelet reconstruction [40].
In wavelet-based denoising methods, the wavelet basis function and the threshold function have a great influence on the sparsity of the wavelet representation coefficients. In this work, the Daubechies (dbN) basis function is adopted for its fine regularity: the smoothing error introduced when this wavelet is used as a sparse basis is difficult to perceive, so the denoised signal is much smoother [40]. Moreover, the soft-threshold method is employed here; it has good continuity and processes the wavelet coefficients more smoothly.
For a signal f(t) ∈ L²(R), the wavelet decomposition coefficients can be obtained as

W_f(a, τ) = (1/√a) ∫ f(t) ψ*((t − τ)/a) dt,

where ψ_{a,τ}(t) is the mother wavelet, the parameter a indicates the scale index, and τ represents the time shifting. In order to extract local features of signals in more fine-grained detail and enhance the time-frequency resolution, the wavelet packet denoising method is utilized in this paper. The recursive formula of the wavelet packet transform is as follows:

W_{2n}^{i+1}(t) = Σ_k g_k W_n^i(2t − k),  W_{2n+1}^{i+1}(t) = Σ_k h_k W_n^i(2t − k).
The signal W_n^i(t) goes across the orthogonal filter pair combining the high-pass filter h_k and the low-pass filter g_k [39], and wavelet packet decomposition divides the signal into different frequency spans layer by layer. The width of each frequency span, Δf, can be obtained as

Δf = f_s / 2^{i+1},

where f_s is the sampling frequency and i is the decomposition layer. The expected frequency span can be obtained when the layer i is set appropriately, and the original signal can be separated from noise and interference signals when the frequency bins are divided appropriately. The recursive reconstruction formula is

W_n^i(t) = Σ_k [ g_k W_{2n}^{i+1}(2t − k) + h_k W_{2n+1}^{i+1}(2t − k) ].

It is beneficial that we can choose all the frequency bins or only some of them and set the others (noise or random interference) to zero; when the signal is decomposed into different frequency bins, it is easy to remove the noise during reconstruction. Figure 6 illustrates the three-layer wavelet packet decomposition.
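A minimal one-level sketch of the filtering recursion, using the Haar (db1) filter pair for brevity; the paper adopts longer Daubechies dbN filters, so this is an illustrative simplification.

```python
import numpy as np

# Haar analysis filters: the paper's low-pass g_k and high-pass h_k
# (Haar = db1 is used here only to keep the sketch short).
LOW = np.array([1.0, 1.0]) / np.sqrt(2.0)
HIGH = np.array([1.0, -1.0]) / np.sqrt(2.0)

def wp_split(w):
    """One recursion level: filter W_n^i and downsample into the two
    child subbands W_{2n}^{i+1} (low) and W_{2n+1}^{i+1} (high)."""
    w = np.asarray(w, dtype=float)
    low = w[0::2] * LOW[0] + w[1::2] * LOW[1]
    high = w[0::2] * HIGH[0] + w[1::2] * HIGH[1]
    return low, high

def wp_merge(low, high):
    """Inverse step: upsample the two subbands and synthesize W_n^i."""
    out = np.empty(2 * len(low))
    out[0::2] = low * LOW[0] + high * HIGH[0]
    out[1::2] = low * LOW[1] + high * HIGH[1]
    return out

def freq_span(fs, layer):
    """Width of each frequency bin after `layer` decomposition levels."""
    return fs / 2.0 ** (layer + 1)
```

Applying `wp_split` recursively three times yields the 2³ = 8 subbands of Figure 6; zeroing a subband before `wp_merge` removes its frequency content.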
In addition, for the choice of the wavelet packet threshold, the three-layer wavelet packet and the soft threshold are utilized. The threshold Th_wp is designed to be self-adaptive with the decomposition layers and takes advantage of the mean and standard deviation of the decomposition coefficients: Th_wp(p) is the wavelet packet threshold of subband p, combining the mean and the variance; C_{3,p}(j), p = 1, 2, ..., 8, are the wavelet packet transform coefficients; and M is the coefficient length. When a coefficient C_{3,p}(j) is smaller in magnitude than the threshold Th_wp(p), it is set to zero; otherwise, it is set to sign(C_{3,p}(j)) · (|C_{3,p}(j)| − Th_wp(p)), which shrinks C_{3,p}(j) continuously toward zero. The historical meteorological observation signals are denoised by the wavelet packet transform. Figure 7 shows the original observed and denoised curves of the 2 m relative humidity.
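The soft-threshold rule just described can be sketched as follows; `adaptive_threshold` illustrates one way to combine a subband's mean and standard deviation, since the paper's exact expression for Th_wp is not reproduced here.

```python
import numpy as np

def soft_threshold(c, th):
    """Soft-threshold a subband of coefficients: values below th are
    zeroed, the rest are shrunk toward zero, keeping the mapping
    continuous (unlike hard thresholding)."""
    c = np.asarray(c, dtype=float)
    return np.sign(c) * np.maximum(np.abs(c) - th, 0.0)

def adaptive_threshold(c):
    """Illustrative self-adaptive threshold combining the mean and the
    standard deviation of a subband's coefficient magnitudes; treat
    this particular combination as an assumption."""
    a = np.abs(np.asarray(c, dtype=float))
    return a.mean() + a.std()
```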


Learning Model: Categorical Boosting
Gradient boosting is an effective and powerful machine-learning technique for solving problems with complex dependencies, noisy data, and heterogeneous characteristics. It has a theoretical explanation of how iteration combines weak models through gradient descent in function space [32] and has demonstrated state-of-the-art performance in a variety of practical tasks, such as ET0 estimation [30], global solar radiation prediction [33], and web searching [35]. Catboost, proposed by the Yandex Company, is a novel gradient boosting algorithm [32] that makes many improvements to overcome model overfitting and to support parallelism, so that training can be completed in less time. Generally, high prediction accuracy is the key point; however, good stability and a low computational workload should also be emphasized when employing machine-/deep-learning models. Some models are inherently unstable and produce less precise estimates on new datasets [30].
Catboost can deal with categorical features well and is employed here as the learning model for short-term weather forecasting. It has the following advantages [32]:
Category features: In order to reduce overfitting and utilize the whole dataset for training, an efficient strategy called target statistics (TS) with minimum information loss is employed in Catboost. Specifically, for the input example set D = {(x_k, y_k)}_{k=1,...,n}, several random permutations are performed. Then, the average label values are calculated over the preceding examples in the sequence with the same category value. Finally, all classification features are substituted with the following formula:

x̂_{σ_p}^i = ( Σ_{j=1}^{p−1} 1[x_{σ_j}^i = x_{σ_p}^i] · y_{σ_j} + β · P ) / ( Σ_{j=1}^{p−1} 1[x_{σ_j}^i = x_{σ_p}^i] + β ),

where σ denotes a permutation, 1[·] is the indicator function, the parameter β > 0, namely the weight of the prior, dampens the noise from low-frequency categories, P is a prior value, y_k ∈ R is the target, and x_k = (x_k^1, ..., x_k^M) is a random vector of M features. Feature combinations: A greedy way is utilized when Catboost constructs a new split in the tree. No combination is considered for the first split. For the subsequent splits, however, Catboost combines all categorical features of the dataset with the combinations and categorical features already present in the current tree. Moreover, all splits selected in the tree are treated as categories with two values and are used in combinations in the same way.
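A sketch of the ordered target-statistics substitution for a single permutation; here the given order stands in for one random permutation, and the function name is illustrative.

```python
def ordered_target_statistics(categories, targets, prior, beta=1.0):
    """Replace each categorical value with a target statistic computed
    only from the examples that precede it in the permutation, so an
    example's own label never leaks into its encoding. `beta` weights
    the prior P, damping noise for low-frequency categories."""
    counts, sums, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        n = counts.get(cat, 0)        # preceding occurrences of cat
        s = sums.get(cat, 0.0)        # sum of their labels
        encoded.append((s + beta * prior) / (n + beta))
        counts[cat] = n + 1
        sums[cat] = s + y
    return encoded
```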
Unbiased boosting: In Catboost, ordered boosting is developed through theoretical analysis to solve the gradient bias that is inevitable when traditional GBDT employs the TS method to convert categorical features into numerical values. Moreover, multiple permutations of the training data are employed to enhance robustness: different permutations are used to train distinct models, which helps deal with the overfitting problem.
Fast scorer: Oblivious trees, which are balanced and less inclined to overfitting, are used as base predictors in Catboost. Moreover, in order to calculate the model predictions, each leaf index in the Catboost model evaluators is encoded as a binary vector, whose length is equal to the depth of the tree. Figure 8 shows the structure of the Catboost algorithm.
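The fast-scorer idea can be sketched as follows: because an oblivious tree applies the same (feature, threshold) split at every node of a level, the path through a tree of depth d is a binary vector of length d, and the leaf index is that vector read as a binary number. The split values below are illustrative.

```python
def oblivious_leaf_index(x, splits):
    """Evaluate an oblivious tree: `splits` is one (feature, threshold)
    pair per level, so the comparison outcomes form a binary vector
    whose length equals the tree depth; the leaf index is that vector
    interpreted as a binary number."""
    bits = [1 if x[f] > t else 0 for f, t in splits]
    index = 0
    for b in bits:
        index = (index << 1) | b
    return bits, index
```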

Performance Analysis and Comparisons
In this section, the performance analysis and comparisons of the proposed weather forecasting model with other machine-learning and deep-learning methods are illustrated. Firstly, the evaluation metrics are shown.

Statistical Evaluation
The root mean squared error (RMSE) for the three objective variables from the 10 stations is calculated as the daily evaluation:

RMSE_{i,ml} = sqrt( (1 / (10 · T_f)) · Σ_{s=1}^{10} Σ_{t=T_o+1}^{T_o+T_f} ( y_{i,s}(t) − y_{i,s}^{ml}(t) )² ),

where y_{i,s}(t) and y_{i,s}^{ml}(t) are the ground truth and the forecast value (by the proposed machine-learning method) of the objective variable i (here, i denotes t2m, rh2m, and w10m) of station s at time t, respectively. Similarly, RMSE_{i,NWP} can be obtained, using the value F(t) predicted by the NWP method.
Then, the associated skill score S_i is employed to compare the forecasting improvement with the classic NWP method:

S_i = ( RMSE_{i,NWP} − RMSE_{i,ml} ) / RMSE_{i,NWP}.
S i > 0 means that the proposed machine-learning method can obtain lower RMSE and better prediction accuracy than the NWP method. The higher S i is, the better the forecasting performance by the proposed method is.
S day is the average skill score of the three objective variables and is the ultimate prediction criterion for each day.
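The daily evaluation can be sketched as below; the skill-score expression is the relative-improvement form assumed here (it satisfies the stated property that S_i > 0 exactly when the machine-learning RMSE is lower) and may differ in detail from the paper's definition.

```python
import math

def rmse(truth, pred):
    """Root mean squared error over all stations and forecast steps."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def skill_score(rmse_nwp, rmse_ml):
    """Assumed relative-improvement skill score: positive exactly when
    the machine-learning forecast beats the NWP forecast."""
    return (rmse_nwp - rmse_ml) / rmse_nwp
```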

Baselines and Experimental Settings
In this work, the machine-learning method random forest and two deep-learning methods LSTM and Seq2Seq are implemented as baselines for comparisons. Seq2Seq method is well known in the field of natural language processing and is effective to solve timeseries prediction problems.
The tests were performed on a GPU server with a GTX 1080Ti GPU, 11 GB of video memory, and a Pytorch programming environment. As mentioned above, 38 meteorological features are provided from the observed and NWP datasets. The three least correlated characteristics are removed, and the spatiotemporal features are added in the data preprocessing phase. The hyperparameter setting T_o = 28 means that the previous 28 h of observation data are selected to model the recent meteorological dynamics. The data from 1148 days (1 March 2015 to 1 June 2018) are used for the training set, and the data from 87 days (3 June 2018 to 29 August 2018) for the validation set. Since the evaluation is based on online daily forecasting, the test day index is 1. The main parameters in the random forest and Catboost methods are the maximum depth and the number of trees, which are optimized by the grid search method.
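The grid search over the maximum depth and the number of trees can be sketched generically; `score_fn` stands in for fitting a model with the given parameters and returning its validation skill score (higher is better), so both the parameter names and the scorer below are illustrative.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustive grid search, as used to tune the maximum depth and
    the number of trees; returns the best parameter dict and score."""
    best_params, best_score = None, float("-inf")
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        s = score_fn(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score
```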
Pseudocode describing the whole methodology is provided in Algorithm 1.

Performance Analysis
This work is based on a dataset from an attended online competition for daily weather forecasting. The daily forecasting evaluation score S_day and the average score S_avg over five continuous competition days are presented in Table 1, based on incremental data released daily according to real-world forecasting processes. In this table, LSTM, Seq2Seq, and random forest ("RF" for brevity) are implemented for comparison. "ST" means that spatiotemporal features are added, "FS" denotes that feature selection is employed, and "WPD" indicates that the wavelet packet denoising method is combined. First, comparing "LSTM + FS" with "LSTM + FS + ST", it can be clearly observed that adding spatiotemporal features remarkably improves the prediction accuracy. Similarly, comparing "Catboost + ST + FS" with "Catboost + ST" shows that feature selection is also effective: the less important features are screened out by the ensemble selection model, which reduces redundant information. In order to verify the importance of wavelet packet denoising, the results of "RF + FS + ST" versus "RF + FS + ST + WPD", as well as "Catboost + ST + FS" versus "Catboost + ST + FS + WPD", can be compared. Obviously, wavelet packet denoising greatly promotes the prediction scores. Wavelet packet denoising also takes effect as the "pre-learning" process and effectively shortens the entire learning convergence time. For example, after wavelet packet denoising of the data, Catboost needs only approximately 40 epochs to converge, which is rather short in contrast with the original, about 100 epochs.
Second, considering the effect of information fusion, the effectiveness of fusing NWP forecasting can be validated by comparing "Catboost + ST + FS + WPD" with "Catboost + ST + FS + WPD + noNWP", where the NWP forecasting is masked by zero values. Similarly, comparing "Catboost + ST + FS + WPD" with "Catboost + ST + FS + WPD + noOBS" demonstrates the advantage of modeling recent meteorological dynamics. Moreover, S_day > 0 for "Catboost + ST + FS + WPD + noOBS" indicates that the machine-learning method can enhance the performance of NWP alone, while S_day > 0 for "Catboost + ST + FS + WPD + noNWP", without NWP information, illustrates that the proposed method has superiority in modeling meteorological data. Catboost successfully handles categorical features and uses a new schema for calculating leaf values when selecting the tree structure, which helps to reduce overfitting. These comparisons clearly show that conducting information fusion is better, and that modeling with OBS or NWP alone is not good enough: NWP forecasting contains important prior knowledge.
Moreover, the raw RMSE values for temperature, relative humidity, and wind speed with all the methods are presented in Tables 2-4, respectively. It is clear that the proposed "Catboost + ST + FS + WPD" method also achieves the best RMSE values. Results for a new cross-validation dataset are given in Table 5, where S_day01-20 is the average S value from the first to the twentieth day of the test set, and S_avg is the average S value from the first to the one-hundred-and-twentieth day. From this table, the proposed "Catboost + ST + FS + WPD" method also achieves the best average S values for the new cross-validation dataset. To demonstrate the effect of varying important hyper-parameters, the proposed method is tested with different values of "iterations" (the number of trees), and the results are shown in Table 6; "Catboost 500" means the proposed method with "iterations" = 500. It is verified that hyper-parameters are also important for prediction performance. In addition, apart from accuracy, it is also vital for model construction to minimize model complexity [30]. To illustrate the superiority of the proposed algorithm in convergence computing efficiency, a convergence computational time-cost experiment under three levels of datasets was implemented. Table 7 presents the computational time costs of the proposed method and the LSTM, RF, and Seq2seq models for different numbers of stations. We test three datasets with different levels of data size: Level 1 (data from one station), Level 2 (data from five stations), and Level 3 (data from 10 stations). It can be seen that the average computing time was algorithm specific, and the average time consumed by the proposed algorithm was much less than that of the LSTM, RF, and Seq2seq algorithms, especially the two deep-learning methods.
A more noteworthy observation is that the computing time costs of LSTM, RF, and Seq2seq increase rapidly as the training data size increases, while that of the proposed method does not change much. For the Level 2 and Level 3 datasets (just five and 10 stations), the computational costs of LSTM and Seq2Seq were 128.7-13,200.8 and 320.1-77,421.1 times the cost of the proposed method, which thus possesses a tremendous time-cost advantage when predicting the meteorological elements of hundreds of weather stations in practice. Tree-based algorithms are generally able to build decision trees in parallel, which further decreases the computing time [30]. Therefore, the proposed method achieves an apparent optimization in time complexity, especially when the input meteorological dataset is large. Furthermore, Figure 9a-c illustrates forecasting instances of the 2 m temperature, 2 m relative humidity, and 10 m wind speed curves at one station on a competition day. The horizontal coordinate values indicate the 37 h, and the vertical coordinate values represent the corresponding unit and value range of each variable. In each sub-figure, the left blue line is the observed meteorological value during the previous 28 h, the right blue line is the ground truth, the brown line is the Seq2Seq prediction, and the red line is the proposed prediction. It is also clear that the proposed method achieves higher prediction accuracy.

Conclusions
Based on the historical meteorological observation datasets and numerical weather prediction (NWP) data provided by Beijing weather stations, a short-term weather forecasting model based on wavelet packet denoising and Catboost is put forward. A correlation heat map and a tree method are combined for feature selection. Moreover, specific spatiotemporal features are extracted for the periodicity and spatial differences of weather features. Then, wavelet packet denoising is utilized to provide more effective denoised features, performing a part of the "learning" task in advance. Test results have demonstrated that, compared with the conventional LSTM, random forest, and Seq2Seq methods, the proposed method incorporating wavelet packet denoising with Catboost can significantly shorten the convergence time of the learning model and decrease the computational cost, as well as notably improve the prediction accuracy.
In this paper, some hyper-parameters (such as the soft threshold Th_wp in the wavelet packet denoising and the number of trees, "iterations", in Catboost) are important for prediction performance and need to be carefully selected. Future work includes automatically tuning hyper-parameters (e.g., a self-adaptive soft threshold) or trying some ensemble methods. Moreover, the effect of the historical observation sequence length, the fusion strategy for NWP forecast data (e.g., an attention mechanism), and the integration of deep-learning models (e.g., transformer structures) into the prediction model can be further explored to enhance prediction performance.