
A Machine-Learning Approach Combining Wavelet Packet Denoising with Catboost for Weather Forecasting

School of Automation, Southeast University, Nanjing 210096, China
Key Laboratory of Measurement and Control of CSE, Ministry of Education Research Laboratory, Nanjing 210096, China
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Institute of Meteorology and Oceanography, PLA University of Science and Technology, Nanjing 211101, China
School of Suzhou United Graduate College, Southeast University, Nanjing 210096, China
Author to whom correspondence should be addressed.
Atmosphere 2021, 12(12), 1618;
Submission received: 21 October 2021 / Revised: 30 November 2021 / Accepted: 1 December 2021 / Published: 4 December 2021

Abstract


Accurate forecasting of meteorological elements is critical and affects many aspects of human life, from rainstorm warnings to flight safety. Conventional numerical weather prediction (NWP) sometimes performs unsatisfactorily because of inappropriate initial state settings. In this paper, a short-term weather forecasting model based on wavelet packet denoising and Catboost is proposed, which fuses historical observation data with the prior knowledge from NWP. Feature selection and the addition of spatiotemporal features are also explored to further improve performance. The proposed method is evaluated on datasets provided by Beijing weather stations. Experimental results demonstrate that, compared with deep-learning and machine-learning methods such as LSTM, Seq2Seq, and random forest, the proposed Catboost model combined with wavelet packet denoising achieves a shorter convergence time and higher prediction accuracy.

1. Introduction

Weather prediction is of great importance and affects many aspects of daily life, such as air quality, travel plans, and energy supply [1,2,3,4]. The conventional approach is numerical weather prediction (NWP), which numerically solves the hydro-thermodynamic equations of the atmosphere to predict meteorological dynamics [5,6,7]. However, unsatisfactory predictions are obtained if inappropriate initial states are set [7,8]. Moreover, conventional NWP-based approaches do not take full advantage of the vast amount of existing historical observation data [9]. As observation techniques develop and historical meteorological data keep growing, purely data-driven approaches are expected to be introduced into weather forecasting [10]. They do not solve complex differential equations, but quickly model the meteorological dynamics by learning from large historical datasets.
Several machine-learning methods, including artificial intelligence (AI) methods, have been demonstrated to be powerful tools for weather forecasting [9,10,11,12,13]. For one-dimensional timeseries meteorological data, a number of data-driven machine-learning approaches have been presented. In [14], an autoregressive integrated moving average artificial neural network (ARIMA–ANN) combined with ARIMA–Kalman is proposed to predict wind speed, and the effectiveness of the hybrid model is shown. In [15], global radiation is effectively predicted by a hybrid ARIMA/ANN model. In [16], a Gaussian process method is employed to revise NWP results and achieve a more accurate prediction of the 24 h wind energy. Moreover, a deep hybrid model in [17] is utilized to predict a series of weather-related elements. In [18], the relationship between variables is captured by an auto-encoder, and rainfall is forecast by a multi-layer perceptron (MLP). In [19], a multi-layer perceptron with spatial–temporal attention is proposed for wind speed and wind direction in Beijing. In addition, the long short-term memory (LSTM) model is also effective for weather forecasting [20,21]. Seq2Seq and its variations, which are well known in the field of natural language processing, can be transferred to weather prediction problems [22].
On the other hand, some deep-learning methods employ two-dimensional radar echo maps to achieve short-term weather forecasts. In [8], short-range forecasting based on radar echo maps is viewed as a spatiotemporal sequence forecasting problem, and the convolutional LSTM (ConvLSTM) method is proposed to solve it. Furthermore, the same authors proposed the trajectory GRU (TrajGRU) to further improve forecast accuracy [23]. It uses not only the location-invariant structure of ConvLSTM but also a location-variant recurrent structure to efficiently capture motion patterns such as rotation and scaling in radar echo maps. In addition, the memory in memory (MIM) network [24], the improved predictive recurrent neural network (PredRNN++) [25,26], and generative adversarial gated (GAN) recurrent unit models [27,28] were also proposed to handle the limitations of the ConvLSTM model and enhance the prediction performance.
Although the accuracies of the above machine-/deep-learning methods are high, these purely data-driven models ignore the important prior knowledge from NWP and may not fully capture the spatiotemporal dynamics of diverse meteorological variables [19,22,29]. Besides, the training convergence time of some deep-learning methods is remarkably long, especially when many weather variables are forecast at a large number of weather stations. Decision-tree-based methods, such as random forest methods, are known to be quite effective for big data, while the training convergence time can be decreased greatly [30,31]. The trees in the forest differ from one another and consequently provide an “expert collection” that performs better than any single tree. Among these, the Catboost algorithm [32], which is an improved version of the gradient-boosting decision tree (GBDT) and uses a special type of depth-first expansion called oblivious trees, has performed well in many data-mining applications [33,34,35,36]. In this paper, a hybrid model based on Catboost and wavelet packet denoising is proposed to predict multiple meteorological variables at multiple future steps. Some additional features are extracted and fused. The model combines the advantages of NWP and machine learning to improve forecasting accuracy, while the training convergence time is much shorter than that of some deep-learning methods. The rest of the paper is structured as follows. The problem statement and our method are discussed in Section 2, followed by the experiments and performance analysis in Section 3. Lastly, we conclude with a brief summary in Section 4.

2. Materials and Methods

2.1. Problem Statement

In this work, historical meteorological observations from 10 weather stations for more than three years and RMAPS-based preliminary weather forecast data from NWP [22] are supplied. They can be defined as follows.
Historical meteorological observation datasets, denoted as
$$O(t) = [o_1(t), o_2(t), \ldots, o_{N_1}(t)] \in \mathbb{R}^{N_1}$$
where the variable $o_i(t)$ is one of the $N_1$ meteorological features, for $t = 1, \ldots, T_o$. Here, $N_1 = 9$.
Preliminary weather forecast timeseries obtained by NWP, denoted as
$$F(t) = [f_1(t), f_2(t), \ldots, f_{N_2}(t)] \in \mathbb{R}^{N_2}$$
where the variable $f_i(t)$ is one of the $N_2$ NWP features, for $t = T_o + 1, \ldots, T_o + T_f$, and $T_f$ is the number of forecasting steps. In this work, $N_2 = 29$. Additionally, the required forecast period runs from 3:00 (UTC) to 15:00 (UTC) of the next day, which gives $T_f = 37$ hourly steps.
Ground truth of the target meteorological variables $Y(t)$ and their predictions $\tilde{Y}(t)$, with
$$Y(t) = [y_1(t), y_2(t), \ldots, y_{N_3}(t)] \in \mathbb{R}^{N_3}$$
where the variable $y_i(t)$ is the true value of one of the $N_3$ meteorological variables, for $t = T_o + 1, \ldots, T_o + T_f$. In this work, relative humidity at 2 m (rh2m), wind at 10 m (w10m), and temperature at 2 m (t2m) are the target forecasting variables, and therefore, $N_3 = 3$.
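The dimensions in the problem statement can be checked with a few lines of Python. This is only an illustrative sketch: the calendar date is hypothetical, the helper names `hourly_steps` and `sample_shapes` are not from the paper, and the value 28 for the observation window is the one quoted later in the experimental settings.

```python
from datetime import datetime, timedelta

# Feature counts stated in the problem statement.
N1, N2, N3 = 9, 29, 3  # observation, NWP, and target feature counts


def hourly_steps(start, end):
    """Hourly timestamps from start to end, both ends included."""
    count = int((end - start) / timedelta(hours=1)) + 1
    return [start + timedelta(hours=h) for h in range(count)]


# Hypothetical forecast day: 03:00 UTC to 15:00 UTC of the next day.
steps = hourly_steps(datetime(2018, 6, 3, 3), datetime(2018, 6, 4, 15))
T_f = len(steps)  # number of forecasting steps


def sample_shapes(T_o, T_f):
    """Array shapes of one station-day sample (rows are hourly steps)."""
    return {"O": (T_o, N1), "F": (T_f, N2), "Y": (T_f, N3)}


shapes = sample_shapes(T_o=28, T_f=T_f)
```

Counting both end points of the 36 h interval gives the 37 steps used in the text.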

2.2. Data Processing

2.2.1. Missing Values

In this work, two kinds of missing values are handled: local missing values (local, non-continuous gaps) and block missing values. Local missing values are filled by linear interpolation. For block missing values, the data of the affected days are deleted directly; in total, 40 days are removed from the 1188-day dataset.
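The two cases can be sketched as follows. The paper does not state where a local gap ends and a block gap begins, so the `max_gap` parameter and the function name `fill_local_gaps` are assumptions made only for illustration; gaps that stay unfilled mark a day as a candidate for deletion.

```python
from typing import List, Optional


def fill_local_gaps(values: List[Optional[float]],
                    max_gap: int = 3) -> List[Optional[float]]:
    """Linearly interpolate runs of None that are at most max_gap long.

    Longer runs, and runs touching either end of the series, are left as
    None so the caller can treat them as block missing values.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                      # first missing index of this run
        while i < n and filled[i] is None:
            i += 1
        end = i                        # first valid index after the run, or n
        gap = end - start
        if gap <= max_gap and start > 0 and end < n:
            left, right = filled[start - 1], filled[end]
            for k in range(start, end):
                frac = (k - start + 1) / (gap + 1)
                filled[k] = left + frac * (right - left)
    return filled


def has_block_gap(values: List[Optional[float]], max_gap: int = 3) -> bool:
    """True if any value is still missing after local interpolation."""
    return any(v is None for v in fill_local_gaps(values, max_gap))
```

A day for which `has_block_gap` returns true would be dropped from the training data under this reading of the procedure.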

2.2.2. Additional Spatiotemporal Features

Figure 1 shows the historical statistical means of the meteorological variables t2m and w10m. Compared with the other weather stations, the t2m mean at station ID 7 differs noticeably, and the w10m means at station IDs 7 and 6 follow different trends. Thus, station ID should be added as a categorical variable. In this work, one-hot encoding is employed to represent this spatial feature.
As for the temporal features, the historical variation of the target variable t2m is presented in Figure 2. Strong periodicity and seasonality can be observed, so temporal features should be considered. If time features such as month and hour were also one-hot encoded, they would occupy a large feature space. Moreover, temporal continuity, such as that between December and January, would be destroyed if the month feature (1–12) were mapped directly into the interval 0–1. In this work, clock projection is utilized to extract the temporal features. Specifically, the month feature and the hour feature are placed at the positions 1–12 on a clock face, as shown in Figure 3. The projected horizontal and vertical coordinate values (within 0–1) are used as new temporal features. Test results demonstrate that the prediction accuracy can be significantly improved by the additional spatiotemporal features.
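A minimal sketch of such a clock projection is given below. The exact placement on the dial (value 12 at the top, clockwise order) and the rescaling of the sine and cosine to the interval 0–1 are assumptions; the text only states that the projected coordinates lie within 0–1.

```python
import math


def clock_projection(value, period):
    """Map a cyclic value (month 1-12, hour 0-23) to (x, y) in [0, 1]."""
    angle = 2.0 * math.pi * (value % period) / period
    x = 0.5 * (1.0 + math.sin(angle))   # horizontal coordinate on the dial
    y = 0.5 * (1.0 + math.cos(angle))   # vertical coordinate on the dial
    return x, y


def dial_distance(a, b, period):
    """Euclidean distance between two projected values."""
    xa, ya = clock_projection(a, period)
    xb, yb = clock_projection(b, period)
    return math.hypot(xa - xb, ya - yb)


# Two features per time variable instead of 12 or 24 one-hot columns.
month_features = {m: clock_projection(m, 12) for m in range(1, 13)}
hour_features = {h: clock_projection(h, 24) for h in range(24)}
```

With this encoding, December and January are as close to each other as January and February, which is the continuity that a direct mapping of 1–12 onto 0–1 would lose.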

2.2.3. Feature Selection

In this work, 38 meteorological features are provided by the historical observation dataset and the NWP forecasting dataset. Considering prediction accuracy and training cost, an ensemble selection method that integrates three methods is employed for feature selection: recursive feature elimination (RFE), the correlation matrix, and a tree model. SVM-based RFE is adopted [37]. The correlation matrix, also called the correlation coefficient matrix, is shown as a heat map in Figure 4. Moreover, the Catboost model is used to rank feature importance. With the ensemble selection model, the less important features are screened out.
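Only the correlation part of this selection is simple enough to sketch without external libraries. The snippet below ranks features by the absolute Pearson correlation with a target and drops the `k` weakest; how the paper merges this ranking with the RFE and Catboost rankings is not described, so that step is left out, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant column carries no linear information about the target.
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def drop_least_correlated(features, target, k):
    """Return (kept feature names, scores) after removing the k weakest.

    features: dict mapping feature name -> list of values
    target:   list of target values of the same length
    """
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get)      # weakest first
    dropped = set(ranked[:k])
    kept = [name for name in features if name not in dropped]
    return kept, scores
```

Under this sketch, `k = 3` would correspond to removing the three least correlated features mentioned in the experimental settings.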

2.3. Model Architecture

The proposed short-term weather forecasting model is based on wavelet packet denoising and Catboost. The block diagram of the proposed forecasting model is presented in Figure 5.
In this model, the historical meteorological observation dataset $O_{T_o} = [O(1), O(2), \ldots, O(T_o)] \in \mathbb{R}^{T_o \times N_1}$ and the NWP forecasting dataset $F_{T_f} = [F(1), F(2), \ldots, F(T_f)] \in \mathbb{R}^{T_f \times N_2}$ are first processed by the above-mentioned data preprocessing, including data cleaning, spatiotemporal feature addition, and feature selection. Then, wavelet packet denoising is applied as a “pre-learning task”, which can improve the accuracy and shorten the convergence time. Finally, the Catboost model is utilized for training and weather forecasting.

2.3.1. Wavelet Packet Denoising Principle

Signal denoising, especially for sensor noises in historical meteorological observation datasets, is of great importance. This “pre-learning task” will be verified to effectively shorten the subsequent training convergence time and enhance forecast accuracy. For noise reduction in non-stationary signals in meteorological datasets, it is reasonably effective to remove the noises by the wavelet-based denoising methods due to their multi-resolution and multi-scale analysis property [38]. Wavelet transforms have been successfully used in many scientific fields such as image compression, image denoising, and signal processing, to name only a few. The window size in wavelet transform is fixed, while its shape is adjustable. Both the time and frequency window can be adjusted. Hence, it has a strong ability to extract local features of signals and detect singularity in signals [39]. The denoised signals can be obtained by employing the threshold function to filter the wavelet coefficients and conduct the wavelet reconstruction [40].
In wavelet-based denoising methods, the wavelet basis function and the threshold function have a great influence on the sparsity of the wavelet representation coefficients. In this work, the Daubechies (dbN) basis function is adopted because of its good regularity. This means that the smoothing error introduced by this wavelet as a sparse basis is difficult to perceive, so the denoised signal can be much smoother [40]. Moreover, the soft-threshold method is employed here. It has good continuity and processes the wavelet coefficients more smoothly.
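For a signal $f(t) \in L^2(\mathbb{R})$, the wavelet decomposition coefficients can be obtained as
$$W_f(a, \tau) = \langle f(t), \psi_{a,\tau}(t) \rangle = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\, \psi\left(\frac{t - \tau}{a}\right) dt$$
where $\psi_{a,\tau}(t)$ is the mother wavelet scaled by $a$ and shifted by $\tau$; the parameter $a$ indicates the scale index, and $\tau$ represents the time shift.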
In order to extract local features of signals in finer detail and enhance the time-frequency resolution, the wavelet packet denoising method is utilized in this paper. The recursive formula of the wavelet packet transform is as follows.
$$W_{2n}(t) = \sum_k h_k W_n(2t - k)$$
$$W_{2n+1}(t) = \sum_k g_k W_n(2t - k)$$
The signal $W_1(t)$ passes through the orthogonal filter bank formed by the low-pass filter $h_k$ and the high-pass filter $g_k$ [39], and wavelet packet decomposition divides the signal into different frequency spans layer by layer. The width of the frequency span $\Delta f$ can be obtained as
$$\Delta f = \frac{f_s}{2^{i+1}}$$
where $f_s$ is the sampling frequency and $i$ is the decomposition layer. The expected frequency span can be obtained when layer $i$ is set appropriately. The original signal can be separated from noise and interference signals when every frequency span is wide enough. The recursive reconstruction formula is
$$W_n(2t) = \sum_k h_{2k+1} W_{2n}(t - k) + \sum_k g_{2k+1} W_{2n+1}(t - k)$$
$$W_n(2t - 1) = \sum_k h_{2k} W_{2n}(t - k) + \sum_k g_{2k} W_{2n+1}(t - k)$$
It is beneficial that we can choose all the frequency bins or some parts of them and set the others (noise or random interruption) as zero. When the signal is decomposed into different frequency bins, it is easy to extract noise by the reconstruction. Figure 6 illustrates the three-layer wavelet packet decomposition.
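The layer-by-layer splitting and the reconstruction can be illustrated with a small pure-Python sketch. It uses the Haar wavelet (db1), the simplest member of the Daubechies family, purely for brevity; the order N used in the paper is not stated, and all function names here are illustrative. The leaf nodes are returned in natural (filter) order, not sorted by frequency.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_split(x):
    """One analysis step: low-pass and high-pass halves of x."""
    low = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return low, high


def haar_merge(low, high):
    """Inverse of haar_split."""
    out = []
    for a, d in zip(low, high):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def wp_decompose(signal, levels=3):
    """Wavelet packet decomposition: every node is split at every level."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    nodes = [list(signal)]
    for _ in range(levels):
        nxt = []
        for node in nodes:
            low, high = haar_split(node)
            nxt.extend([low, high])
        nodes = nxt
    return nodes            # 2**levels leaf nodes


def wp_reconstruct(nodes):
    """Rebuild the signal by merging sibling nodes level by level."""
    while len(nodes) > 1:
        nodes = [haar_merge(nodes[i], nodes[i + 1])
                 for i in range(0, len(nodes), 2)]
    return nodes[0]


def band_width(fs, levels):
    """Width of each frequency span: fs / 2**(levels + 1)."""
    return fs / 2 ** (levels + 1)
```

Setting the coefficients of selected leaf nodes to zero before calling `wp_reconstruct` removes the corresponding frequency spans, which is the selective reconstruction described above. For three layers there are eight leaf nodes, each covering one sixteenth of the sampling frequency.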
In addition, for the choice of the wavelet packet threshold, the three-layer wavelet packet decomposition and the soft-threshold method are utilized. The threshold $Th_{wp}$ is designed to be self-adaptive to the decomposition layers and uses the mean and standard deviation of the decomposition coefficients, that is
$$Th_{wp}(p) = \frac{1}{M}\sum_{j=1}^{M} C_{3,p}(j) + \sqrt{\frac{1}{M}\sum_{j=1}^{M}\left(C_{3,p}(j) - \frac{1}{M}\sum_{l=1}^{M} C_{3,p}(l)\right)^2}$$
$$\tilde{C}_{3,p}(j) = \begin{cases} \operatorname{sign}\left(C_{3,p}(j)\right)\left(\left|C_{3,p}(j)\right| - Th_{wp}(p)\right), & \left|C_{3,p}(j)\right| \ge Th_{wp}(p) \\ 0, & \left|C_{3,p}(j)\right| < Th_{wp}(p) \end{cases}$$
where $Th_{wp}(p)$ is the wavelet packet threshold, which combines the mean and the standard deviation; $C_{3,p}(j)$, $p = 1, 2, \ldots, 8$, are the wavelet packet transform coefficients; and $M$ is the coefficient length. When the magnitude of a coefficient $C_{3,p}(j)$ is smaller than the threshold $Th_{wp}(p)$, the coefficient is set to zero; otherwise, it is set to $\operatorname{sign}(C_{3,p}(j))(|C_{3,p}(j)| - Th_{wp}(p))$, which gives the denoised coefficient $\tilde{C}_{3,p}(j)$.
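The thresholding step can be sketched as below, reading the threshold of a node as the mean of its coefficients plus their (population) standard deviation. That reading, the function names, and the assumption that the resulting threshold is non-negative are all illustrative; the paper may define the statistics slightly differently.

```python
import math


def node_threshold(coeffs):
    """Mean plus population standard deviation of one node's coefficients."""
    m = len(coeffs)
    mean = sum(coeffs) / m
    std = math.sqrt(sum((c - mean) ** 2 for c in coeffs) / m)
    return mean + std


def soft_threshold(coeffs, th):
    """Soft thresholding: zero small coefficients, shrink the rest by th.

    Assumes th >= 0; a negative threshold would enlarge the coefficients.
    """
    return [math.copysign(abs(c) - th, c) if abs(c) >= th else 0.0
            for c in coeffs]


def denoise_nodes(nodes):
    """Apply the node-specific soft threshold to every leaf node."""
    return [soft_threshold(node, node_threshold(node)) for node in nodes]
```

Unlike hard thresholding, the surviving coefficients are pulled toward zero by the threshold, so the mapping from input to output coefficient has no jump at the threshold.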
The historical meteorological observation signals are denoised by wavelet packet transform. Figure 7 shows the original observed and denoised curves of 2 m relative humidity.

2.3.2. Learning Model: Categorical Boosting

Gradient boosting is an effective and powerful machine-learning technique for solving problems with complex dependencies, noisy data, and heterogeneous characteristics. It has a theoretical explanation of how iteration combines weak models through gradient descent in function space [32] and has demonstrated state-of-the-art performance in a variety of practical tasks, such as ET0 estimation [30], global solar radiation prediction [33], and web searching [35]. Catboost, proposed by Yandex, is a novel gradient boosting algorithm [32] that makes many improvements to overcome model overfitting and to deal with parallelism. Thus, the layout can be completed in less time. Generally, high prediction accuracy is the key point; however, good stability and a low computational workload should also be emphasized when employing machine-/deep-learning models. Some models are inherently unstable and yield less precise estimates on new datasets [30].
Catboost can deal with categorical features well and is employed here as a learning model for short-term weather forecast. It has the following advantages [32]:
Category features: In order to reduce overfitting and utilize the whole dataset for training, an efficient strategy called target statistics (TS) with minimum information loss is employed in Catboost. Specifically, for the input example set $D = \{(x_k, y_k)\}_{k = 1, \ldots, n}$, several random permutations are performed. Then, the average label value is calculated over the examples with the same category value. Finally, all categorical features are substituted according to the following formula:
$$\tilde{x}_k^i = \frac{\sum_{j=1}^{n} \mathbb{1}\left[x_j^i = x_k^i\right] \cdot y_j + \beta \cdot P}{\sum_{j=1}^{n} \mathbb{1}\left[x_j^i = x_k^i\right] + \beta}$$
where the parameter $\beta > 0$, namely the weight of the prior, dampens the noise of low-frequency categories; $P$ is a prior value; $y_k$ is the target; and $x_k = (x_k^1, \ldots, x_k^M)$ is a random vector of $M$ features.
Feature combinations: Catboost builds combinations greedily when the tree constructs a new split. No combination is considered for the first split. However, for the subsequent splits, Catboost combines all the combinations and categorical features in the current tree with all categorical features of the dataset. Moreover, all splits selected in the tree are treated as categories with two values and are used in combinations in the same way.
Unbiased boosting: In Catboost, ordered boosting is developed by theoretical analysis to solve the gradient bias, which is inevitable when the traditional GBDT employs the TS method to convert categorical features into numerical values. Moreover, multiple permutations of the training data are employed to enhance the robustness. Different permutations are utilized for training distinct models, which can deal with the overfitting problem.
Fast scorer: Oblivious trees, which are balanced and less inclined to overfitting, are used as base predictors in Catboost. Moreover, in order to calculate the model predictions, each leaf index in the Catboost model evaluators is encoded as a binary vector whose length is equal to the depth of the tree. Figure 8 shows the structure of the Catboost algorithm.
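The target statistics used for category features can be sketched in plain Python. The first function sums over all examples of the same category, as in the formula for category features; the second restricts the sums to the examples that precede the current one in a permutation, which is how ordered target statistics avoid using an example's own label. Both functions are illustrative and are not the Catboost library API; the station labels in the usage note are hypothetical.

```python
def greedy_target_statistic(categories, targets, prior, beta=1.0):
    """Smoothed mean target per category, using every example."""
    sums, counts = {}, {}
    for cat, y in zip(categories, targets):
        sums[cat] = sums.get(cat, 0.0) + y
        counts[cat] = counts.get(cat, 0) + 1
    return [(sums[c] + beta * prior) / (counts[c] + beta) for c in categories]


def ordered_target_statistic(categories, targets, prior, beta=1.0,
                             permutation=None):
    """Smoothed mean target per category, using only preceding examples.

    permutation lists the example indices in the order they are visited;
    by default the examples are visited in their given order.
    """
    n = len(categories)
    order = list(range(n)) if permutation is None else list(permutation)
    sums, counts = {}, {}
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        encoded[idx] = (s + beta * prior) / (c + beta)   # history only
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded
```

For example, with categories `["s1", "s1", "s2"]`, targets `[1.0, 3.0, 10.0]`, a prior of 0 and `beta = 1`, the first function encodes both `"s1"` rows as 4/3, while the ordered variant encodes the first row with the prior alone because no earlier `"s1"` example exists.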

3. Performance Analysis and Comparisons

In this section, the performance analysis and comparisons of the proposed weather forecasting model with other machine-learning and deep-learning methods are illustrated. Firstly, the evaluation metrics are shown.

3.1. Statistical Evaluation

The root mean squared error ($RMSE$) of the three objective variables over the 10 stations is calculated as the daily evaluation.
$$RMSE_{i,ml} = \sqrt{\frac{\sum_{s=1}^{10} \sum_{t=T_o+1}^{T_o+T_f} \left(y_{i,s}(t) - y_{i,s}^{ml}(t)\right)^2}{10 \cdot T_f}}$$
where $y_{i,s}(t)$ and $y_{i,s}^{ml}(t)$ are the ground truth and the forecast value (by the proposed machine-learning method) of the objective variable $i$ (here, $i$ denotes t2m, rh2m, or w10m) at station $s$ and time $t$, respectively. Similarly, $RMSE_{i,NWP}$ can be obtained, which uses the value $F(t)$ predicted by the NWP method.
$$RMSE_{i,NWP} = \sqrt{\frac{\sum_{s=1}^{10} \sum_{t=T_o+1}^{T_o+T_f} \left(y_{i,s}(t) - F_{i,s}(t)\right)^2}{10 \cdot T_f}}$$
Then, the associated skill score $S_i$ is employed to measure the forecasting improvement over the classic NWP method.
$$S_i = \frac{RMSE_{i,NWP} - RMSE_{i,ml}}{RMSE_{i,NWP}}$$
$S_i > 0$ means that the proposed machine-learning method obtains a lower $RMSE$ and better prediction accuracy than the NWP method. The higher $S_i$ is, the better the forecasting performance of the proposed method.
$$S_{day} = \frac{S_{t2m} + S_{rh2m} + S_{w10m}}{3}$$
$S_{day}$ is the average skill score of the three objective variables and is the ultimate prediction criterion for each day.
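The daily evaluation can be written down directly. The sketch below assumes each variable is stored as a list of per-station series (one value per forecast step); this data layout and the function names are illustrative choices, not the competition's scoring code.

```python
import math

VARIABLES = ("t2m", "rh2m", "w10m")


def rmse(truth, pred):
    """RMSE over all stations and forecast steps.

    truth, pred: list over stations of equally long lists over time steps.
    """
    sq = [(y - p) ** 2
          for ys, ps in zip(truth, pred)
          for y, p in zip(ys, ps)]
    return math.sqrt(sum(sq) / len(sq))


def skill_score(rmse_nwp, rmse_ml):
    """Relative RMSE reduction of the learned model over NWP."""
    return (rmse_nwp - rmse_ml) / rmse_nwp


def daily_score(truth, ml, nwp):
    """Average skill score of the three target variables for one day.

    truth, ml, nwp: dicts mapping variable name -> per-station series.
    """
    scores = [skill_score(rmse(truth[v], nwp[v]), rmse(truth[v], ml[v]))
              for v in VARIABLES]
    return sum(scores) / len(scores)
```

A score of 0.5 means the model halves the NWP error on average, and a negative score means the model is worse than NWP alone.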

3.2. Baselines and Experimental Settings

In this work, the machine-learning method random forest and two deep-learning methods LSTM and Seq2Seq are implemented as baselines for comparisons. Seq2Seq method is well known in the field of natural language processing and is effective to solve timeseries prediction problems.
The tests were performed on a GPU server with a GTX 1080Ti GPU, 11 GB of video memory, and a PyTorch programming environment. As mentioned above, 38 meteorological features are provided by the observed and NWP datasets. The three least correlated features are removed, and the spatiotemporal features are added in the data preprocessing phase. The hyperparameter $T_o = 28$ means that the previous 28 h of observation data are used to model the recent meteorological dynamics. The data from 1148 days (from 1 March 2015 to 1 June 2018) are used as the training set, and data from 87 days (from 3 June 2018 to 29 August 2018) are used as the validation set. Since the evaluation is based on online daily forecasting, the test day index is 1. The main parameters of the random forest method and the Catboost method are the maximum depth and the number of trees, which are optimized by grid search.
The pseudocode of the whole methodology is given in Algorithm 1.
Algorithm 1. Combining WPD and Catboost Method for Weather Forecasting
Input: Historical Observation Datasets $O(t)$, NWP Datasets $F(t)$, Ground Truth $Y(t)$
Output: Prediction $\tilde{Y}(t)$ for t2m, rh2m, w10m
1: Data Cleaning
2: $M_s(t) \leftarrow$ Feature Selection: $O(t)$, $F(t)$
3: $T_i(t) \leftarrow$ Temporal Features: Month, Hour
4: $S_i(t) \leftarrow$ Spatial Features: Station ID
5: $M(t) \leftarrow$ Concat $M_s(t)$, $T_i(t)$, $S_i(t)$
6: $M_w(t) \leftarrow$ Wavelet Packet Denoising: $M(t)$
7: $\tilde{Y}(t) \leftarrow$ Catboost: $M_w(t)$

3.3. Performance Analysis

This work is based on the dataset of an online daily weather forecasting competition that we attended. The daily evaluation score $S_{day}$ and the average score $S_{avg}$ for five consecutive competition days are presented in Table 1; they are based on incremental data released daily according to the real-world forecasting process. In this table, LSTM, Seq2Seq, and random forest (“RF” for brevity) are implemented for comparison. “ST” means that spatiotemporal features are added. “FS” denotes that feature selection is employed. “WPD” indicates that the wavelet packet denoising method is combined.
First, comparing “LSTM + FS” with “LSTM + FS + ST” clearly shows that adding spatiotemporal features remarkably improves the prediction accuracy. Similarly, comparing “Catboost + ST + FS” with “Catboost + ST” shows that feature selection is also effective. The less important features are screened out by the ensemble selection model, which reduces redundant information. To verify the importance of wavelet packet denoising, “RF + FS + ST” can be compared with “RF + FS + ST + WPD”, and “Catboost + ST + FS” with “Catboost + ST + FS + WPD”. In both cases, wavelet packet denoising greatly improves the prediction scores. Wavelet packet denoising also acts as a “pre-learning” process and can effectively shorten the overall learning convergence time. For example, after wavelet packet denoising of the data, Catboost needs only approximately 40 epochs to converge, which is rather short compared with about 100 epochs on the original data.
Second, considering the effect of information fusion, the effectiveness of fusing the NWP forecast can be validated by comparing “Catboost + ST + FS + WPD” with “Catboost + ST + FS + WPD + noNWP”, where the NWP forecast is masked by zero values. Similarly, comparing “Catboost + ST + FS + WPD” with “Catboost + ST + FS + WPD + noOBS” demonstrates the advantage of modeling recent meteorological dynamics. Moreover, $S_{day} > 0$ for “Catboost + ST + FS + WPD + noOBS” indicates that the machine-learning method can improve on the performance of NWP alone. Meanwhile, $S_{day} > 0$ for “Catboost + ST + FS + WPD + noNWP”, which has no NWP information, illustrates that the proposed method is effective at modeling meteorological data. Catboost successfully handles categorical features and uses a new schema for calculating leaf values when selecting the tree structure, which helps to reduce overfitting. These comparisons clearly show that information fusion is better and that modeling with OBS or NWP alone is not good enough. The NWP forecast contains important prior knowledge.
Moreover, the raw RMSE values for temperature, relative humidity, and wind speed with all the methods are presented in Table 2, Table 3 and Table 4, respectively. The proposed “Catboost + ST + FS + WPD” method again achieves the best RMSE values.
For cross-validation, a new training set (from 1 March 2015 to 15 August 2017), validation set (from 16 August 2017 to 30 April 2018), and test set (from 1 May 2018 to 29 August 2018, 120 days) are assigned (the ratio is about 7:2:1). The test results are shown in Table 5. $\bar{S}_{day01\text{-}20}$ is the average $S$ value from the first day to the twentieth day of the test set. $\bar{S}_{avg}$ is the average $S$ value from the first day to the one hundred and twentieth day of the test set. From this table, the proposed “Catboost + ST + FS + WPD” method also achieves the best average $S$ values on the new cross-validation dataset.
To demonstrate the effect of varying important hyper-parameters, the proposed method is tested with different values of “iterations” (the number of trees), and the results are shown in Table 6. “Iterations” is an important hyper-parameter of the Catboost method. $Catboost_{500}$ means that the proposed method is employed with “iterations” = 500. The results verify that hyper-parameters are also important for prediction performance.
In addition to accuracy, minimizing model complexity is also vital for model construction [30]. To illustrate the efficiency of the proposed algorithm in reaching convergence, a computational time cost experiment was carried out on datasets of three sizes. Table 7 presents the computational time costs of the proposed method and the LSTM, RF, and Seq2Seq models for different numbers of stations. The three datasets are Level 1 (data from one station), Level 2 (data from five stations), and Level 3 (data from 10 stations). The average computing time is algorithm specific, and the average time consumed by the proposed algorithm is much less than that of the LSTM, RF, and Seq2Seq algorithms, especially the two deep-learning methods. More notably, the computing time costs of LSTM, RF, and Seq2Seq increase rapidly as the training data size increases, whereas the cost of the proposed method changes little. For the Level 2 and Level 3 datasets (just five and 10 stations), the computational costs of LSTM and Seq2Seq were 128.7–13,200.8 and 320.1–77,421.1 times the cost of the proposed method. The proposed method therefore has a tremendous time cost advantage when predicting the meteorological elements of hundreds of weather stations in practice. Tree-based algorithms can generally build decision trees in parallel, which decreases the computing time to some extent [30]. Therefore, the proposed method offers a clear improvement in time cost, especially when the input meteorological dataset is large.
Furthermore, Figure 9a–c illustrates a forecasting instance of the 2 m temperature, 2 m relative humidity, and 10 m wind speed curves at one station on a competition day. The horizontal axis covers the 37 h forecast period, and the vertical axis gives the unit and value range of each variable. In each sub-figure, the left blue line is the observed meteorological value during the previous 28 h, the right blue line is the ground truth, the brown line is the Seq2Seq prediction, and the red line is the proposed prediction. The proposed method clearly achieves higher prediction accuracy.

4. Conclusions

Based on the historical meteorological observation datasets and numerical weather prediction (NWP) provided by Beijing weather stations, a short-term weather forecasting model based on wavelet packet denoising and Catboost is put forward. A correlation heat map and a tree method are combined for feature selection. Moreover, specific spatiotemporal features are extracted to capture the periodicity and spatial differences of weather features. Wavelet packet denoising is then utilized to provide more effective denoised features, which handles a part of the “learning” task in advance. Test results have demonstrated that, compared with the conventional LSTM, random forest, and Seq2Seq methods, the proposed method incorporating wavelet packet denoising with Catboost can significantly shorten the convergence time of the learning model and decrease the computational cost, as well as notably improve the prediction accuracy.
In this paper, some hyper-parameters (such as the soft threshold $Th_{wp}$ in the wavelet packet denoising and the number of trees “iterations” in Catboost) are important for prediction performance and need to be carefully selected. Future work includes automatically tuning hyper-parameters (e.g., a self-adaptive soft threshold) and trying some ensemble methods. Moreover, the effect of the historical observation sequence length, the fusion strategy for NWP forecast data (e.g., an attention mechanism), and the integration of deep-learning models (e.g., a transformer structure) into the prediction model can be further explored to enhance prediction performance.

Author Contributions

Conceptualization, D.N.; data curation, Z.Z., H.C. and T.Z.; funding acquisition, Z.Z. and X.C.; methodology, D.N. and L.D.; project administration, D.N. and Z.Z.; software, D.N., T.Z. and L.D.; validation, H.C. and T.Z.; visualization, H.C. and T.Z.; writing—original draft, L.D.; writing—review and editing, D.N. All authors have read and agreed to the published version of the manuscript.

Funding


This research was funded by the National Key Research and Development Program of China (No. 2018YFC1506905), Natural Science Foundation of Jiangsu Province of China (No. BK20202006), Zhishan Youth Scholar Program of Southeast University, the Key R&D Program of Jiangsu Province (No. BE2019052, BE2017076).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments


The authors would like to thank Yichao Cao and Junhao Huang for helpful discussions and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References


  1. Gneiting, T.; Raftery, A.E. Weather forecasting with ensemble methods. Science 2005, 310, 248–249. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Jones, N. Machine learning tapped to improve climate forecasts. Nature 2017, 548, 379–380. [Google Scholar] [CrossRef] [PubMed]
  3. Ham, Y.G.; Kim, J.H.; Luo, J.J. Deep learning for multi-year ENSO forecasts. Nature 2019, 573, 568–572. [Google Scholar] [CrossRef] [PubMed]
  4. Ravuri, S.; Lenc, K.; Willson, M.; Kangin, D.; Lam, R.; Mirowski, M. Skillful Precipitation Nowcasting using Deep Generative Models of Radar. Nature 2021, 597, 672–677. [Google Scholar] [CrossRef]
  5. Marchuk, G. Numerical Methods in Weather Prediction; Elsevier: Amsterdam, The Netherlands, 2012. [Google Scholar]
  6. Tolstykh, M.A.; Frolov, A.V. Some current problems in numerical weather prediction. Izv. Atmos. Ocean. Phys. 2005, 41, 285–295. [Google Scholar]
  7. Juanzhen, S.; Ming, X.; James, W.W.; Zawadzki, I.; Ballard, S.P.; Onvlee-Hooimeyer, J.; Pinto, J. Use of NWP for nowcasting convective precipitation: Recent progress and challenges. Bull. Am. Meteorol. Soc. 2014, 95, 409–426. [Google Scholar]
  8. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–10 December 2015; pp. 802–810. [Google Scholar]
  9. McGovern, A.; Elmore, K.L.; Gagne, D.J.; Haupt, S.E.; Karstens, C.D.; Lagerquist, R.; Williams, J.K. Using artificial intelligence to improve real-time decision-making for high-impact weather. Bull. Am. Meteorol. Soc. 2017, 98, 2073–2090. [Google Scholar] [CrossRef]
  10. Basha, C.Z.; Bhavana, N.; Bhavya, P.; Sowmya, V. Rainfall prediction using machine learning & deep learning techniques. In Proceedings of the International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 2–4 July 2020; pp. 92–97. [Google Scholar]
  11. Imam Cholissodin, S. Prediction of rainfall using improved deep learning with particle swarm optimization. Telkomnika 2020, 18, 2498–2504. [Google Scholar] [CrossRef]
  12. Khan, M.I.; Maity, R. Hybrid deep learning approach for multi-step-ahead daily rainfall prediction using GCM simulations. IEEE Access 2020, 8, 52774–52784. [Google Scholar] [CrossRef]
  13. Sapankevych, N.I.; Sankar, R. Time series prediction using support vector machines: A survey. IEEE Comput. Intell. Mag. 2009, 4, 24–38. [Google Scholar] [CrossRef]
  14. Ling, C.; Xu, L. Comparison between ARIMA and ANN models used in short-term wind speed forecasting. In Proceedings of the Power and Energy Engineering Conference (APPEEC), Wuhan, China, 25–28 March 2011; pp. 1–4. [Google Scholar]
  15. Cyril, V.; Marc, M.; Christophe, P.; Marie-Laure, N. Numerical weather prediction (NWP) and hybrid ARMA/ANN model to predict global radiation. Energy 2012, 39, 341–355. [Google Scholar]
  16. Chen, N.; Qian, Z.; Nabney, I.T.; Meng, X. Short-term wind power forecasting using gaussian processes. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013. [Google Scholar]
  17. Grover, A.; Kapoor, A.; Horvitz, E. A deep hybrid model for weather forecasting. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 379–386. [Google Scholar]
  18. Hernández, E.; Sanchez-Anguix, V.; Julian, V.; Palanca, J.; Duque, N. Rainfall prediction: A deep learning approach. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems, Seville, Spain, 18–20 April 2016; pp. 151–162. [Google Scholar]
  19. Li, Y.; Lang, J.; Ji, L.; Zhong, J.; Wang, Z.; Guo, Y.; He, S. Weather forecasting using ensemble of spatial-temporal attention network and multi-layer perceptron. Asia-Pac. J. Atmos. Sci. 2021, 57, 533–546. [Google Scholar] [CrossRef]
  20. Salman, A. Single layer & multi-layer long short-term memory (LSTM) model with intermediate variables for weather forecasting. Procedia Comput. Sci. 2018, 135, 89–98. [Google Scholar]
  21. Fu, Q.; Niu, D.; Zang, Z.; Hao, H.; Li, D. Multi-stations’ weather prediction based on hybrid model using 1D CNN and Bi-LSTM. In Proceedings of the 2019 Chinese Control Conference, Guangzhou, China, 27–30 July 2019; pp. 3771–3775. [Google Scholar]
  22. Wang, B.; Lu, J.; Yan, Z.; Luo, H.; Li, T.; Zheng, Y.; Zhang, G. Deep uncertainty quantification: A machine learning approach Bashafor weather forecasting. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2087–2095. [Google Scholar]
  23. Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Deep Learning for Precipitation Nowcasting: A Benchmark and A New Model. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–7 December 2017; pp. 5617–5627. [Google Scholar]
  24. Wang, Y.; Zhang, J.; Zhu, H. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9154–9162. [Google Scholar]
  25. Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–7 December 2017; pp. 879–888. [Google Scholar]
  26. Wang, Y.; Gao, Z.; Long, M.; Wang, J.; Philip, S.Y. Predrnn++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In Proceedings of the International Conference on Machine Learning (PMLR), Stockholm, Sweden, 10–15 July 2018; pp. 5123–5132. [Google Scholar]
  27. Wang, C.; Wang, P.; Wang, P.; Xue, B.; Wang, D. Using Conditional Generative Adversarial 3D Convolutional Neural Network for Precise Radar Extrapolation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5735–5749. [Google Scholar] [CrossRef]
  28. Liu, H.B.; Lee, I. MPL-GAN: Toward Realistic Meteorological Predictive Learning Using Conditional GAN. IEEE Access 2020, 8, 93179–93186. [Google Scholar] [CrossRef]
  29. Niu, D.; Huang, J.; Zang, Z.; Xu, L.; Che, H.; Tang, Y. Two-Stage Spatiotemporal Context Refinement Network for Precipitation Nowcasting. Remote Sens. 2021, 13, 4285. [Google Scholar] [CrossRef]
  30. Huang, G.; Wu, L.; Ma, X. Evaluation of CatBoost method for prediction of reference evapotranspiration in humid regions. J. Hydrol. 2019, 574, 1029–1041. [Google Scholar] [CrossRef]
  31. Karimi, S.; Shiri, J.; Marti, P. Supplanting missing climatic inputs in classical and random forest models for estimating reference evapotranspiration in humid coastal areas of Iran. Comput. Electron. Agric. 2020, 176, 105633. [Google Scholar] [CrossRef]
  32. Prokhorenkova, L.; Gusev, G.; Vorobev, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 6638–6648. [Google Scholar]
  33. Wu, L.; Huang, G.; Fan, J. Potential of kernel-based nonlinear extension of Arps decline model and gradient boosting with categorical features support for predicting daily global solar radiation in humid regions. Energy Convers. Manag. 2019, 183, 280–295. [Google Scholar] [CrossRef]
  34. Chia, M.Y.; Huang, Y.F.; Koo, C.H. Recent advances in evapotranspiration estimation using artificial intelligence approaches with a focus on hybridization techniques—A review. Agronomy 2020, 10, 101. [Google Scholar] [CrossRef] [Green Version]
  35. Kang, P.; Lin, Z.; Teng, S. Catboost-based framework with additional user information for social media popularity prediction. In Proceedings of the the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2677–2681. [Google Scholar]
  36. März, A. CatBoostLSS-An extension of CatBoost to probabilistic forecasting. arXiv 2020, arXiv:2001.02121. [Google Scholar]
  37. Huang, X.; Zhang, L.; Wang, B. Feature clustering based support vector machine recursive feature elimination for gene selection. Appl. Intell. 2018, 48, 594–607. [Google Scholar] [CrossRef]
  38. Qian, H.; Ma, J.C. Research on fiber optic gyro signal de-noising based on wavelet packet soft-threshold. J. Syst. Eng. Electron. 2009, 20, 607–612. [Google Scholar]
  39. Wu, Y.J.; Gao, G.J.; Cui, C. Improved wavelet denoising by non-convex sparse regularization under double wavelet domains. IEEE Access 2019, 7, 30659–30671. [Google Scholar] [CrossRef]
  40. Pan, Y.; Zhang, L.M.; Wu, X.G.; Zhang, K.N.; Skibniewski, M.J. Structural health monitoring and assessment using wavelet packet energy spectrum. Saf. Sci. 2019, 120, 652–665. [Google Scholar] [CrossRef]
Figure 1. The 24 h means of meteorological variations from 1 March 2015 to 21 May 2018 for 10 stations: (a) t2m, (b) w10m.
Figure 2. The variation of t2m from 1 March 2015 to 21 May 2018.
Figure 3. The clock projection method for temporal feature extraction.
Figure 4. The correlation matrix heat map.
Figure 5. The block diagram of the proposed forecasting model.
Figure 6. Three-layer wavelet packet decomposition.
Figure 7. The original observed and denoised curves of 2 m relative humidity.
Figure 8. The Catboost algorithm structure.
Figure 9. A test sample at one station is chosen to visualize the forecasting of 3 target variables in the future 37 h. (a) 2 m temperature. (b) 2 m relative humidity. (c) 10 m wind speed.
Table 1. The S performance of different methods on 5 days.

| Method | $S_{day1}$ | $S_{day2}$ | $S_{day3}$ | $S_{day4}$ | $S_{day5}$ | $S_{avg}$ |
|---|---|---|---|---|---|---|
| LSTM + FS | 0.0203 | 0.1112 | 0.0503 | 0.1303 | 0.3143 | 0.1253 |
| LSTM + FS + ST | 0.1000 | 0.2978 | 0.1821 | 0.2596 | 0.3488 | 0.2377 |
| RF + FS + ST | 0.1146 | 0.1962 | 0.3582 | 0.4270 | 0.3859 | 0.2964 |
| RF + FS + ST + WPD | 0.1821 | 0.2358 | 0.3990 | 0.4513 | 0.4553 | 0.3447 |
| Seq2Seq + FS + ST + WPD | 0.2587 | 0.3793 | 0.4606 | 0.5320 | 0.5102 | 0.4282 |
| Catboost + ST | 0.2933 | 0.3692 | 0.4610 | 0.5136 | 0.5406 | 0.4355 |
| Catboost + ST + FS | 0.2992 | 0.3921 | 0.4654 | 0.5175 | 0.5434 | 0.4435 |
| Catboost + ST + FS + WPD | 0.3273 | 0.4088 | 0.4908 | 0.5447 | 0.5530 | 0.4649 |
| Catboost + ST + FS + WPD + noNWP | 0.183 | 0.139 | 0.195 | 0.197 | 0.207 | 0.1842 |
| Catboost + ST + FS + WPD + noOBS | 0.016 | 0.314 | 0.343 | 0.396 | 0.401 | 0.294 |
Table 2. The RMSE performances of t2m with different methods.

| Method | $RMSE_{day1}$ | $RMSE_{day2}$ | $RMSE_{day3}$ | $RMSE_{day4}$ | $RMSE_{day5}$ | $RMSE_{avg}$ |
|---|---|---|---|---|---|---|
| LSTM + FS | 4.3167 | 4.5871 | 4.8221 | 5.1009 | 3.9671 | 4.5588 |
| LSTM + FS + ST | 2.4113 | 2.5097 | 2.6214 | 2.7719 | 2.6444 | 2.5917 |
| RF + FS + ST | 2.3121 | 2.3584 | 1.7002 | 1.4721 | 1.5651 | 1.8816 |
| RF + FS + ST + WPD | 2.1167 | 2.1871 | 1.6672 | 1.4323 | 1.5333 | 1.7873 |
| Seq2Seq + FS + ST + WPD | 1.7189 | 1.7228 | 1.2688 | 1.2554 | 1.4117 | 1.4755 |
| Catboost + ST | 1.1699 | 1.7337 | 1.2561 | 1.3019 | 1.1887 | 1.3300 |
| Catboost + ST + FS | 1.1642 | 1.6993 | 1.2343 | 1.2982 | 1.1832 | 1.3158 |
| Catboost + ST + FS + WPD | 1.0596 | 1.6738 | 1.0977 | 1.1276 | 1.1253 | 1.2168 |
| Catboost + ST + FS + WPD + noNWP | 2.1002 | 3.2278 | 2.5180 | 3.0988 | 4.4333 | 3.0756 |
| Catboost + ST + FS + WPD + noOBS | 4.9663 | 2.8534 | 2.1721 | 2.2564 | 1.5477 | 2.7592 |
Table 3. The RMSE performances of rh2m with different methods.

| Method | $RMSE_{day1}$ | $RMSE_{day2}$ | $RMSE_{day3}$ | $RMSE_{day4}$ | $RMSE_{day5}$ | $RMSE_{avg}$ |
|---|---|---|---|---|---|---|
| LSTM + FS | 13.2332 | 13.8361 | 13.3123 | 14.5241 | 12.1123 | 13.4036 |
| LSTM + FS + ST | 11.7341 | 8.4723 | 8.2121 | 11.0852 | 9.4516 | 9.7910 |
| RF + FS + ST | 12.1946 | 12.1678 | 6.5207 | 9.0303 | 8.6171 | 9.7061 |
| RF + FS + ST + WPD | 11.1348 | 11.4681 | 6.2913 | 8.9234 | 8.4588 | 9.2553 |
| Seq2Seq + FS + ST + WPD | 9.9271 | 8.5661 | 5.7026 | 6.8698 | 7.3950 | 7.6921 |
| Catboost + ST | 9.4694 | 8.6883 | 6.6771 | 7.3775 | 7.0220 | 7.8469 |
| Catboost + ST + FS | 9.3367 | 8.3433 | 6.6651 | 7.3379 | 7.0112 | 7.7388 |
| Catboost + ST + FS + WPD | 9.2295 | 8.3106 | 6.4138 | 7.1386 | 6.9638 | 7.6113 |
| Catboost + ST + FS + WPD + noNWP | 10.4456 | 12.3314 | 8.0278 | 11.4123 | 12.9973 | 11.0429 |
| Catboost + ST + FS + WPD + noOBS | 13.4584 | 9.8877 | 7.7854 | 10.5543 | 8.5771 | 10.0526 |
Table 4. The RMSE performances of w10m with different methods.

| Method | $RMSE_{day1}$ | $RMSE_{day2}$ | $RMSE_{day3}$ | $RMSE_{day4}$ | $RMSE_{day5}$ | $RMSE_{avg}$ |
|---|---|---|---|---|---|---|
| LSTM + FS | 2.1244 | 1.3122 | 1.4667 | 1.1883 | 1.2671 | 1.47174 |
| LSTM + FS + ST | 1.6000 | 0.9260 | 0.9670 | 0.8750 | 0.8140 | 1.0364 |
| RF + FS + ST | 1.5527 | 1.0406 | 0.7638 | 0.8242 | 0.8862 | 1.0135 |
| RF + FS + ST + WPD | 1.5488 | 1.0091 | 0.7505 | 0.8009 | 0.8801 | 0.99788 |
| Seq2Seq + FS + ST + WPD | 1.5357 | 0.9507 | 0.7392 | 0.7340 | 0.8982 | 0.97156 |
| Catboost + ST | 1.4889 | 0.9355 | 0.7311 | 0.7502 | 0.8663 | 0.9544 |
| Catboost + ST + FS | 1.4019 | 0.9031 | 0.7229 | 0.7442 | 0.8547 | 0.92536 |
| Catboost + ST + FS + WPD | 1.3631 | 0.8883 | 0.7178 | 0.7365 | 0.8114 | 0.90342 |
| Catboost + ST + FS + WPD + noNWP | 1.5439 | 1.0218 | 0.9512 | 0.9338 | 1.3304 | 1.15622 |
| Catboost + ST + FS + WPD + noOBS | 2.2883 | 0.9802 | 0.8856 | 0.9117 | 0.8834 | 1.18984 |
Table 5. The statistical average S performances with different methods.

| Method | $\bar{S}_{day01-20}$ | $\bar{S}_{day21-40}$ | $\bar{S}_{day41-60}$ | $\bar{S}_{day61-80}$ | $\bar{S}_{day81-100}$ | $\bar{S}_{day101-120}$ | $\bar{S}_{avg}$ |
|---|---|---|---|---|---|---|---|
| LSTM + FS | 0.0992 | 0.2273 | 0.1372 | 0.1391 | 0.1335 | 0.1763 | 0.1521 |
| LSTM + FS + ST | 0.3109 | 0.2755 | 0.2859 | 0.2679 | 0.1677 | 0.2056 | 0.2522 |
| RF + FS + ST | 0.3771 | 0.3902 | 0.3561 | 0.2997 | 0.1984 | 0.2559 | 0.3129 |
| RF + FS + ST + WPD | 0.4358 | 0.4376 | 0.3997 | 0.3208 | 0.2110 | 0.3278 | 0.3555 |
| Seq2Seq + FS + ST + WPD | 0.5119 | 0.5005 | 0.4267 | 0.3329 | 0.2655 | 0.3792 | 0.4028 |
| Catboost + ST | 0.5134 | 0.5137 | 0.4215 | 0.3517 | 0.3287 | 0.4116 | 0.4234 |
| Catboost + ST + FS | 0.5264 | 0.5528 | 0.4293 | 0.3623 | 0.3319 | 0.4479 | 0.4418 |
| Catboost + ST + FS + WPD | 0.5672 | 0.5837 | 0.4370 | 0.3878 | 0.3536 | 0.4732 | 0.4671 |
| Catboost + ST + FS + WPD + noNWP | 0.449 | 0.341 | 0.269 | 0.188 | 0.109 | 0.165 | 0.2535 |
| Catboost + ST + FS + WPD + noOBS | 0.075 | 0.465 | 0.373 | 0.295 | 0.237 | 0.317 | 0.2937 |
Table 6. The S performances of the proposed method with different "iterations".

| Method | $S_{day1}$ | $S_{day2}$ | $S_{day3}$ | $S_{day4}$ | $S_{day5}$ | $S_{avg}$ |
|---|---|---|---|---|---|---|
| $Catboost_{500}$ | 0.3109 | 0.3913 | 0.4755 | 0.5226 | 0.5337 | 0.4396 |
| $Catboost_{1000}$ | 0.3188 | 0.3991 | 0.4837 | 0.5316 | 0.5411 | 0.4549 |
| $Catboost_{2000}$ | 0.3229 | 0.4025 | 0.4869 | 0.5365 | 0.5469 | 0.4591 |
| $Catboost_{3000}$ | 0.3273 | 0.4088 | 0.4908 | 0.5447 | 0.553 | 0.4649 |
| $Catboost_{4000}$ | 0.3246 | 0.4053 | 0.4889 | 0.5414 | 0.5492 | 0.4619 |
| $Catboost_{5000}$ | 0.3238 | 0.4031 | 0.488 | 0.5377 | 0.5471 | 0.4599 |
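The selection of "iterations" implied by Table 6 can be sketched as a small grid search. The snippet below is only an illustration: `select_iterations` and the `evaluate` callback are hypothetical names, and instead of training a model the callback simply looks up the $S_{avg}$ values reported in Table 6.

```python
# S_avg values copied from Table 6, keyed by the Catboost "iterations" setting.
S_AVG_BY_ITERATIONS = {
    500: 0.4396,
    1000: 0.4549,
    2000: 0.4591,
    3000: 0.4649,
    4000: 0.4619,
    5000: 0.4599,
}

def select_iterations(candidates, evaluate):
    """Return (best_candidate, best_score) for a score where higher is better.

    `evaluate` is a hypothetical callback; in practice it would train the
    model with the given number of trees and return a validation score.
    """
    scored = [(evaluate(n), n) for n in candidates]
    best_score, best_n = max(scored)
    return best_n, best_score

# Stand-in evaluation: look the score up in the table instead of training.
best_n, best_score = select_iterations(
    sorted(S_AVG_BY_ITERATIONS), S_AVG_BY_ITERATIONS.get
)
print(best_n, best_score)
```

On the tabulated values the score rises up to 3000 trees and declines slightly beyond that, so adding more trees does not keep improving the S score.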
Table 7. Computational costs of the four algorithms at different levels of input data size.

| Method | Level 1 Dataset (1 Station) | Level 2 Dataset (5 Stations) | Level 3 Dataset (10 Stations) |
|---|---|---|---|
| RF | 39 s | 222 s | 724 s |
| LSTM | 1791 s | 8754 s | 37,774 s |
| Seq2Seq | 87,291 s | 897,654 s | 9,135,680 s |
| The proposed method | 37 s | 68 s | 118 s |
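As a quick reading aid for Table 7, the following sketch computes how each method's cost grows from the 1-station to the 10-station dataset and how much slower each method is than the proposed one on the 10-station dataset. The numbers are copied from Table 7; the variable names are illustrative only.

```python
# Costs in seconds from Table 7: (1 station, 5 stations, 10 stations).
COSTS = {
    "RF": (39, 222, 724),
    "LSTM": (1791, 8754, 37774),
    "Seq2Seq": (87291, 897654, 9135680),
    "Proposed": (37, 68, 118),
}

# Growth factor when the input grows from 1 station to 10 stations.
growth = {name: c[2] / c[0] for name, c in COSTS.items()}

# Cost of each method relative to the proposed method on 10 stations.
relative_cost = {name: c[2] / COSTS["Proposed"][2] for name, c in COSTS.items()}

for name in COSTS:
    print(f"{name:9s} growth x{growth[name]:.2f}  "
          f"relative cost x{relative_cost[name]:.1f}")
```

The proposed method's cost grows by roughly a factor of 3 for a tenfold increase in stations, whereas the other three methods grow by a factor of about 19 or more.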
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Niu, D.; Diao, L.; Zang, Z.; Che, H.; Zhang, T.; Chen, X. A Machine-Learning Approach Combining Wavelet Packet Denoising with Catboost for Weather Forecasting. Atmosphere 2021, 12, 1618.
