Prediction Skill of Extended Range 2-m Maximum Air Temperature Probabilistic Forecasts Using Machine Learning Post-Processing Methods

: The extended range temperature prediction is of great importance for public health, energy and agriculture. The two machine learning methods, namely, the neural networks and natural gradient boosting (NGBoost), are applied to improve the prediction skills of the 2-m maximum air temperature with lead times of 1–35 days over East Asia based on the Environmental Modeling Center, Global Ensemble Forecast System (EMC-GEFS), under the Subseasonal Experiment (SubX) of the National Centers for Environmental Prediction (NCEP). The ensemble model output statistics (EMOS) method is conducted as the benchmark for comparison. The results show that all the post-processing methods can e ﬃ ciently reduce the prediction biases and uncertainties, especially in the lead week 1–2. The two machine learning methods outperform EMOS by approximately 0.2 in terms of the continuous ranked probability score (CRPS) overall. The neural networks and NGBoost behave as the best models in more than 90% of the study area over the validation period. In our study, CRPS, which is not a common loss function in machine learning, is introduced to make probabilistic forecasting possible for traditional neural networks. Moreover, we extend the NGBoost model to atmospheric sciences of probabilistic temperature forecasting which obtains satisfying performances.


Introduction
Subseasonal or extended range weather forecasts are used to predict heat waves, extreme cold events, thunderstorms, droughts and floods as far as four weeks ahead. Subseasonal forecasts can deliver relevant weather information, such as the timing of the onset of a rainy season and the risk of extreme rainfall events or heat waves. However, there is a well-known gap in current numerical prediction systems for the subseasonal timescale of 10 days to one month. This gap falls between medium-range weather forecasts (up to 10 days) and seasonal climate predictions (longer than one month). Medium-range weather forecasts are influenced by the initial conditions of the atmosphere, whereas predictions of the seasonal climate are more affected by slowly evolving surface boundary conditions, such as the sea surface temperature and soil moisture content [1][2][3][4][5][6]. Predictions on the subseasonal timescale have made progress in some regions and seasons [1,7,8], although the full potential of their predictability requires further exploration. atmosphere-land model forced with prescribed sea surface temperatures. This base model contributes 11 ensemble members to the SubX reforecasts. The group provides reforecasts for the 1999-2016 period with weekly initialization and a lead time of 1-35 days. The anomaly correlation coefficients of precipitation and the 2-m temperature for week three, the anomaly correlation coefficients of the Real-time Multivariate Madden-Julian Oscillation indices and the North Atlantic Oscillation index all prove the superior prediction skills of the EMC-GEFS model [9].
We focus on the 2-m maximum air temperature forecasts over East Asia  • N, 70-140 • E) with a resolution of 1 • × 1 • for a lead time of 1-35 days. To ensure that our verification procedure mimics operational conditions, we set aside the data for the year 2016 as a validation set. The observations used for verification are obtained from the National Oceanic and Atmospheric Administration Climate Prediction Center (ftp://ftp.cdc.noaa.gov/Datasets/cpc_global_temp/) [9]. For the 2-m maximum air temperature over the land, the Climate Prediction Center provides a maximum daily temperature (T max ) dataset with a horizontal resolution of 0.5 • × 0.5 • [28]. The verification data are re-gridded to a coarser EMC-GEFS model resolution of 1 • × 1 • .

Ensemble Model Output Statistics
EMOS is a variant of the model output statistics method and regression techniques designed for probabilistic forecasting [29,30]. In the simple frame of model output statistics, which only uses the ensemble member forecasts x or x 1 , . . . , x k as predictors and multiple linear regression as the transfer function, the predictand y can be written as: where a and b 1 , . . . , b k (or denoted by b) are the regression coefficients; K is the number of ensemble members. There has been little research into the application of regression techniques to probabilistic forecasting [31]. Following Gneiting et al. [11], the conditional distribution of the predictand y based on the ensemble member forecast xcan be modeled by a single parametric distribution P θ with parameters θ ∈ R d : y x ∼ P θ (x) (2) When the distribution of the weather variable of interest y (e.g., temperature) is Gaussian, Equation (2) can be written as: where µ is the mean and σ is the standard deviation. By applying regression theory to probabilistic forecasting, the EMOS method attempts to reduce the bias between the predictive mean/variance and the regression estimate by using a bias-corrected weighted average. Hence we use a linear function with the ensemble mean and spread to fit the predictive mean and variance: where S 2 is the ensemble spread and c and d are nonnegative coefficients. Combining Equations (3) and (4), the Gaussian predictive distribution can be written as: These EMOS coefficients θ= (a, b, c, d) are estimated by minimizing the correct scoring rule (e.g., the continuous ranked probability score (CRPS) or the maximum likelihood estimation (MLE)) during the training period. Our EMOS experiments are implemented in R using the scoringRules package [32].

Neural Networks
As in the flowchart of neural networks shown in Figure 1, the neural networks consist of nodes organized in layers. The first (input) layer contains the input features, whereas the last (output) layer represents the output targets. The layers between the input and output layers are referred to as hidden layers. Apart from the input features, each node in the network can be computed as: where f is an activation function, z l j is the value of the jth node in the lth layer and a l j is the weighted average of the outputs z l−1 i in the previous layer. Additionally, a l j can be written as: w l ji is the weight between the ith node in the (l − 1)th layer and the jth node in the lth layer; b is a bias term and is usually constant. The activation function f is usually a nonlinear function which allows the network to be more complex and robust and the rectified linear unit (ReLU) is used here apart from in the output layer. The weights and biases are optimized to reduce the loss function using the Adam optimization method [33]. The topological structure of the neural networks is of great importance for its performance in which the number of nodes in each hidden layer is usually optimized via cross-validation. In our study, we take the 11 raw ensemble forecasts as the inputs and the predictive parameter µ and σ as the targets of the neural networks. Then we use an analogous-linear search method to test the optimal configuration of the neural networks where we build several models which consist of different number of nodes (i.e., 8, 16, 32, . . . , 256, 512, 1024) in the hidden layers. After the assessments (not shown), we finally build the neural networks model consisting of two hidden layers which contains 64 and 256 nodes, respectively.
Atmosphere 2020, 11, x FOR PEER REVIEW 4 of 17 represents the output targets. The layers between the input and output layers are referred to as hidden layers. Apart from the input features, each node in the network can be computed as: where is an activation function, is the value of the th node in the th layer and is the weighted average of the outputs −1 in the previous layer. Additionally, can be written as: = is the weight between the th node in the ( − 1)th layer and the th node in the th layer; is a bias term and is usually constant. The activation function is usually a nonlinear function which allows the network to be more complex and robust and the rectified linear unit (ReLU) is used here apart from in the output layer. The weights and biases are optimized to reduce the loss function using the Adam optimization method [33]. The topological structure of the neural networks is of great importance for its performance in which the number of nodes in each hidden layer is usually optimized via cross-validation. In our study, we take the 11 raw ensemble forecasts as the inputs and the predictive parameter μ and σ as the targets of the neural networks. Then we use an analogous-linear search method to test the optimal configuration of the neural networks where we build several models which consist of different number of nodes (i.e., 8, 16, 32, …, 256, 512, 1024) in the hidden layers. After the assessments (not shown), we finally build the neural networks model consisting of two hidden layers which contains 64 and 256 nodes, respectively. or 1 , 2 , … , 11 , the raw ensemble forecasts and as the inputs; or 1 , , … , , the weights of nodes; or 1 , 2 , … , , the weighted average of the outputs in the previous layer; or 1 , 2 , … , , the outputs with the rectified linear function, ReLU, applied in ; and , the two target parameters: mean and standard deviation; , the verification or the observation; CRPS, the continuous ranked probability score. Here, is the number of the nodes.
Neural networks can be applied to a range of problems but have rarely been applied to probabilistic forecasting [24]. The difficulty lies in building the correct loss function for probabilistic forecasting. Here, we use a closed form expression of the CRPS (see Section 3.4) with a Gaussian distribution for temperature probabilistic forecasting. The experiment described the conditional distribution of the observation given the ensemble member forecast as the input. Figure 2 presents the flowchart of NGBoost. NGBoost is a supervised learning method for probabilistic forecasting. The approach uses the natural gradient to address the technical challenges that are difficult in generic probabilistic forecasts with existing gradient boosting methods. The origins of the natural gradient can be traced to the field of information geometry [34], where it was . . , f 11 , the raw ensemble forecasts and as the inputs; W or W 1 , W, . . . , W m , the weights of nodes; a or a 1 , a 2 , . . . ,a m , the weighted average of the outputs in the previous layer; z or z 1 , z 2 , . . . , z m , the outputs with the rectified linear function, ReLU, applied in a; µ and σ, the two target parameters: mean and standard deviation; 0, the verification or the observation; CRPS, the continuous ranked probability score. Here, m is the number of the nodes.

Natural Gradient Boosting
Neural networks can be applied to a range of problems but have rarely been applied to probabilistic forecasting [24]. The difficulty lies in building the correct loss function for probabilistic forecasting. Here, we use a closed form expression of the CRPS (see Section 3.4) with a Gaussian distribution for temperature probabilistic forecasting. The experiment described the conditional distribution of the observation y given the ensemble member forecast x as the input. Figure 2 presents the flowchart of NGBoost. NGBoost is a supervised learning method for probabilistic forecasting. The approach uses the natural gradient to address the technical challenges that are difficult in generic probabilistic forecasts with existing gradient boosting methods. The origins of the natural gradient can be traced to the field of information geometry [34], where it was initially defined for the statistical manifold with the distance measure using Kullback-Leibler divergence [35]. The generalized natural gradient is the direction of steepest ascent in Riemannian space and is formed as:

Natural Gradient Boosting
Atmosphere 2020, 11, x FOR PEER REVIEW 5 of 17 divergence [35]. The generalized natural gradient is the direction of steepest ascent in Riemannian space and is formed as: where ℐ ( ) is the Riemannian metric and ∇ ( , ) is the ordinary gradient of a scoring rule . The natural gradient is invariant to parametrization, which distinguishes it from the ordinary gradient and helps to reflect how the space of distribution moves when updating. The algorithm has three main components: (1) the proper scoring rule (e.g., MLE or CRPS); (2) the parametric probability distribution (e.g., normal or Laplace); (3) the base learner (e.g., decision tree or linear). Different configurations are required for different kinds of problems. or , , … , , the base learner; and , the two target parameters: mean and standard deviation; , the verification or the observation; CRPS, the continuous ranked probability score. Here, is the number of base learners.
By assigning a score between a forecasted probability density at the verifying observation , the scoring rule ( , ) represents the bias between the forecast and the true distribution [36]. Proper scoring rules prompt the model to obtain more calibrated probabilities during training as loss functions. The most commonly used proper scoring rules are MLE and CRPS, although CRPS is generally considered more robust than MLE [37]. Therefore, we chose CRPS as the target scoring rule in our experiments (see Section 3.4).
We focused on temperature probabilistic forecasting. The input contains raw ensemble forecasts and the target is the corresponding observation. Assuming that the variable temperature follows a normal distribution, = (µ, ) are appointed as the predicted parameters and linear learners are used here as the base learner to speed up the calculation. The total number of training linear learners is set up to 100.
To obtain the predicted parameters for the input , each base learner ( ) ( = 1,…, ) first makes an independent prediction based on the same and then the results are integrated. Note that there are two base learners ( ) = ( ) , ( ) per stage for a normal distribution with parameters µ and .
We use a stage-specific scaling factor ( ) and a common learning rate to scale the predicted outputs in th iteration: where Atmosphere 2020, 11, x FOR PEER REVIEW 5 of 17 divergence [35]. The generalized natural gradient is the direction of steepest ascent in Riemannian space and is formed as: where ℐ ( ) is the Riemannian metric and ∇ ( , ) is the ordinary gradient of a scoring rule . The natural gradient is invariant to parametrization, which distinguishes it from the ordinary gradient and helps to reflect how the space of distribution moves when updating. The algorithm has three main components: (1) the proper scoring rule (e.g., MLE or CRPS); (2) the parametric probability distribution (e.g., normal or Laplace); (3) the base learner (e.g., decision tree or linear). Different configurations are required for different kinds of problems. or , , … , , the base learner; and , the two target parameters: mean and standard deviation; , the verification or the observation; CRPS, the continuous ranked probability score. Here, is the number of base learners.
By assigning a score between a forecasted probability density at the verifying observation , the scoring rule ( , ) represents the bias between the forecast and the true distribution [36]. Proper scoring rules prompt the model to obtain more calibrated probabilities during training as loss functions. The most commonly used proper scoring rules are MLE and CRPS, although CRPS is generally considered more robust than MLE [37]. Therefore, we chose CRPS as the target scoring rule in our experiments (see Section 3.4).
We focused on temperature probabilistic forecasting. The input contains raw ensemble forecasts and the target is the corresponding observation. Assuming that the variable temperature follows a normal distribution, = (µ, ) are appointed as the predicted parameters and linear learners are used here as the base learner to speed up the calculation. The total number of training linear learners is set up to 100.
To obtain the predicted parameters for the input , each base learner ( ) ( = 1,…, ) first makes an independent prediction based on the same and then the results are integrated. Note that there are two base learners ( ) = ( ) , ( ) per stage for a normal distribution with parameters µ and .
We use a stage-specific scaling factor ( ) and a common learning rate to scale the predicted outputs in th iteration: is the Riemannian metric and ∇S(θ, y) is the ordinary gradient of a scoring rule S. The natural gradient is invariant to parametrization, which distinguishes it from the ordinary gradient and helps to reflect how the space of distribution moves when updating. The algorithm has three main components: (1) the proper scoring rule (e.g., MLE or CRPS); (2) the parametric probability distribution (e.g., normal or Laplace); (3) the base learner (e.g., decision tree or linear). Different configurations are required for different kinds of problems.
Atmosphere 2020, 11, x FOR PEER REVIEW 5 of 17 divergence [35]. The generalized natural gradient is the direction of steepest ascent in Riemannian space and is formed as: where ℐ ( ) is the Riemannian metric and ∇ ( , ) is the ordinary gradient of a scoring rule . The natural gradient is invariant to parametrization, which distinguishes it from the ordinary gradient and helps to reflect how the space of distribution moves when updating. The algorithm has three main components: (1) the proper scoring rule (e.g., MLE or CRPS); (2) the parametric probability distribution (e.g., normal or Laplace); (3) the base learner (e.g., decision tree or linear). Different configurations are required for different kinds of problems. . or 1 , 2 , … , 11 , the raw ensemble forecasts and as the inputs; or 1 , 2 , … , , the base learner; and , the two target parameters: mean and standard deviation; , the verification or the observation; CRPS, the continuous ranked probability score. Here, is the number of base learners.
By assigning a score between a forecasted probability density at the verifying observation , the scoring rule ( , ) represents the bias between the forecast and the true distribution [36]. Proper scoring rules prompt the model to obtain more calibrated probabilities during training as loss functions. The most commonly used proper scoring rules are MLE and CRPS, although CRPS is generally considered more robust than MLE [37]. Therefore, we chose CRPS as the target scoring rule in our experiments (see Section 3.4).
We focused on temperature probabilistic forecasting. The input contains raw ensemble forecasts and the target is the corresponding observation. Assuming that the variable temperature follows a normal distribution, = (µ, ) are appointed as the predicted parameters and linear learners are used here as the base learner to speed up the calculation. The total number of training linear learners is set up to 100.
To obtain the predicted parameters for the input , each base learner ( ) ( = 1,…, ) first makes an independent prediction based on the same and then the results are integrated. Note that there are two base learners ( ) = ( ( ) , ( ) ) per stage for a normal distribution with parameters µ and . We use a stage-specific scaling factor ( ) and a common learning rate to scale the predicted outputs in th iteration: . . , f 11 , the raw ensemble forecasts and as the inputs; L or L 1 , L 2 , . . . , L m , the base learner; µ and σ, the two target parameters: mean and standard deviation; 0, the verification or the observation; CRPS, the continuous ranked probability score. Here, m is the number of base learners.
By assigning a score between a forecasted probability density f at the verifying observation y, the scoring rule S(f, y) represents the bias between the forecast and the true distribution [36]. Proper scoring rules prompt the model to obtain more calibrated probabilities during training as loss functions. The most commonly used proper scoring rules are MLE and CRPS, although CRPS is generally considered more robust than MLE [37]. Therefore, we chose CRPS as the target scoring rule in our experiments (see Section 3.4).
We focused on temperature probabilistic forecasting. The input x contains raw ensemble forecasts and the target y is the corresponding observation. Assuming that the variable temperature follows a normal distribution, θ = (µ, log σ) are appointed as the predicted parameters and linear learners are used here as the base learner l to speed up the calculation. The total number M of training linear learners is set up to 100.
To obtain the predicted parameters θfor the input x, each base learner l (k) (k = 1, . . . , m) first makes an independent prediction based on the same xand then the results are integrated. Note that there are two base learners l (k) = l log σ per stage for a normal distribution with parameters µ and log σ.
We use a stage-specific scaling factor ρ (k) and a common learning rate η to scale the predicted outputs in ith iteration: The predicted parameters are updated to θ (i) by adding to each θ (i−1) the scaled projected natural gradient ρ (k) ·l (k) (x) and scaled by a small learning rate η. The learning algorithm starts with a random common θ (0) and minimizes the sum of the scoring rule S by training all the training samples. More details are available in Duan et al. [25].

Verification Methods
Probabilistic forecasting aims to maximize the sharpness of the predictive distributions through calibration [38]. For calibration, the forecast probability density functions (PDFs) and verifications are expected to be consistent with each other and, for sharpness, the prediction intervals are expected to be narrowed by post-processing methods to obtain a more concentrated and sharper forecast PDF.
The Talagrand diagrams, also known as the verification rank histogram [39][40][41][42] and the probability integral transform (PIT) histograms are often adopted to evaluate the calibration of the ensemble forecast. They are analogous, but the Talagrand diagrams tend to assess the spread, whereas the PIT histograms assess the PDF forecasts. An ideal ensemble forecast usually behaves as a uniform distribution. However, in most cases, it exhibits U-shaped PIT histograms due to the under-dispersive forecasts.
Here, PIT histograms have been selected to discuss the calibration of the post-processing methods.
In addition to showing PIT histograms, we computed the coverage and average width of the 88.33% central prediction interval, which is chosen from the range of an 11-member ensemble. The coverage represents the accuracy while the average width is used to assess the sharpness.
To verify the deterministic forecasts, we computed the mean absolute error (MAE) and the root-mean-square error (RMSE). Denoting µ s,t as a deterministic forecast and y s,t as the observation, the MAE is defined as: and the RMSE can be written as: where N is the number of cases in the training set. We also computed the CRPS for the assessment of predictive PDFs to address the calibration and sharpness simultaneously. The CRPS is the integral of the Brier scores at all possible threshold values for a continuous predictand [43]. Denoting F θ as the predictive cumulative distribution function (CDF) with parameters θ and y as the observations, the CRPS is defined as: when F θ is the CDF of a normal distribution with θ (µ, σ 2 ), Equation (11) can be written as: where ϕ and Φ denote the PDF and the CDF, respectively. The CRPS is negatively oriented in a similar manner to the MAE, where a smaller value is better.
The continuous ranked probability skill score (CRPSS) is used to measure the probabilistic skill relative to a reference predictive distribution F ref : CRPS F re f , y The CRPSS is positively oriented, which means that positive values indicate an improvement on the reference forecast. We use the raw ensemble forecasts as the reference forecasts.
Here, we assess the application of EMOS and two machine learning post-processing methods to 1-35 day forecasts of the 2-m maximum air temperature for all land grid points over East Asia (15-60 • N, 70-140 • E) using the raw ensemble forecast from EMC-GEFS described by Zhu [44].
The calendar year 2016 is used as the validation period. Noted that the EMC-GEFS model initializes once a week so the actual validation period is 53 days. Table 1 provides the summary measures of the deterministic-style forecast accuracy. For a better visualization and understanding, we average the MAE and RMSE with a lead time of one week instead of one day. The MAE in the first lead week (week 1) is the average MAE of each grid point in the study area during the validation period, with a lead time of 1-7 days. The deterministic neural network forecast clearly gives the best performance, with the mean MAE 24% and 9% reduced than the mean of the raw ensemble and the EMOS forecasts, respectively, for lead times of 1-35 days. The NGBoost method shows similar results, whereas the neural network method performs slightly better for longer lead times. The comparison of the CRPS between these three post-processing methods and the raw ensemble is presented in Figure 3. It shows that the raw ensemble forecasts are of great uncertainty and all the post-processing methods are able to reduce the CRPS values relative to the raw ensemble for all lead times, especially the two machine learning methods. The improvement in performance could be divided into two parts: weeks 1-2 and weeks 3-5. At a lead time of weeks 1-2, the prediction skill of all the post-processing methods and the raw ensemble decreases with increasing lead time in terms of the CRPS. All the post-processing methods reduced the forecast bias sharply. The neural network and NGBoost methods perform best and increase the available lead time for about 10 days. EMOS achieves similar results to the two machine learning methods in week 1 but performs relatively poorly following week 2. The prediction skill of the post-processing methods and the raw ensemble tends to be stable with little variance as the lead time increases over week 3.  Table 2 presents the CRPS and the coverage and average width of the 88.33% central prediction intervals to assess the accuracy and sharpness, respectively. NGBoost reduces the CRPS score by 39% and 7.5%, respectively, compared with the raw ensemble and the EMOS. The neural network achieves similar results in terms of the CRPS, whereas the NGBoost prediction intervals show better coverage. The raw ensemble prediction intervals are narrow and the accuracy and coverage are unsatisfactory. By contrast, the NGBoost prediction intervals are not much wider than those of the raw ensemble, presenting better CRPS and coverage scores. CRPSS is useful in probabilistic forecasts of multi-category events. Positive values of CRPSS indicate that the forecasts are better than the reference prediction. Figure 4 shows the mean CRPSS of different post-processed forecasts compared with the raw ensemble in different lead weeks. The two machine learning post-processing methods achieve the best and similar results, with mean CRPSS values >0.18 at each lead week. EMOS has a poorer performance, although the mean CRPSS  Table 2 presents the CRPS and the coverage and average width of the 88.33% central prediction intervals to assess the accuracy and sharpness, respectively. NGBoost reduces the CRPS score by 39% and 7.5%, respectively, compared with the raw ensemble and the EMOS. The neural network achieves similar results in terms of the CRPS, whereas the NGBoost prediction intervals show better coverage. The raw ensemble prediction intervals are narrow and the accuracy and coverage are unsatisfactory. By contrast, the NGBoost prediction intervals are not much wider than those of the raw ensemble, presenting better CRPS and coverage scores. CRPSS is useful in probabilistic forecasts of multi-category events. Positive values of CRPSS indicate that the forecasts are better than the reference prediction. Figure 4 shows the mean CRPSS of different post-processed forecasts compared with the raw ensemble in different lead weeks. The two machine learning post-processing methods achieve the best and similar results, with mean CRPSS values >0.18 at each lead week. EMOS has a poorer performance, although the mean CRPSS values are >0.08. These results suggest that all the post-processing methods improve the raw ensemble probabilistic forecasts and that the two machine learning methods show the best performance.

Overall Performance of EMOS, the Neural Network and NGBoost
Atmosphere 2020, 11, x FOR PEER REVIEW 9 of 17 values are >0.08. These results suggest that all the post-processing methods improve the raw ensemble probabilistic forecasts and that the two machine learning methods show the best performance. The PIT histograms of the raw and post-processed forecasts used to assess the calibration are shown in Figure 5. The raw ensemble forecasts are under-dispersed, as indicated by the U-shaped verification rank histogram at almost all the selected lead days. This means that the observations often fall outside the range of the raw ensemble. By contrast, the two machine learning post-processed forecast distributions are better calibrated and the corresponding PIT histograms show much smaller deviations from uniformity at each selected lead day. From this perspective, by calibrating the underdispersive raw ensemble forecasts with nonlinear functions in the neural networks and NGBoost frames, we obtain more idealized ensemble forecasts which reduce the uncertainty and improve the accuracy. The poor performance of EMOS in the PIT histograms with the increasing lead days may reflect the disadvantages of the traditional method that is based on the multi-linear regression techniques. The PIT histograms of the raw and post-processed forecasts used to assess the calibration are shown in Figure 5. The raw ensemble forecasts are under-dispersed, as indicated by the U-shaped verification rank histogram at almost all the selected lead days. This means that the observations often fall outside the range of the raw ensemble. By contrast, the two machine learning post-processed forecast distributions are better calibrated and the corresponding PIT histograms show much smaller deviations from uniformity at each selected lead day. From this perspective, by calibrating the under-dispersive raw ensemble forecasts with nonlinear functions in the neural networks and NGBoost frames, we obtain more idealized ensemble forecasts which reduce the uncertainty and improve the accuracy. The poor performance of EMOS in the PIT histograms with the increasing lead days may reflect the disadvantages of the traditional method that is based on the multi-linear regression techniques.   Figure 6 shows that the raw ensemble has the poorest performance at each selected lead day. The post-processed forecasts are consistent with the raw ensemble, but perform much better in reducing the CRPS of the raw ensemble, particularly for shorter lead times. The two machine learning methods achieve the best and similar results on most days during the validation periods. The improvements are stable even at long lead times.   Figure 6 shows that the raw ensemble has the poorest performance at each selected lead day. The post-processed forecasts are consistent with the raw ensemble, but perform much better in reducing the CRPS of the raw ensemble, particularly for shorter lead times. The two machine learning methods achieve the best and similar results on most days during the validation periods. The improvements are stable even at long lead times.  Figure 7 shows the spatial distributions of the CRPSS over East Asia for the 2-m maximum air temperature using different post-processing methods, taking the raw ensemble as the reference forecast. Great improvements are achieved by all the post-processing methods in week 1. Most of the grid points in the study area obtain a positive CRPSS, especially those on the Tibetan Plateau. The two machine learning methods reduce the negative CRPSS area more than the EMOS forecast, which indicates that the NGBoost and neural network methods are practicable in broader regions. In week 2 and later, the EMOS forecast skill decreases rapidly and performs poorly in most of Russia, Mongolia and mid-eastern China. However, the two machine learning methods still perform well in those regions. Although the improvements decrease, the post-processed forecast of the two machine learning methods give a positive CRPSS for most grid points in the study area. The neural network forecasts perform better in the northwest of China and most of Russia than the NGBoost outputs. The results indicate that these two machine learning methods could make improvements in complex terrains (such as mountainous areas, deserts) which are often of poor prediction skills. We consider that the improvements describe the nonlinearity to some extent as the thermodynamics and dynamics processes often have.  Figure 7 shows the spatial distributions of the CRPSS over East Asia for the 2-m maximum air temperature using different post-processing methods, taking the raw ensemble as the reference forecast. Great improvements are achieved by all the post-processing methods in week 1. Most of the grid points in the study area obtain a positive CRPSS, especially those on the Tibetan Plateau. The two machine learning methods reduce the negative CRPSS area more than the EMOS forecast, which indicates that the NGBoost and neural network methods are practicable in broader regions. In week 2 and later, the EMOS forecast skill decreases rapidly and performs poorly in most of Russia, Mongolia and mid-eastern China. However, the two machine learning methods still perform well in those regions. Although the improvements decrease, the post-processed forecast of the two machine learning methods give a positive CRPSS for most grid points in the study area. The neural network forecasts perform better in the northwest of China and most of Russia than the NGBoost outputs.

Spatiotemporal Characteristics
The results indicate that these two machine learning methods could make improvements in complex terrains (such as mountainous areas, deserts) which are often of poor prediction skills. We consider that the improvements describe the nonlinearity to some extent as the thermodynamics and dynamics processes often have. The performance of the neural network is similar to that of NGBoost and thus the best model distribution is introduced to provide a simple comparison. The maps with the best-performing models, in terms of the mean CRPS for each land grid point at different lead weeks, are shown in Figure 8. The two machine learning methods provide the best predictions for most of the grid points. Regardless of the lead week, the two machine learning methods perform better at >90% of the grid The performance of the neural network is similar to that of NGBoost and thus the best model distribution is introduced to provide a simple comparison. The maps with the best-performing models, in terms of the mean CRPS for each land grid point at different lead weeks, are shown in Figure 8. The two machine learning methods provide the best predictions for most of the grid points. Regardless of the lead week, the two machine learning methods perform better at >90% of the grid points, with clearer advantages at longer lead times. The NGBoost method performs better in week 1, whereas the neural network method is better in week 2. The two machine learning methods are comparable for longer lead times. For most of the areas, such as the Tibetan Plateau, the Yunnan-Guizhou Plateau, the Loess Plateau, the Inner Mongolia Plateau, the central Siberian Plateau and the East Siberian mountainous area, neural networks behave as the best model despite the overwhelming performance in week 1. There are also some differences over the Tibetan Plateau, for instance, the best model over the west of it tends to be changing from the neural network forecasts to that of NGBoost. In the meantime, over the areas including the northwest of China where the terrains comprising mountains, deserts and basins (in the Xinjiang Province), the neural networks model behaves better, while the NGBoost method dominates in the basins, such as the Qaidam Basin in Qinghai Province and the Sichuan basin in the southeast center of China. For the most populated areas, such as the plains and hilly areas in China, neural networks and NGBoost perform competitively. In all, these two machine learning methods have their own benefits in different areas, but both of them are good at dealing with the complex terrains and adjusting the ensemble spread to provide more reasonable forecasts.

Conclusions
We compare the performance of the EMOS method and other two methods based on machine learning (a neural network and NGBoost) in extended range temperature probabilistic forecasts over East Asia. The EMOS method has been widely used and is already able to reduce systematic bias and is therefore used as the benchmark for the machine learning methods. Both the neural network and NGBoost improve the forecast skills compared with EMOS over the entire study domain and for each forecast lead time.
For convenience, the lead time is separated into five groups from the first lead week to the fifth lead week and the scoring rules are averaged over the lead time intervals. From the perspective of deterministic forecast, MAE and RMSE are used to assess the accuracy of forecasts from the raw ensembles, EMOS, NGBoost and the neural network. The neural network and NGBoost outperform both the raw ensemble and EMOS outputs and the neural network is slightly better than NGBoost. In terms of convenience and reliability, the neural network is more appropriate and superior for extended range temperature forecasts. However, NGBoost is more flexible and is designed to develop probabilistic forecasts. From the perspective of probabilistic forecasts, both the neural network and NGBoost are superior in terms of CRPS and coverage at the 88.33% prediction interval over the raw ensemble outputs and EMOS results. NGBoost shows advantages over the neural network from the first to third lead weeks in CRPS and the entire forecast's lead weeks in terms of coverage. The raw ensemble forecasts perform better in terms of the under-dispersive spread but give poorer scores. After calibration, EMOS, the neural network and NGBoost expand the spread into a more correct status.
The spatiotemporal distributions are investigated over multiple lead times. The raw ensemble forecasts perform the worst, offering a space for EMOS, the neural network and NGBoost to make progress. The gaps among the three methods decrease with increasing lead times from seven to 28 days. A greater improvement is seen for shorter lead times. The CRPSSs are characterized by spatial variance but are still practical and improved. All the three methods show remarkable forecast skill over the most areas, especially over the Tibetan Plateau. Both the neural network and NGBoost outputs perform better than the EMOS output over East Asia. With increasing forecast lead times, the forecast skill of EMOS decreases remarkably from the first to the second lead week, whereas the other two machine learning methods show sustainably better forecast skills.
The NGBoost and neural network give an outstanding performance in extended range probabilistic forecasts over East Asia for the variable of interest, which can be fitted by a normal distribution. The neural network and NGBoost are well matched. It is difficult to distinguish which one of these two methods is better and we therefore introduced the best model distribution to provide a simple comparison. The neural network and NGB account for almost 90% of the area and they perform better in different lead weeks. However, further investigation of the machine learning applications still required for those variables with a non-normal distribution (e.g., precipitation, wind speed and wind direction).

Discussion
This paper demonstrates the application of two machine learning methods, the neural networks and NGBoost, to the extended range 2-m maximum air temperature probabilistic forecasts. Following the EMOS frame proposed by Gneiting et al. [11], the neural networks and NGBoost are two parametric methods which directly calibrate the individual raw ensemble members by optimizing a proper scoring rule. However, neural networks and NGBoost show advantages in model robustness and the searching for optimized parameters.
As shown in the flowchart of neural networks (see Figure 1), a neural network consists of two main parts, the forward and backward propagation. For a given forecast time and station, the inputs, the 11 raw ensemble forecasts, are multiplied by m (the node number in the hidden layer) randomly initialized weights W and then activated by a nonlinear function ReLU in the forward propagation to predict the mean µ and standard deviation σ. Over the training dataset, the mean CRPS is computed as the scoring rule by the predictive µ and σ with corresponding observations. In our study, the Adam optimization is applied to gradient descent in the backward propagation for its learning rate varies with training which helps accelerate the convergence.
It is notable that the CRPS is not a common lose function in machine learning. Since the traditional neural networks are incapable for probabilistic forecasting, here we introduce a close form of the CRPS for a Gaussian distribution which helps to extend the neural networks to probabilistic temperature forecast. Furthermore, the inputs of neural networks are more arbitrary, which can add auxiliary predictors to improve prediction skills [24].
This study tackles probabilistic temperature forecasting in atmospheric sciences using NGBoost. It helps make up the gap of gradient boosting method (GBM) in generic probability prediction. The outstanding performance of NGBoost in our study confirms its ability to solve the practical problems. It is a promising machine learning method for probabilistic forecasting for its flexibility and modularization.
Different to neural networks in the forward propagation, NGBoost introduces a weak base learner (for instance, the tree model) to replace the activation function of the neural network frame. By integrating the base learners, we obtain the parameters µ and σ of the predictive PDFs. Another difference lies in the parameter optimization where the natural gradient is introduced in NGBoost to overcome the difficulty of simultaneous boosting the two predictive parameters (µ and σ) from the base learners, which is a challenge to the existing GBMs. Furthermore, the natural gradient helps to reflect how the space of distribution moves when updating [25].
For the probabilistic temperature forecast here, the results of bias correction using neural networks and NGBoost are remarkably superior to the traditional EMOS. Neural networks and NGBoost could represent a high nonlinear relationship which is deficiently described in the traditional linear models. Maybe this is the reason why machine learning methods applied in this study perform better. Plus, the machine learning methods are more flexible and the objective function could be adjusted according to the problems to be solved. Compared with the neural networks, the NGBoost is a more integrated and modular method which is especially advantaged with a small training dataset. However, the NGBoost is limited in some skewed probability distribution, for instance, precipitation and wind speed. The neural networks are more suitable for the flexible cases which have more arbitrary predictors and strong nonlinearity.