Article

Site-Specific Deterministic Temperature and Dew Point Forecasts with Explainable and Reliable Machine Learning

1 Bureau of Meteorology, 32 Turbot St, Brisbane City, QLD 4000, Australia
2 Bureau of Meteorology, 700 Collins St, Docklands, VIC 3008, Australia
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(14), 6314; https://doi.org/10.3390/app14146314
Submission received: 11 June 2024 / Revised: 16 July 2024 / Accepted: 17 July 2024 / Published: 19 July 2024
(This article belongs to the Topic Artificial Intelligence Models, Tools and Applications)

Abstract

Site-specific weather forecasts are essential for accurate prediction of power demand and are consequently of great interest to energy operators. However, weather forecasts from current numerical weather prediction (NWP) models lack the fine-scale detail to capture all important characteristics of localised real-world sites. Instead, they provide weather information representing a rectangular gridbox (usually kilometres in size). Even after post-processing and bias correction, area-averaged information is usually not optimal for specific sites. Prior work on site-optimised forecasts has focused on linear methods, weighted consensus averaging, and time-series methods, among others. Recent developments in machine learning (ML) have prompted increasing interest in applying ML as a novel approach towards this problem. In this study, we investigate the feasibility of optimising forecasts at sites by adopting the popular machine learning model “gradient boosted decision tree”, supported by the XGBoost package (v.1.7.3) in the Python language. Regression trees have been trained with historical NWP and site observations as training data, aimed at predicting temperature and dew point at multiple site locations across Australia. We developed a working ML framework, named “Multi-SiteBoost”, and initial test results show a significant improvement compared with gridded values from bias-corrected NWP models. The improvement from XGBoost (0.1–0.6 °C, 4–27% improvement in temperature) is found to be comparable with non-ML methods reported in the literature. With the insights provided by SHapley Additive exPlanations (SHAP), this study also tests various approaches to understand the ML predictions and increase the reliability of the forecasts generated by ML.

1. Introduction

Despite continual improvements in Numerical Weather Prediction (NWP) over several decades, with high skill in forecasts of some parameters out to a week or more ahead, shortcomings in weather forecasts remain. These include model biases, random errors, and representativeness errors, which tend to grow with forecast lead time. Systematic biases can be corrected through post-processing [1], though random errors remain, so there will always be a level of uncertainty in a forecast. Deterministic, or single-value, forecasts produce one estimate of a forecast value, so ensembles of multiple forecasts, designed to capture the spread in possible outcomes, are increasingly being used to produce probabilistic forecasts [2]. NWP forecasts typically represent a discrete area at least several kilometres in size, although this is decreasing as processing power grows. Model output at any current scale may not be representative of a specific site due to complex topography, land cover, urbanisation, or other local factors. A means of calibration between model forecasts and measured values at specific sites can improve the representation of those locations. Methods for such calibration include those which alter forecast probability density functions to match those of the observed values [3,4,5].
In addition to conventional statistical methods, various machine learning (ML) models have also been experimented with for the task of site-specific weather forecasting. Site-specific weather forecasts are optimised for a specific individual weather-observing station. The term ‘local’ weather forecast may refer to a site-specific forecast or to a small ‘local’ geographic area such as a suburb. Some of the existing studies use time series of site observation data as the only input to generate forecasts for a single time step or multiple time steps. For example, XGBoost models have been applied to predict solar irradiance [6]; Long Short-Term Memory (LSTM) and its variations such as transductive LSTM [7] and convolutional LSTM [8] have been used for temperature forecasts. These forecast models take site measurements of multiple relevant variables as input and formulate a multi-variate training dataset. Apart from purely observation data-driven models, ML can also be used as a post-processing approach for NWP—one which generates better local forecasts than NWP grid values. These models ingest both past NWP forecast and observation data as input, so the trained model can identify the systematic bias within the NWP model for the sites and generate a more accurate forecast once the latest NWP data becomes available [9,10,11].
Among the various ML models tested in the literature, XGBoost [12] stands out for its computational efficiency, robustness and accuracy over alternative models. Previous studies have shown that XGBoost delivers better accuracy on tabular data, a data form often encountered in site-based forecasting. Grinsztajn et al. [13] argue that the better performance of tree-based models (XGBoost, Random Forest, etc.) compared with deep neural networks may be attributed to their robustness towards uninformative features and the fact that XGBoost does not suffer from the inductive bias toward smooth solutions, as is the case with deep neural networks. Successful applications of XGBoost that yield satisfactory forecast accuracy have been reported for various scenarios and forecast variables, including temperature, precipitation [14], wind power [15], solar radiance [6] and wave heights and periods [16]. In these studies, the size of the training dataset varies from the scale of 10³ to 10⁴, and training times can be under 1.0 s. The fast training of XGBoost enables hyperparameter tuning (the optimization of the parameters of a ML model for potentially best performance) to optimize model accuracy, and allows for more frequent model updating and retraining, thereby utilising the latest available data in model training. Owing to the satisfactory performance of XGBoost as already benchmarked and tested in previous studies, and its superior computational efficiency in training compared with deep learning models, XGBoost is selected for the current study without further model selection.
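As a concrete illustration of this setup, the following is a minimal sketch (not the authors' code) of fitting an XGBoost regressor on tabular NWP-style features; the column names and synthetic data are illustrative placeholders.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "T_C": rng.normal(20, 5, n),   # NWP temperature at the site grid cell (placeholder)
    "Td_C": rng.normal(12, 4, n),  # NWP dew point at the site grid cell (placeholder)
    "U10": rng.normal(0, 3, n),    # 10 m wind components (placeholder)
    "V10": rng.normal(0, 3, n),
})
y = X["T_C"] + 0.1 * X["U10"] + rng.normal(0, 1, n)  # stand-in for site observations

model = xgb.XGBRegressor(
    n_estimators=300, max_depth=4, learning_rate=0.05,
    objective="reg:squarederror",
)
model.fit(X, y)  # on ~10^3 rows, training typically completes in about a second
print(model.predict(X.head()))
```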
With the constant development of bigger ML models, and the growing size of the datasets used to train them, the complexity of the models has also been growing significantly. This has continued to the level that the internal structure of a model is no longer easily probed and the predictions of the models are more difficult to interpret. The low interpretability leads to complex models being considered black boxes and trusted less by customers, even if their accuracy has been validated on test datasets. The need to better understand, explain and interpret the outcomes of ML models has led to growing research interest in the field of Explainable Artificial Intelligence (XAI) [17,18]. For weather forecasting models, it is also necessary to investigate and confirm that the causality learned by ML models agrees with that of physics-based models and the experience of operational weather forecasters. So far, various explanatory methods have been proposed. On the scale of the entire dataset, it is possible to understand the ML model in terms of global feature importance. However, the disadvantage of feature importance is that it is calculated on the entire training dataset and cannot perfectly explain the reasoning behind each individual output of the model. This has prompted the proposal of local explanation methods, among which SHapley Additive exPlanations (SHAP) has been prominent [19]. As a model-agnostic approach, SHAP has been used for local explanations for different model architectures, and the insights provided by SHAP are also used for model debugging, model health monitoring and data drift detection [20,21]. In the area of weather forecasting, the application of SHAP or other local explanation methods is still rare, with most studies only presenting the evaluation metrics of the trained models. However, recent studies have started to adopt SHAP for model explanation. For example, SHAP is used to explain the spatio-temporal patterns picked up by LSTM models for drought forecasts [22,23], and the explanations are compared with physics-based models.
It should be noted that while SHAP helps explain the decision-making process towards an individual output, it does not provide insight into the correctness or accuracy of that output. The need to understand the quality of each prediction point from a ML model for better risk awareness and management prompts the study of reliable machine learning. It is understood that the imperfections of a ML model can originate from the quantity and quality of data, and model architecture [24]. With insights into data properties and ML mechanisms, it is sometimes possible to identify those individual predictions with a higher risk of being misclassified or inaccurate and therefore less reliable than a global metric implies. The principles of pointwise reliability have been proposed by [25], and various methodologies to evaluate reliability have been investigated by [26,27,28]. The real-time response mechanism to unreliable predictions is also integrated into some ML systems. For example, an abstention option is enabled in some classification models so that when a classification is identified as unreliable, the model abstains without providing an output [29].
This paper presents the application of XGBoost as a post-processing approach for NWP grid data to produce better site-specific temperature and dew point forecasts. Forecast accuracy is evaluated both on common metrics and customized ones based on customer interests, and the results show that significant improvement can be achieved compared with original gridded data. The predictions generated by XGBoost are explained by SHAP so that additional insights into the trained model can be obtained. As a further step for reliable ML, the study also explores various ways to evaluate pointwise reliability of the forecasts, so that those parts of the predictions that are subject to higher risk of inaccuracy are identified at the time the prediction is made. As shown in the literature review, studies that include a reliability check for ML results in weather forecasting are relatively rare.

2. Data Overview

2.1. Gridded Numerical Data (IMPROVER)

The gridded numerical data used in this study are from the state-of-the-art probabilistic ensemble post-processing and verification system known as IMPROVER (Integrated Model post-PROcessing and VERification). IMPROVER ingests and blends raw forecasts from various NWP models, applies gridded calibration, and generates fully probabilistic forecasts [30]. The Australian version of IMPROVER developed by the Bureau of Meteorology has a grid covering the Australian region and adopts a Cartesian coordinate system. The gridded data have a spatial resolution of 4.8 km at surface level (2 m above ground level) and 9.6 km on upper levels. The model generates hourly forecasts every 6 h (4 issues a day) with lead hours up to 192. Figure 1a shows a snapshot of the IMPROVER gridded forecast temperature at screen level on 24 August 2022 with lead hour 2.0, illustrating the extent of the spatial coverage of the IMPROVER data.
Since the current study focuses on deterministic forecasts only, the expected value of the IMPROVER probabilistic forecast at each location is extracted for training ML models (early investigations showed that straightforward ensemble approaches neither improved forecast skill nor properly captured the distribution; further research is left for future work). In this study, the grid values of selected variables at 11 Australian sites (Figure 1b) up to lead day 7 are extracted from each IMPROVER forecast issue, forming hourly time series as shown in Figure 1c. The variables selected as features are temperature at screen level (T), dew point temperature at screen level (Td), and wind vector components u and v at 10 m height. The wind vectors are selected as features because wind, particularly at coastal locations, is known to affect the temperature in the local region both through advection (transporting air from a cooler or warmer region) and through vertical processes which are not fully resolved in the upstream model. As such, conditioning the temperature and dew point predictions on the wind u and v components is likely to improve the model’s performance. Upper-layer variables are not included since the spatial resolution of those variables is lower. Available IMPROVER data at the time of this study range from August 2022 to May 2023 (incorporating Spring, Summer and Autumn). The sites selected for this study are scattered across different parts of Australia to ensure that ML model accuracy is not location dependent. Rainfall climatology varies significantly across these sites, which include hot, dry locations, colder and wetter locations, and varying coastal proximity. The pointwise IMPROVER root-mean-square error (RMSE) of same-day forecasts at the selected sites over the studied period ranges from 1.2 to 2.3 °C for temperature, and 1.2 to 2.9 °C for dew point temperature.
One important property of IMPROVER forecasts that affects the ML training strategy is that the IMPROVER forecast error generally grows with lead hour. This is expected, as NWP models derive much of their predictability from the initial conditions and so gradually lose accuracy with longer forecast lead time. To quantify and visualize this effect, Figure 2a shows the IMPROVER temperature forecast (TN) at a single site in October plotted against the observed temperature for each forecast (TO). The scatter plot shows that while numerical forecasts on both lead day 0 (2–24 h) and lead day 8 (168–192 h) exhibit a positive correlation with observed values, the forecast at lead day 0 is more accurate and less scattered. Figure 2b shows the distribution of forecast error for the same data points, indicating that the forecast errors on lead day 8 have a slightly higher absolute mean and twice the standard deviation of forecasts on lead day 0.

2.2. Site Observations

Observational data of selected sites are extracted from the Australian Data Archive for Meteorology (ADAM) database. ADAM stores meteorological observations from Bureau of Meteorology-managed observing systems over mainland Australia and from neighbouring islands, the Antarctic, ships and ocean buoys. The stored observation data are quality-controlled. While on-site measurements inevitably contain random errors, the extracted temperature and dew point data are considered ground truth in this study. The underlying instruments collect data each minute, and the data are reprocessed into ten-minute mean values, valid on-the-hour.

3. Model Training, Optimization and Explainability

3.1. Model Training and Optimization

As stated in the Introduction, XGBoost is selected for this study. The extracted time series data at each site location shown in Figure 1 can be directly used for training. However, to optimize the accuracy of the XGBoost model, proper pre-processing of the grid value from IMPROVER is necessary. Prior to training, the following pre-processing approaches are applied to the site data. It should be noted that the pre-processing techniques adopted in this study are not exhaustive and there are alternatives in the literature. The ones in this study are selected because they show the best performance in the parametric study.
Inclusion of surrounding grid values: during data extraction from the IMPROVER gridded dataset, the grid value at the site location plus those around the location are extracted and used as separate input features. More specifically, for each site, the values on a 3 × 3 grid are extracted. The grid is centred on the site location and each nearby point is five grid cells away, as shown in Figure 3. Based on the orientation angle, the values of the eight nearby points are marked by subscripts NW, W, SW, etc., while the central value carries the subscript C. Feature names hereinafter follow this convention, and a sketch of the extraction follows below. As shown in the parametric studies below, the inclusion of these neighbouring cell values can greatly increase the model’s accuracy. These values are also physically important, since the spatial variation of a variable indicates the instantaneous spatial flux of that variable. The spacing between the neighbouring points is selected empirically, but a general rule applies: points that are too close add highly correlated features to the training set, while points that are too distant add irrelevant ones.
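The sketch below illustrates this 3 × 3 neighbour extraction under stated assumptions: the array layout, (row, column) axis convention and function names are ours, not the IMPROVER API.

```python
import numpy as np

# Offsets (rows, cols) of the central cell and the eight neighbours,
# each five grid cells away; the axis convention is an assumption.
OFFSETS = {"C": (0, 0), "N": (-5, 0), "S": (5, 0), "E": (0, 5), "W": (0, -5),
           "NE": (-5, 5), "NW": (-5, -5), "SE": (5, 5), "SW": (5, -5)}

def extract_neighbourhood(grid: np.ndarray, iy: int, ix: int, name: str) -> dict:
    """Return features such as T_C, T_NW, ... from a 2-D field."""
    return {f"{name}_{tag}": float(grid[iy + dy, ix + dx])
            for tag, (dy, dx) in OFFSETS.items()}

field = np.random.rand(100, 100)  # placeholder 2-D temperature field
features = extract_neighbourhood(field, 50, 50, "T")
```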
Data selection: As shown in Figure 2a, IMPROVER data with longer lead times can be considered as containing the same information but mixed with higher random error. Since good data quality in the training set is essential to building an accurate ML model, only data from lead day 0 are used for training; the four forecast issues per day on lead day 0 (shown in Figure 1c) serve as data augmentation. The model is expected to learn the essential data pattern that relates to the locational characteristics of the site, rather than the evolution of the forecast over time, and thus predictions for all lead days are generated using the same model.
Model inputs and outputs: The input variables are air temperature at ground level (including surrounding grid points), wind at 10 m at the station location, expressed as U10 and V10 components, and dew point temperature (including surrounding grid points). The output variables are air temperature and dew point. An individual model is trained for each station and output (i.e., each model predicts a single variable for a single station).
Automatic feature selection: In Section 2.1, selected variables are extracted from IMPROVER grids as features. As shown in Figure 3, temperature and dew point are extracted at nine points each, which adds up to 23 features in total. However, not all features are equally relevant for each site, and relevance may also vary between training runs. To remove irrelevant features from the training set, a pre-training run with all features included is first performed for each station; the 10 features with the highest global statistical importance are then used for subsequent training, while the less important features are removed (see the sketch below). This proves to slightly increase the accuracy of the model and, more importantly, reduces the volume of data needed for validation and for generating predictions.
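A minimal sketch of this pre-training screen, assuming scikit-learn-style DataFrames; the function name and hyperparameters are illustrative assumptions.

```python
import pandas as pd
import xgboost as xgb

def select_top_features(X: pd.DataFrame, y, k: int = 10) -> list:
    """Pre-train on all features, keep the k with highest global importance."""
    pre = xgb.XGBRegressor(n_estimators=200, max_depth=4)
    pre.fit(X, y)
    importance = pd.Series(pre.feature_importances_, index=X.columns)
    return importance.nlargest(k).index.tolist()

# top_cols = select_top_features(X_train, y_train)
# final_model.fit(X_train[top_cols], y_train)
```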
Scaling: XGBoost models do not perform extrapolation [31] and therefore may not generate accurate predictions for feature values that lie outside the bounds of the training set. This is detrimental for the prediction of extreme values, such as new record high temperatures. To improve performance, the original data are column-wise scaled to a standardised distribution for the training and validation sets. Note that for validation sets, the mean and variance are calculated from the data of the last week of the training period instead of the whole training period. This matches the operational use case, whereby the forecast period is in the future and, as an unknown, the exact scaling factor cannot be known in advance. Using the most recent week aims to ensure that the calculated mean is the closest estimate for the upcoming validation period. This process may prevent inaccurate mean values being applied to the validation set when the daily average temperature is steadily increasing. With this process, the bounds of the two sets mostly overlap with each other. The effect of scaling is quantified in detail in Table 1.
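A sketch of this scaling step, assuming hourly data in pandas DataFrames; the helper name and week length are our assumptions.

```python
import pandas as pd

def scale_with_recent_stats(train: pd.DataFrame, valid: pd.DataFrame,
                            week_hours: int = 7 * 24):
    """Column-wise standardisation using statistics of the final training week."""
    recent = train.tail(week_hours)  # last week of the training period
    mu, sigma = recent.mean(), recent.std()
    return (train - mu) / sigma, (valid - mu) / sigma
```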
In addition, the current study adopts a sliding window approach: new models are continuously trained on a nine-week training set and tested on the subsequent week. The advantage is that the most recent weather patterns can be captured. The sliding window also exploits the fast training of XGBoost to avoid data drift due to infrequent model updates. Hyperparameters are tuned with standard 5-fold cross-validation.
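A minimal sketch of the sliding-window schedule described above, assuming a pandas DatetimeIndex; the generator interface is an illustrative assumption.

```python
import pandas as pd

TRAIN_WEEKS, TEST_WEEKS = 9, 1

def sliding_windows(index: pd.DatetimeIndex):
    """Yield (train_mask, test_mask) pairs stepping forward one week at a time."""
    start = index.min()
    while True:
        train_end = start + pd.Timedelta(weeks=TRAIN_WEEKS)
        test_end = train_end + pd.Timedelta(weeks=TEST_WEEKS)
        if test_end > index.max():
            break
        yield ((index >= start) & (index < train_end),
               (index >= train_end) & (index < test_end))
        start += pd.Timedelta(weeks=TEST_WEEKS)

# for train_mask, test_mask in sliding_windows(df.index):
#     retrain on df[train_mask], evaluate on df[test_mask]
```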
Table 1 shows a parametric study based on temperature predictions over a one-month period. Each column shows the resultant change in the average hourly RMSE caused by different data pre-processing settings. The change in RMSE is calculated by weighting all lead times equally, and the sum in each column aggregates the changes from all tested sites. The results show that the prediction with all data pre-processing techniques applied has the lowest RMSE, and omitting any of the pre-processing steps would lead to higher error.
As a first step towards a working prototype of a production system, a framework for automatic downloading, data pre-processing, ML training and validation is developed and named Multi-SiteBoost (MSB). The framework comprises three parts: (1) gridded IMPROVER data and site observation data are downloaded and stored, and data for the selected sites and features are extracted from the large volume of data into time series; (2) an XGBoost model is automatically trained for each site and each sliding window in parallel; (3) the saved models are used to generate predictions, compute metric statistics and evaluate pointwise reliability for each of the predictions. Figure 4 shows the flow chart of the MSB workflow. The ML results discussed in the following sections are all generated by the MSB framework.

3.2. SHAP for ML Explainability

As discussed in the Introduction, feature importance is calculated globally and does not provide a local explanation of how every single prediction is made by the model. Therefore, SHAP value is adopted in this study for local explanation. Being an additive explanation approach, SHAP value fairly attributes the final prediction at a single sample point to each of the participating features [19]:
$$ g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i $$

where $g(z')$ is a pointwise model prediction, $M$ is the total number of features, $z'_i \in \{0, 1\}$ represents the feature being missing ($z'_i = 0$) or present ($z'_i = 1$), and $\phi_i$ is the SHAP value of feature $i$. Referring to [19], the SHAP value can be calculated as follows, which ensures a fair attribution of feature effects among all features and a close approximation to the local prediction value:

$$ \phi_i = \sum_{S \subseteq M \setminus \{i\}} \frac{|S|! \, (|M| - |S| - 1)!}{|M|!} \left[ f_x(S \cup \{i\}) - f_x(S) \right] $$

where $S$ is a subset of features, $M$ the set of all features, $f_x(S \cup \{i\})$ the model prediction with all features in $S$ plus feature $i$, and $f_x(S)$ the prediction with the features in $S$ only. The SHAP value method is further extended by [32], which decomposes the SHAP value of each feature into a main effect and interaction effects, so that:

$$ \phi_i = \phi_{i,i} + \sum_{j \neq i} \phi_{i,j} $$

where $\phi_{i,i}$ is the main effect of feature $i$, and $\phi_{i,j}$ is the SHAP interaction value of a feature pair $(i, j)$, with $j$ a feature different from $i$. Each SHAP interaction value $\phi_{i,j}$ is calculated by:

$$ \phi_{i,j} = \phi_{j,i} = \sum_{S \subseteq M \setminus \{i, j\}} \frac{|S|! \, (|M| - |S| - 2)!}{2 (|M| - 1)!} \, \nabla_{ij}(S) $$

$$ \nabla_{ij}(S) = f_x(S \cup \{i, j\}) - f_x(S \cup \{j\}) - f_x(S \cup \{i\}) + f_x(S) $$
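In practice, these quantities can be computed with the shap package's TreeExplainer, which implements fast exact algorithms for tree ensembles. A minimal sketch follows, assuming `model` and `X` are the fitted XGBoost regressor and feature table from the earlier sketches.

```python
import shap

explainer = shap.TreeExplainer(model)                # 'model': a fitted XGBRegressor
shap_values = explainer.shap_values(X)               # phi_i per feature per sample
interactions = explainer.shap_interaction_values(X)  # phi_{i,j} matrix per sample

shap.summary_plot(shap_values, X)  # the kind of summary plot shown in Figure 7a
```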

4. Experiments

4.1. Evaluation Metrics

The primary metric and the training objective of the XGBoost model is the mean squared error of hourly data. However, in practice there may be alternative metrics that are more important for different customer scenarios. To understand how the predictions from ML models affect other metrics, the mean absolute error (MAE) of hourly data and the RMSE of daily maximum and minimum values are also calculated separately. A daily maximum or minimum is defined as the maximum or minimum hourly temperature within a single day. It should be noted that this metric does not consider whether the model can accurately predict the timing of the daily extreme temperature. In addition, a critical error rate is defined as the number of hourly predictions with absolute error higher than 2.0 °C, divided by the total number of predictions made within the validation period.
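A minimal sketch of these metrics, assuming hourly pandas Series indexed by timestamp; the function and variable names are illustrative.

```python
import numpy as np
import pandas as pd

def forecast_metrics(pred: pd.Series, obs: pd.Series, critical: float = 2.0) -> dict:
    """Hourly RMSE/MAE, daily-extreme RMSE and critical error rate."""
    err = pred - obs
    dmax_err = pred.resample("D").max() - obs.resample("D").max()
    dmin_err = pred.resample("D").min() - obs.resample("D").min()
    return {
        "rmse_hourly": float(np.sqrt((err ** 2).mean())),
        "mae_hourly": float(err.abs().mean()),
        "rmse_daily_max": float(np.sqrt((dmax_err ** 2).mean())),
        "rmse_daily_min": float(np.sqrt((dmin_err ** 2).mean())),
        "critical_error_rate": float((err.abs() > critical).mean()),
    }
```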

4.2. Results

Figure 5a,b demonstrate the evaluation metrics (hourly error, daily maximum and minimum errors, and critical error rate as defined in Section 4.1) calculated on 6 months of IMPROVER and MSB hourly data at a single site: Hobart (27 October 2022 to 27 April 2023). This site is presented because its results best illustrate the general data patterns discussed below. For all the figures that follow, ‘TMP’ refers to temperature and ‘DEW’ refers to dew point. The results for the other sites are summarized in Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6. For hourly and daily extreme data, both RMSE and MAE are calculated and plotted with green and blue lines, respectively. Metrics are calculated on each forecast lead day separately. The results show that for this site, the employed ML model reduces RMSE, MAE and critical error rate at all lead days. For the RMSE of hourly data, the percentage improvement on each lead day is similar, which shows that the ML model, although trained only on lead day 0 data, can be applied to the prediction of all lead days. For most of the evaluated metrics and lead days of the daily maximum and minimum predictions, the figures show that MSB has better accuracy than IMPROVER, though exceptions exist, such as the daily maximum dew point temperature on lead days 6 and 7. In addition, the results suggest that the improvements in RMSE and MAE follow a similar trend, and therefore all subsequent discussions focus primarily on RMSE.
Table 2 and Table 3 summarize the RMSE of hourly and daily maximum and minimum values and the percentage of critical error calculated for each site across all lead days. The change of each metric relative to IMPROVER is also included in Table 2 and Table 3, where a negative change means a lower RMSE or critical error percentage for MSB results compared with IMPROVER. The results show that MSB reduces hourly RMSE at all sites by up to 1.05 °C, but the effect varies widely with site location. In general, sites where the forecast error is already low in IMPROVER (e.g., Avalon Airport) benefit less from MSB than those with originally higher forecast error (e.g., Geraldton Airport). In addition, as shown in Figure 5a,b, the reduction in forecast error varies considerably with forecast lead day, so the actual reduction on each lead day may differ significantly from the averaged reduction.
Table 4 and Table 5 summarize the improvement in prediction accuracy at all sites and all lead days, in terms of the percentage reduction in RMSE of MSB compared with IMPROVER predictions. In this way, a positive percentage in the tables indicates a lower RMSE of the MSB data and an improvement over IMPROVER. For the sake of brevity, the other metrics discussed in Section 4.1 are listed in Appendix A. As shown in the tables, most of the calculated metrics are positive, indicating a general forecast improvement at the tested sites due to the applied ML models. However, the tables also show variation in ML skill across the selected sites. Even for hourly RMSE on lead day 0, which is the training objective, the percentage RMSE reduction ranges from 0.54% (Avalon Airport, dew point temperature) to 37.13% (Geraldton Airport, dew point temperature). For Curtin Aero, the MSB predictions are worse than the input IMPROVER data, as the RMSE of the MSB predictions is 6.46% higher than that of IMPROVER. Averaged over all sites and all lead days, MSB reduces hourly RMSE by 11.35% for temperature and 12.28% for dew point. The critical error rate is reduced by 5.60% for temperature and 6.19% for dew point. The improvement in RMSE and the percentage reduction in temperature are comparable to post-processing studies using non-ML methods such as [33,34].
The variation in ML skill across site locations highlights the importance of localised model training and tuning. In the data pre-processing, most of the parameters are site-agnostic and optimized based on the averaged performance over all selected sites, and all sites share the same pool of available features. This reduces training time, simplifies the data pipeline and ensures good overall performance, but it does not guarantee that the model is tuned to its maximum performance at each site. Should customer needs arise for the optimization of a particular site, further experiments on input features and parameters for the specific sites of concern should be performed.
Figure 6 shows histograms of the pointwise forecast error distribution of IMPROVER and MSB for a single site on lead day 0. The results show that the error distribution of the MSB data is more zero-centred and narrower, which implies both a smaller mean error and a lower critical error rate. However, the figure also shows that the MSB error distribution is not perfectly Gaussian, which implies that systematic error still exists even though the overall performance is improved compared with the IMPROVER grid data.

5. Explaining Model Outputs with SHAP

Figure 7a shows the SHAP summary plot on the scaled training set of a temperature forecast model. In the figure and all the analysis that follows, the temperature T and dew point Td features extracted at the different orientations shown in Figure 3 are denoted TW, TSE, TdW, TdSE, etc., with subscripts indicating the orientation. In the summary plot, the X axis is the SHAP value, and the 10 features used by this model are arranged along the Y axis. In each row, every point represents a sample in the training set, coloured by feature value. The rows are arranged by the relative magnitude of their SHAP values in descending order. The summary plot shows that for this model, the four selected temperature features TW, TSE, TS and TSW have the largest overall contributions, indicated by the extent of their SHAP values. This agrees with the intuition that the NWP temperature predictions themselves are the most relevant inputs for the predicted temperature, although a physical explanation for the relative contributions of TW, TSE, TS and TSW has not been explored. Other supplementary features such as hour, wind and dew point at various locations generally have lower SHAP value magnitudes, implying that they make minor adjustments to each prediction. However, the rows of TdSE and TdS also show that for certain samples, a high dew point can substantially decrease the final predicted temperature. This effect is asymmetrical: a low dew point does not result in a large positive SHAP value.
Figure 7b is the global feature importance plot, calculated by averaging the SHAP value magnitudes over all samples for each feature. The resultant feature importance for this model agrees with the XGBoost feature importance calculated by maximum gain. Compared with Figure 7a, Figure 7b lacks information on the distribution of feature importance across individual samples. For example, it does not show the occasional significance of TdSE at certain points. This comparison highlights the value of the local explanation provided by SHAP as supplementary information in addition to global feature importance.
Figure 8 plots the SHAP values of all samples with respect to feature values for three features, TW, Hour (in UTC) and U10, in a model trained for temperature prediction. The plots show that the SHAP value of feature TW is approximately linear in the feature value. The vertical dispersion is minor, and each value of TW corresponds approximately to a single SHAP value. This agrees with Figure 2a; both can be interpreted as the NWP grid value being linearly correlated with the site observation. By comparison, the SHAP value of hour (defined as a discrete integer rather than a continuous variable) is not linearly correlated with the feature value, and one value of hour can correspond to multiple SHAP values. It can still be observed that hours 0–5 generally have positive SHAP values while hours 15–20 generally have negative ones. This agrees with observations and statistics, as the temperature at this site at UTC hours 0:00–5:00 (10:00–15:00 local time) and 15:00–20:00 is expected to be warmer and cooler than the daily average, respectively. The SHAP values of wind, by comparison, are highly nonlinear, multi-valued and difficult to interpret, which indicates a highly nonlinear relationship between wind speed and temperature, and strong interactions between wind speed and other features.
The interactions between features can be more clearly visualized by a SHAP interaction value matrix, as shown in the heatmap (Figure 9). In Figure 9, the diagonal terms are the main effects while the other terms are interaction effects between two features. The cells in the figure are coloured by the magnitude of the scaled SHAP interaction values. To better show the relative magnitude of interaction effects compared with main effects for each feature, each column in the plot is scaled separately so that the diagonal terms (the main effects of each feature) are scaled to 1.0. Consequently, the matrix is no longer symmetrical. The matrix shows that the interaction effects of the four temperature features TW, TSE, TS and TSW with other features are much smaller than their main effects. This agrees with Figure 7a in that the SHAP values of temperature depend almost entirely on the feature value. By comparison, the interaction effects are more significant for dew point and wind speed, as some of the interaction terms are of similar magnitude to the main effects.
The analysis of the relationship between each feature and its corresponding SHAP value reveals some insights into the mechanism of the temperature forecast model. The model output can be approximately described as a combination of a main component, formed from linear combinations of the selected NWP predictions of the main variable at different locations of the 3 × 3 grid, and some supplementary components of lesser magnitude. The supplementary components are related to features other than the predicted variable, such as hour, dew point and wind speed. The effects of these features are highly nonlinear but can be significant to the model predictions at some points. These two categories of features are hereinafter referred to as ‘linear features’ and ‘nonlinear features’, respectively. The dew point forecast models are found to exhibit similar patterns to the temperature forecast model, so the discussion is not repeated for them here.
As previously discussed, SHAP values are used for explaining and understanding model outputs, but do not include any information on the accuracy of the model. However, the insights can be used to debug and monitor the model’s behaviour. The next section will focus on analysing model error to increase ML reliability, in which SHAP values play an important role.

6. Error Analysis and Model Reliability

ML models are imperfect and often contain systematic errors, even if the level of accuracy is satisfactory when evaluated based on given error metrics. Recall that the error plot in Figure 6 shows that the distribution of MSB error is not perfectly Gaussian. Since the trained model generates single predictions without giving information on confidence interval, it is possible that the model seems over-confident even when the prediction is likely to be inaccurate. Therefore, it is important to analyse and understand model error so that model accuracy can be consistently monitored and improved. In this section, we aim to apply several error analysis approaches to the MSB framework, so that individual predictions that are considered unreliable and may be subject to higher error than average can be discovered at the same time the predictions are made. Hereinafter, an ‘unreliable prediction’ refers to an individual prediction point with prediction error higher than overall metrics. As a benefit, when a weather forecast from MSB is potentially unreliable, customers will be notified and advised to rely on alternative sources of information for these points.
This section presents the various methods incorporated into MSB to discover unreliable and potentially less accurate predictions before the observation data become available, so that alternative information sources can be sought in advance as a supplement to MSB predictions. The reliability study is only applied to forecasts with lead day 0, because forecasts with longer lead days can be subject to change over time as new observations become available. Since our goal is to provide foresight into the predictions, we do not apply posterior analysis approaches that require knowledge of the ground truth. We also aim to keep the process fully automatic and time efficient, so the adopted approaches are selected for their simplicity as well as their effectiveness. It should be noted that the studies in this section aim at identifying high-error predictions only, without separating the source of error into aleatoric and epistemic uncertainty.

6.1. Unreliable Predictions from Out-of-Bound Feature Values

As already discussed in Section 3, XGBoost models do not have the ability to extrapolate, which means that if a feature value is not within the range of the training set, the effect of the feature may not be accurately accounted for by XGBoost. Data scaling is applied during pre-processing to counter this problem, as described in Section 3.1. The method has proved highly effective but does not fully eliminate the presence of out-of-bound feature values in the validation set. In particular, when a high-importance feature has an out-of-bound value, the model prediction is likely to be less accurate than average.
This error source can be best visualized by plotting feature value with respect to its SHAP value. Consistent with the convention in Figure 8, Figure 10a plots the scaled feature values vs. the corresponding SHAP values of the most important feature, TdSW, in a dew point prediction model. The model is trained on data from 18 January 2023 to 23 March 2023 and validated on the 24 March 2023 to 31 March 2023 period for site Archerfield Airport. All points are coloured by pointwise absolute error. Similar to the exemplar model explained in Section 5, the SHAP value of TdSW is approximately linearly correlated to its feature value. In the validation set, the samples whose feature values are within the range of the training set (around −2.0 °C to 2.5 °C) largely overlap with the samples in the training set. However, for validation samples with out-of-bound feature values (highlighted by red square region), the SHAP values no longer increase with the feature values, which means the effect of these feature values is likely to be underestimated by the model. Correspondingly, these points are coloured dark green, indicating much higher error than the rest of the samples. This results in the high-error predictions in the time series plotted in Figure 10b, in which the high error points within the red square in Figure 10a are highlighted in red. The figure shows that a sudden drop in dew point occurred on 30 March 2023 (due to a sudden change in the weather pattern) to very low values that were not experienced in the entire training period. As a result, the trained model significantly overestimates the dew point temperature (as did IMPROVER).
Based on the analysis above, we include in MSB an automatic detection of potentially unreliable predictions by examining if the value of the most important feature of any validation point is outside the range of the training set. Figure 11 plots the RMSE and critical error rate of recognized unreliable points as compared to the metrics on all data points for each site. Only predictions with lead day 0 are considered. The results show that for almost all sites, the unreliable subset has a higher error than global. For some sites, the RMSE of the unreliable subset can be twice as high as average (site Curtin, dew point temperature).
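A minimal sketch of this out-of-bound check, assuming pandas DataFrames; `top_feature` is assumed to come from the pre-training importance ranking described in Section 3.1.

```python
import pandas as pd

def flag_out_of_bound(train: pd.DataFrame, valid: pd.DataFrame,
                      top_feature: str) -> pd.Series:
    """True where the most important feature lies outside the training range."""
    lo, hi = train[top_feature].min(), train[top_feature].max()
    return (valid[top_feature] < lo) | (valid[top_feature] > hi)
```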
It should be noted that the appearance of this type of error is very rare. Statistically, only 0.5% of all forecasts are affected by this error. However, this error analysis does highlight the importance of data scaling, without which more out-of-range feature values would appear in the validation set.

6.2. Unreliable Predictions in Poor-Performing Data Cohorts

It is generally understood that error is often not evenly distributed across the entire dataset, and a subset (or equivalently, a “cohort”) of data can have worse accuracy than others. However, underperforming data cohorts are sometimes difficult to find, and various approaches have been proposed in the literature to discover them, as summarized by [27]. In this study, we utilize the method adopted in the Python package InterpretML [35], which involves training a single decision tree on the dataset, with the scaled feature values of the data samples as features and the prediction error as labels. Figure 12 demonstrates an example of such a decision tree, trained with data samples from the same model analysed in Section 6.1 and visualized by InterpretML (installed from raiwidgets 0.36.0). The value in each node represents the MAE of the data that lie within the node, and the filling level of each node represents the proportion of error associated with that node, coloured by the magnitude of error. The figure shows that while the MAE of all data samples is 0.67, the cohort of data with feature values −0.31 < U10 ≤ −0.16 and Hour ≤ 7.50 is subject to a higher MAE of 0.90. This cohort comprises 13.5% of all data but contributes 18.22% of all error. The example shows that a single decision tree can effectively cluster the data into well-performing and poor-performing cohorts.
Assuming no significant model and data drift in the validation set, the data and predictions in the validation set should have similar statistics to the training set. Therefore, we include a second unreliable-prediction identification method in MSB: a single decision tree is trained on the prediction error in the training set to identify high-error data cohorts and their corresponding feature values (a sketch follows below). When data samples in validation are identified, by their feature values, as lying within a high-error cohort, the predictions from these samples are considered unreliable. A high-error data cohort is defined as having an RMSE twice as high as the global RMSE in the training set, while covering a minimum of 1.0% of the data. These thresholds are empirical and defined based on parametric studies.
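A sketch of the cohort search with a single decision tree, here via scikit-learn as a stand-in for the InterpretML tree used in the paper; the function name and thresholds mirror the description above but are our assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def high_error_leaves(X_train, abs_err, min_frac=0.01, factor=2.0):
    """Fit a shallow tree on absolute training error; flag leaves with high RMSE."""
    abs_err = np.asarray(abs_err)
    tree = DecisionTreeRegressor(
        max_depth=3, min_samples_leaf=max(1, int(min_frac * len(X_train))))
    tree.fit(X_train, abs_err)
    leaves = tree.apply(X_train)
    global_rmse = np.sqrt(np.mean(abs_err ** 2))
    bad = {leaf for leaf in np.unique(leaves)
           if np.sqrt(np.mean(abs_err[leaves == leaf] ** 2)) > factor * global_rmse}
    return tree, bad

# Validation points are flagged as unreliable when tree.apply(X_valid)
# maps them into a leaf contained in 'bad'.
```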
Figure 13 demonstrates the RMSE and critical error rate of all predictions and of unreliable predictions identified with this approach. The identified data comprise 7.3% of all prediction points on average. The results shown in Figure 13 indicate that the method is widely effective across sites. The RMSEs of unreliable predictions are 1.1–1.7 times the site average.

6.3. Unreliable Predictions without Local Fit

To improve the reliability of a ML model, Saria and Subbaswamy [25] proposed two principles to determine the reliability of a single sample point: (1) density principle: the tested sample point should be close to the training set; (2) local fit principle: the model should predict accurately on the training points closest to the tested sample point. Based on these two principles, a prediction of the ML model can be considered reliable when the tested sample is close to the training set and lies in the region where the model prediction is accurate.
The density principle has been taken into consideration in Section 6.1, in which the distance of a sample point from the training set is measured by its main linear feature only. Subsequent experiments have shown that alternative approaches to accounting for the density principle do not perform better in identifying unreliable predictions. The local fit principle is partially considered in Section 6.2, in which samples with high prediction error are sorted out by a single decision tree. During the training of the decision tree, however, we must ensure that each leaf contains enough samples, so very rare cases may not be separately labelled in the process. Consequently, we add an alternative approach here by identifying all test sample points whose closest points in the training set are predicted poorly by the model. The closest three points in the training set are found with the k-nearest-neighbour (KNN) algorithm [14]. The local accuracy of the three points is determined by their average absolute error. If this average error exceeds twice the 90th percentile of all training errors, the local accuracy is considered low, and the prediction at the sample point is considered unreliable.
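A minimal sketch of this local-fit check using scikit-learn's NearestNeighbors; the function name and interface are our assumptions, while k = 3 and the doubled 90th percentile follow the description above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_poor_local_fit(X_train, train_abs_err, X_valid, k=3, factor=2.0):
    """True where the k nearest training points were predicted poorly."""
    train_abs_err = np.asarray(train_abs_err)
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_valid)           # indices of k nearest training rows
    local_err = train_abs_err[idx].mean(axis=1)
    threshold = factor * np.percentile(train_abs_err, 90)
    return local_err > threshold
```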
The high-error threshold described above is set empirically, and a higher threshold results in fewer points being identified as unreliable. Figure 14 demonstrates the RMSE and critical error rate of all predictions and of unreliable predictions identified with this approach. The identified data comprise a small proportion of all prediction points on average. The results shown in Figure 14 indicate that the method is also widely effective across sites. The RMSEs of unreliable predictions are 0.6–1.5 times the site average.

6.4. Summary of Error Analysis and Model Reliability

In this section, the different methods of unreliable-prediction identification incorporated in the MSB framework have been introduced. When evaluated separately, all of these methods show satisfactory results across sites, since the data identified as unreliable indeed have significantly higher error in general. These methods approach the same problem from different perspectives, so their results are not mutually exclusive: a single prediction can be identified as unreliable by more than one method. Note that the methods applied here are not exhaustive, and alternative methods and software packages exist in the literature. The methods and the empirical parameters are selected here for their effectiveness and simplicity, so that the unreliability identification process does not significantly affect the efficiency of the framework.
Figure 15 shows the accuracy of unreliable predictions identified by all methods combined and its comparison to the global accuracy. The results show that the overall performance of the combined methods is satisfactory, with the RMSE of unreliable samples 1.1–1.5 times as high as the site average. The contributions of each separate method have been demonstrated in the previous sections.
Figure 16a,b show the time series of IMPROVER and MSB predictions for temperature and dew point in January 2023 and their comparison with observed values. Identified unreliable predictions are highlighted as red dots along the MSB curve. Some predictions with large errors have been successfully identified, such as the temperature predictions on 5 and 19 January, and the dew point predictions on 5 and 15 January; the latter are due to sudden, real changes in dew point that are difficult for the model to capture, and to inaccurate input data from IMPROVER, respectively. The identification of unreliable points agrees with intuition, since the predictions on 5 January are visually unstable and oscillate with time. These low-cost identification methods do not aim to capture all high-error predictions, so there are misses (e.g., temperature on 31 January) and false alarms (e.g., temperature on 12 January). In addition, the identification gives no information on whether the unreliable predictions are less accurate than the IMPROVER input. For example, the minimum temperature prediction on 15 January in Figure 16a is marked as unreliable, but it is still closer to the observed value than IMPROVER.

7. Conclusions and Future Work

This study focuses on the proof of concept of applying XGBoost, a state-of-the-art ML package for tabular data, to hourly site temperature and dew point forecasts up to seven days ahead, showing test results for various sites located across Australia. The resultant ML framework, Multi-SiteBoost, comprises three significant aspects: (1) an optimized data pre-processing pipeline, with each data transformation technique proven effective in a parametric study; (2) the application of SHAP values for model explainability; and (3) the identification of individual unreliable predictions at the moment a prediction is made, so that extra caution can be taken at times when predictions are unreliable and may be subject to higher error.
The predictions generated by the proposed ML framework are evaluated on both RMSE and various customized metrics, and satisfactory prediction accuracy has been achieved on most of the metrics at various sites across Australia at various lead days. Across the selected sites, an overall reduction in RMSE of 11% for temperature and 12% for dew point is estimated on average, as compared to grid values of IMPROVER (which itself is already blended and post-processed from multiple raw NWP data sources). By improving the general forecast accuracy, improvements in other related metrics such as RMSE of daily minimum and maximum, and critical error rate, have been achieved as well—even if the ML models are not trained with those objective functions. However, model accuracy shows noticeable variance across sites, highlighting the necessity of local parameters and hyperparameter tuning instead of a global setup. This is enabled in the current framework, granting flexibility in case of customer interest in the future.
Explaining the trained ML models with SHAP values and SHAP interaction values provides additional insight and increases the trustworthiness of the model. The explanation identifies the distinct effects of each variable, which largely agree with both the global explanation from feature importance and human experience. The identification of the main ‘linear’ features also confirms the significance of proper data scaling prior to training and paves the way for the identification of some unreliable predictions.
To add information to the predictions and as a forecast-auditing procedure, pointwise reliability is evaluated with three different approaches. The unreliability due to out-of-bound feature values follows from an inherent property of the XGBoost algorithm, while the unreliability identified by high-error cohorts and by the local fit principle is based on statistics of the model’s accuracy on the training set. Combined, these approaches identify around 5–10% of all predictions as unreliable, and these data have an RMSE of up to 1.5 times the global RMSE at the tested sites. The pointwise reliability provides foresight into local prediction quality before the observation data become available. As an auditing process for the forecast, it adds information on the confidence level to the deterministic forecast. Like the accuracy of the ML model, the effectiveness of these methods also varies with site location.
It should be noted that while the current study explores specific data ingests, the developed methodology has general application, as the ML training, validation and explanation in this paper do not rely on specific data properties. Similarly, it is also possible to generate forecasts for other variables such as wind speed and solar radiance with the same approach used in this study. However, higher quality data and relevant features certainly improve the prediction quality given the same ML architecture. Future work includes benchmark studies on similar datasets to determine the effectiveness of XGBoost compared with conventional statistical techniques, fine-tuning on customer-specified sites with higher significance, generating probabilistic forecasts, and adding nowcast capabilities to the existing MSB framework, so that real-time observation data can be ingested to correct forecasts in short forecast windows (1 to 6 h ahead).

Author Contributions

Conceptualization, T.L.; Methodology, M.H., T.L. and B.M.; Investigation, M.H. and T.L.; Writing—original draft, M.H.; Writing—review & editing, T.L. and B.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Percentage of reduction (%) of daily maximum RMSE by lead day, temperature forecast.

| Site/Lead Time (Days) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Alice Springs Airport | −14.36 | −8.69 | 4.21 | 9.33 | 9.74 | 14.12 | 10.22 | 5.89 |
| Archerfield Airport | 33.62 | 33.97 | 33.27 | 29.04 | 29.46 | 29.15 | 32.26 | 35.17 |
| Avalon Airport | 10.93 | 15.30 | 3.91 | 6.71 | 8.90 | 10.39 | −0.41 | 5.42 |
| Coffs Harbour Airport | 22.37 | 27.60 | 29.31 | 26.34 | 27.11 | 24.10 | 21.25 | 22.37 |
| Curtin Aero | 17.42 | 12.30 | 12.19 | 13.58 | 21.83 | 20.00 | 22.42 | 18.33 |
| Geraldton Airport | 21.88 | 25.51 | 24.72 | 25.24 | 24.91 | 21.30 | 20.31 | 21.67 |
| Hobart (Ellerslie Road) | 22.17 | 20.20 | 12.63 | 17.57 | 20.33 | 12.64 | 11.98 | 11.43 |
| Mount Isa | 9.52 | 10.41 | 13.59 | 17.48 | 8.35 | 7.77 | 12.37 | 18.17 |
| Tindal RAAF | −0.99 | −1.40 | 5.52 | 13.70 | 19.09 | 18.51 | 8.23 | 9.39 |
| Townsville | 40.44 | 40.21 | 39.99 | 40.68 | 34.52 | 29.96 | 33.72 | 38.71 |
| Woomera Aerodrome | −14.59 | −13.44 | 0.64 | 6.44 | 6.66 | 11.21 | 5.13 | 7.93 |
Table A2. Percentage (%) of reduction of daily maximum RMSE by lead day, dew point temperature forecast.

| Site/Lead Time (Days) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Alice Springs Airport | 22.34 | 30.57 | 24.47 | 26.64 | 30.07 | 29.10 | 22.10 | 26.08 |
| Archerfield Airport | −22.87 | −24.99 | −21.63 | −13.30 | −16.71 | −6.05 | 3.43 | 4.45 |
| Avalon Airport | 10.76 | 17.61 | 19.01 | 16.96 | 21.53 | 19.75 | 17.86 | 21.69 |
| Coffs Harbour Airport | 11.97 | 11.58 | 10.07 | 14.37 | 12.35 | 14.26 | 18.65 | 17.65 |
| Curtin Aero | −4.10 | −0.74 | 5.55 | 11.45 | 12.07 | 11.54 | 14.79 | 13.29 |
| Geraldton Airport | 56.52 | 58.10 | 62.29 | 60.62 | 60.50 | 58.97 | 56.76 | 55.74 |
| Hobart (Ellerslie Road) | 8.77 | −5.70 | −2.54 | 0.98 | 2.58 | 1.53 | −8.63 | −6.81 |
| Mount Isa | −17.98 | −12.35 | −9.86 | −0.34 | 6.55 | 7.97 | 16.46 | 16.59 |
| Tindal RAAF | −8.02 | −14.47 | −9.09 | −6.19 | −2.40 | −4.88 | 5.70 | 4.93 |
| Townsville | 41.35 | 41.40 | 41.61 | 41.98 | 43.87 | 42.99 | 41.05 | 38.62 |
| Woomera Aerodrome | 2.22 | −0.21 | 1.63 | −0.62 | 2.67 | 9.03 | 5.14 | 7.87 |
Table A3. Percentage (%) of reduction of daily minimum RMSE by lead day, temperature forecast.

| Site/Lead Time (Days) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Alice Springs Airport | 27.07 | 24.88 | 26.07 | 18.81 | 14.28 | 16.20 | 31.08 | 28.07 |
| Archerfield Airport | 28.32 | 22.11 | 23.56 | 22.77 | 24.67 | 23.85 | 24.89 | 21.16 |
| Avalon Airport | 16.67 | 12.56 | 13.00 | 17.94 | 14.91 | 17.68 | 19.25 | 13.74 |
| Coffs Harbour Airport | 33.92 | 32.81 | 36.97 | 32.35 | 34.03 | 33.46 | 34.86 | 37.92 |
| Curtin Aero | 9.45 | 10.37 | 14.83 | 11.45 | 12.66 | 14.54 | 19.71 | 23.85 |
| Geraldton Airport | 33.48 | 34.13 | 34.10 | 30.78 | 25.27 | 22.23 | 27.16 | 25.60 |
| Hobart (Ellerslie Road) | 4.28 | 5.19 | 3.68 | 8.67 | 7.37 | 10.29 | 6.01 | 1.66 |
| Mount Isa | 26.31 | 27.49 | 29.11 | 22.43 | 23.41 | 19.70 | 27.23 | 23.23 |
| Tindal RAAF | −0.58 | 2.67 | 8.03 | 11.79 | 5.07 | 5.06 | 24.70 | 21.63 |
| Townsville | 1.43 | 4.40 | 3.12 | 6.61 | 2.38 | 1.86 | 1.99 | 11.35 |
| Woomera Aerodrome | −10.90 | −6.77 | −4.91 | −10.26 | −11.21 | 2.01 | 3.22 | 0.93 |
Table A4. Percentage (%) of reduction of daily minimum RMSE by lead day, dew point temperature forecast.

| Site/Lead Time (Days) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Alice Springs Airport | −1.01 | −1.94 | 5.53 | 15.99 | 14.57 | 17.94 | 12.72 | 9.59 |
| Archerfield Airport | 47.64 | 42.17 | 39.90 | 41.47 | 38.15 | 34.59 | 35.87 | 35.91 |
| Avalon Airport | 3.84 | 9.68 | 9.57 | 12.27 | 14.82 | 13.42 | 5.51 | 6.76 |
| Coffs Harbour Airport | 13.62 | 12.14 | 18.29 | 19.17 | 15.54 | 19.32 | 17.73 | 19.63 |
| Curtin Aero | −3.98 | 1.22 | 4.51 | 7.58 | 8.47 | 8.37 | 6.47 | −0.52 |
| Geraldton Airport | 21.68 | 8.46 | 9.49 | 4.84 | 3.64 | 9.64 | 6.36 | 12.70 |
| Hobart (Ellerslie Road) | 20.57 | 21.22 | 21.69 | 19.79 | 22.14 | 23.02 | 16.99 | 16.53 |
| Mount Isa | 22.57 | 27.18 | 34.10 | 33.14 | 32.53 | 29.46 | 28.11 | 31.96 |
| Tindal RAAF | 24.14 | 21.34 | 20.64 | 30.12 | 20.13 | 19.69 | 16.34 | 14.90 |
| Townsville | −6.72 | −1.34 | −1.47 | 0.54 | 5.46 | 14.01 | 9.25 | −2.09 |
| Woomera Aerodrome | 34.81 | 30.51 | 33.03 | 34.27 | 33.56 | 30.85 | 30.09 | 28.04 |
Table A5. Reduction of critical error rates in percentage (%), hourly temperature forecast.

| Site/Lead Time (Days) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Alice Springs Airport | 7.39 | 8.01 | 10.66 | 10.40 | 10.06 | 7.85 | 7.84 | 8.53 |
| Archerfield Airport | 3.91 | 4.50 | 6.15 | 6.01 | 6.90 | 7.87 | 8.94 | 8.42 |
| Avalon Airport | 1.32 | 1.25 | 1.46 | 1.27 | 2.73 | 2.06 | 2.25 | 2.11 |
| Coffs Harbour Airport | 5.37 | 6.18 | 6.86 | 6.60 | 6.63 | 5.64 | 5.03 | 5.16 |
| Curtin Aero | 0.82 | 0.89 | 1.07 | 2.57 | 2.59 | 3.32 | 4.31 | 6.08 |
| Geraldton Airport | 16.50 | 16.42 | 15.31 | 15.83 | 14.62 | 14.42 | 12.49 | 12.87 |
| Hobart (Ellerslie Road) | 2.78 | 3.16 | 3.69 | 4.09 | 2.93 | 3.39 | 2.18 | 3.72 |
| Mount Isa | 2.73 | 1.46 | 3.21 | 3.76 | 3.88 | 3.22 | 3.42 | 5.10 |
| Tindal RAAF | 3.81 | 4.62 | 4.68 | 5.92 | 5.93 | 4.97 | 6.49 | 6.96 |
| Townsville | 2.58 | 3.67 | 4.08 | 4.59 | 4.39 | 4.22 | 5.02 | 5.68 |
| Woomera Aerodrome | 1.41 | 3.07 | 4.09 | 5.23 | 5.92 | 5.96 | 4.40 | 7.13 |
Table A6. Percentage reduction (%) of critical error rates by lead day, hourly dew point temperature forecast.

| Site / Lead Time (Days) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Alice Springs Airport | 4.46 | 6.27 | 6.77 | 8.83 | 8.02 | 8.40 | 7.71 | 8.09 |
| Archerfield Airport | 8.29 | 8.03 | 7.04 | 8.29 | 8.10 | 8.75 | 8.52 | 6.70 |
| Avalon Airport | 1.20 | 1.55 | 0.72 | 1.25 | 2.30 | 2.29 | 2.03 | 2.54 |
| Coffs Harbour Airport | 0.95 | 0.54 | 0.90 | 0.81 | 0.85 | 2.07 | 4.42 | 5.02 |
| Curtin Aero | 0.42 | 1.16 | 1.47 | 0.98 | −0.06 | 1.51 | 1.14 | 2.23 |
| Geraldton Airport | 20.26 | 21.60 | 22.20 | 23.98 | 25.51 | 23.37 | 24.28 | 23.24 |
| Hobart (Ellerslie Road) | 1.44 | 2.25 | 2.73 | 2.44 | 4.34 | 4.52 | 4.73 | 4.70 |
| Mount Isa | 3.44 | 3.88 | 4.88 | 4.76 | 4.48 | 5.46 | 8.77 | 11.27 |
| Tindal RAAF | 0.64 | 2.10 | 3.08 | 2.93 | 2.95 | 2.73 | 1.06 | 0.74 |
| Townsville | 1.84 | 3.09 | 4.22 | 6.39 | 8.75 | 9.48 | 9.38 | 9.66 |
| Woomera Aerodrome | 9.02 | 10.50 | 9.72 | 8.48 | 5.76 | 4.53 | 4.60 | 3.66 |
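To make the metrics tabulated above concrete, the sketch below shows one way the percentage RMSE reduction and the critical error rate could be computed from paired forecasts and observations. The column names, the data layout, and the 2 °C critical-error threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
import pandas as pd


def rmse(pred: pd.Series, obs: pd.Series) -> float:
    """Root-mean-square error of a forecast series against observations."""
    return float(np.sqrt(np.mean((pred - obs) ** 2)))


def pct_rmse_reduction(df: pd.DataFrame) -> pd.Series:
    """Percentage reduction of MSB RMSE relative to the IMPROVER baseline,
    per lead day. Assumes (hypothetical) columns 'obs', 'improver', 'msb'
    and 'lead_day'."""
    out = {}
    for lead, g in df.groupby("lead_day"):
        base = rmse(g["improver"], g["obs"])
        out[lead] = 100.0 * (base - rmse(g["msb"], g["obs"])) / base
    return pd.Series(out).sort_index()


def critical_error_rate(pred: pd.Series, obs: pd.Series,
                        threshold: float = 2.0) -> float:
    """Percentage of forecasts whose absolute error exceeds a threshold
    (2 degrees C is an assumed value; the paper's definition may differ)."""
    return 100.0 * float(np.mean(np.abs(pred - obs) > threshold))
```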

References

1. Vannitsem, S.; Bremnes, J.B.; Demaeyer, J.; Evans, G.R.; Flowerdew, J.; Hemri, S.; Lerch, S.; Roberts, N.; Theis, S.; Atencia, A.; et al. Statistical Postprocessing for Weather Forecasts: Review, Challenges, and Avenues in a Big Data World. Bull. Am. Meteorol. Soc. 2021, 102, E681–E699.
2. Richardson, D.S. Skill and relative economic value of the ECMWF ensemble prediction system. Q. J. R. Meteorol. Soc. 2000, 126, 649–667.
3. Bakker, K.; Whan, K.; Knap, W.; Schmeits, M. Comparison of statistical post-processing methods for probabilistic NWP forecasts of solar radiation. Sol. Energy 2019, 191, 138–150.
4. Yang, D. Post-processing of NWP forecasts using ground or satellite-derived data through kernel conditional density estimation. J. Renew. Sustain. Energy 2019, 11, 026101.
5. Alerskans, E.; Kaas, E. Local temperature forecasts based on statistical post-processing of numerical weather prediction data. Meteorol. Appl. 2021, 28, e2006.
6. Li, X.; Ma, L.; Chen, P.; Xu, H.; Xing, Q.; Yan, J.; Lu, S.; Fan, H.; Yang, L.; Cheng, Y. Probabilistic solar irradiance forecasting based on XGBoost. Energy Rep. 2022, 8, 1087–1095.
7. Karevan, Z.; Suykens, J.A. Transductive LSTM for time-series prediction: An application to weather forecasting. Neural Netw. 2020, 125, 1–9.
8. Kong, W.; Li, H.; Yu, C.; Xia, J.; Kang, Y.; Zhang, P. A deep spatio-temporal forecasting model for multi-site weather prediction post-processing. Commun. Comput. Phys. 2022, 31, 131–153.
9. Donadio, L.; Fang, J.; Porté-Agel, F. Numerical weather prediction and artificial neural network coupling for wind energy forecast. Energies 2021, 14, 338.
10. Hu, H.; van der Westhuysen, A.J.; Chu, P.; Fujisaki-Manome, A. Predicting Lake Erie wave heights and periods using XGBoost and LSTM. Ocean Model. 2021, 164, 101832.
11. Sushanth, K.; Mishra, A.; Mukhopadhyay, P.; Singh, R. Near-real-time forecasting of reservoir inflows using explainable machine learning and short-term weather forecasts. Stoch. Environ. Res. Risk Assess. 2023, 37, 3945–3965.
12. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T. XGBoost: Extreme Gradient Boosting; R package version 0.4-2; 2015; Volume 1, pp. 1–4. Available online: https://pypi.org/project/xgboost/ (accessed on 16 July 2024).
13. Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why do tree-based models still outperform deep learning on typical tabular data? Adv. Neural Inf. Process. Syst. 2022, 35, 507–520.
14. Dong, J.; Zeng, W.; Wu, L.; Huang, J.; Gaiser, T.; Srivastava, A.K. Enhancing short-term forecasting of daily precipitation using numerical weather prediction bias correcting with XGBoost in different regions of China. Eng. Appl. Artif. Intell. 2023, 117, 105579.
15. Xiong, X.; Guo, X.; Zeng, P.; Zou, R.; Wang, X. A short-term wind power forecast method via XGBoost hyper-parameters optimization. Front. Energy Res. 2022, 10, 905155.
16. Zheng, H.; Wu, Y. A XGBoost model with weather similarity analysis and feature engineering for short-term wind power forecasting. Appl. Sci. 2019, 9, 3019.
17. Gunning, D.; Stefik, M.; Choi, J.; Miller, T.; Stumpf, S.; Yang, G.Z. XAI—Explainable artificial intelligence. Sci. Robot. 2019, 4, eaay7120.
18. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A review of machine learning interpretability methods. Entropy 2020, 23, 18.
19. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
20. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
21. Duckworth, C.; Chmiel, F.P.; Burns, D.K.; Zlatev, Z.D.; White, N.M.; Daniels, T.W.V.; Kiuber, M. Using explainable machine learning to characterise data drift and detect emergent health risks for emergency department admissions during COVID-19. Sci. Rep. 2021, 11, 23017.
22. Dikshit, A.; Pradhan, B. Explainable AI in drought forecasting. Mach. Learn. Appl. 2021, 6, 100192.
23. Dikshit, A.; Pradhan, B. Interpretable and explainable AI (XAI) model for spatial drought prediction. Sci. Total Environ. 2021, 801, 149797.
24. Zhang, X.; Chan, F.T.; Yan, C.; Bose, I. Towards risk-aware artificial intelligence and machine learning systems: An overview. Decis. Support Syst. 2022, 159, 113800.
25. Saria, S.; Subbaswamy, A. Tutorial: Safe and Reliable Machine Learning. arXiv 2019, arXiv:1904.07204.
26. Schulam, P.; Saria, S. Can you trust this prediction? Auditing pointwise reliability after learning. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Japan, 16–18 April 2019; pp. 1022–1031.
27. d’Eon, G.; d’Eon, J.; Wright, J.R.; Leyton-Brown, K. The spotlight: A general method for discovering systematic errors in deep learning models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 1962–1981.
28. Eyuboglu, S.; Varma, M.; Saab, K.; Delbrouck, J.; Lee-Messer, C.; Dunnmon, J.; Zou, J.; Re, C. Domino: Discovering Systematic Errors with Cross-Modal Embeddings. In Proceedings of the 2022 International Conference on Learning Representations, Online, 25–29 April 2022.
29. Hellman, M.E. The nearest neighbor classification rule with a reject option. IEEE Trans. Syst. Sci. Cybern. 1970, 6, 179–185.
30. Roberts, N.; Ayliffe, B.; Evans, G.; Moseley, S.; Rust, F.; Sandford, C.; Trzeciak, T.; Abernethy, P.; Beard, L.; Crosswaite, N.; et al. IMPROVER: The New Probabilistic Postprocessing System at the Met Office. Bull. Am. Meteorol. Soc. 2023, 104, E680–E697.
31. Malistov, A.; Trushin, A. Gradient boosted trees with extrapolation. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 783–789.
32. Lundberg, S.M.; Erion, G.G.; Lee, S.I. Consistent individualized feature attribution for tree ensembles. arXiv 2018, arXiv:1802.03888.
33. Delle Monache, L.; Nipen, T.; Liu, Y.; Roux, G.; Stull, R. Kalman filter and analog schemes to postprocess numerical weather predictions. Mon. Weather Rev. 2011, 139, 3554–3570.
34. Sheridan, P.; Vosper, S.; Smith, S. A physically based algorithm for downscaling temperature in complex terrain. J. Appl. Meteorol. Climatol. 2018, 57, 1907–1929.
35. Nori, H.; Jenkins, S.; Koch, P.; Caruana, R. InterpretML: A unified framework for machine learning interpretability. arXiv 2019, arXiv:1909.09223.
Figure 1. (a) IMPROVER gridded data, (b) selected sites and (c) extracted daily time series from gridded data for Alice Springs. Each single-day forecast contains four issues (updates), published at 00, 06, 12 and 18 UTC.
Figure 2. Overview of IMPROVER hourly temperature forecasts for Alice Springs: (a) observed site values (To) vs. numerical values from IMPROVER (TN); (b) distribution of IMPROVER hourly temperature error, 1–30 October 2022.
Figure 3. Positions of extracted grid values at and around a single site.
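As a rough illustration of this extraction step, the sketch below pulls the nearest grid value and its 3 × 3 block of surrounding points from a gridded NetCDF file with xarray. The file name, variable name and coordinate names are assumptions for illustration, not the actual IMPROVER conventions.

```python
import numpy as np
import xarray as xr

# Hypothetical file and variable names; IMPROVER's NetCDF layout may differ.
ds = xr.open_dataset("improver_forecast.nc")


def extract_site_block(ds: xr.Dataset, lat: float, lon: float,
                       var: str = "air_temperature"):
    """Return the grid value nearest the site and the 3x3 block around it."""
    # Index of the grid cell closest to the site coordinates.
    i = int(np.abs(ds["latitude"].values - lat).argmin())
    j = int(np.abs(ds["longitude"].values - lon).argmin())
    centre = ds[var].isel(latitude=i, longitude=j)
    # Surrounding points, clipped at the grid edge.
    block = ds[var].isel(latitude=slice(max(i - 1, 0), i + 2),
                         longitude=slice(max(j - 1, 0), j + 2))
    return centre, block


# Example: Alice Springs Airport (approximate coordinates).
centre, block = extract_site_block(ds, lat=-23.8, lon=133.9)
```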
Figure 4. Multi-SiteBoost (MSB) flow chart.
Figure 5. Comparison of Multi-SiteBoost results with IMPROVER grid values on various metrics for (a) temperature and (b) dew point temperature. Site: Hobart (Ellerslie Road).
Figure 6. Histograms of pointwise forecast error on lead day 0 for (a) temperature; (b) dew point temperature. Site: Hobart (Ellerslie Road).
Figure 7. Global view of SHAP values on the training set for the top 10 features: (a) SHAP summary plot, coloured by the feature value of each point; (b) feature importance, calculated by aggregating the absolute SHAP value of each point. Training window: 18 January–22 March 2023; trained variable: temperature; site: Archerfield Airport.
Figure 8. SHAP values of three features for all sample points in the training set: (a) Tw; (b) Hour; (c) U10. Training window: 18 January–22 March 2023; trained variable: temperature; site: Archerfield Airport.
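Plots of the kind shown in Figures 7 and 8 can be produced with the shap package. The fragment below is a minimal sketch assuming a trained XGBoost regressor `model` and a training feature DataFrame `X_train` (placeholder names, not the paper's exact plotting code).

```python
import numpy as np
import shap

# TreeExplainer computes exact SHAP values for tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# Figure 7a-style summary: one point per sample per feature,
# coloured by the feature value.
shap.summary_plot(shap_values, X_train, max_display=10)

# Figure 7b-style importance: mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)

# Figure 8-style dependence plot for a single feature (e.g., Tw).
shap.dependence_plot("Tw", shap_values, X_train)
```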
Figure 9. Normalized SHAP interaction values between each pair of features (normalized within each column), coloured by SHAP interaction value. Training window: 18 January–22 March 2023; trained variable: temperature; site: Archerfield Airport. Values below 0.01 are not considered significant but are retained so readers can assess them in context.
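Continuing the sketch above, pairwise interaction strengths of the kind shown in Figure 9 might be obtained and column-normalized as follows; again, this is an illustration rather than the authors' code.

```python
import numpy as np

# SHAP interaction values for a tree ensemble:
# array of shape (n_samples, n_features, n_features).
interactions = explainer.shap_interaction_values(X_train)

# Mean absolute interaction strength per feature pair,
# then normalize each column as in Figure 9.
strength = np.abs(interactions).mean(axis=0)
normalized = strength / strength.sum(axis=0, keepdims=True)
```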
Figure 10. Example of inaccurate outputs from XGBoost due to out-of-bound feature values: (a) feature values vs. corresponding SHAP values for data samples in the training and validation sets, coloured by absolute error of the prediction (circles: training set; squares: validation set); (b) time series of XGBoost predictions in the validation period and site observations. Inaccurate predictions due to out-of-bound feature values are highlighted in red.
Figure 11. Statistics of RMSE and critical error rate of all predictions and unreliable predictions due to out-of-bound feature values. (a) Temperature; (b) dew point temperature. Lead day: 0.
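One simple way to flag such out-of-bound samples is to compare each incoming feature vector against the per-feature range seen during training, as sketched below; the paper's actual reliability rule may be more selective (for example, restricted to the most important features).

```python
import pandas as pd


def flag_out_of_bound(X_train: pd.DataFrame, X_new: pd.DataFrame) -> pd.Series:
    """True for samples where any feature falls outside the range
    observed in the training window."""
    lower, upper = X_train.min(), X_train.max()
    outside = (X_new < lower) | (X_new > upper)
    return outside.any(axis=1)


# Predictions for flagged rows could then be reported as unreliable:
# unreliable = flag_out_of_bound(X_train, X_valid)
```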
Figure 12. High-error data cohort identified by a single decision tree on the training set, visualized by InterpretML ErrorAnalysisBoard. Training window: 18 January–22 March 2023; trained variable: dew point temperature. Site: Archerfield Airport.
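In the same spirit as the InterpretML error analysis in Figure 12, a shallow decision tree fitted to the absolute training errors can isolate high-error cohorts. The depth, leaf size and threshold below are illustrative choices, and `model`, `X_train`, `y_train` and `X_valid` are hypothetical placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Absolute errors of the trained model on its own training window.
abs_err = np.abs(model.predict(X_train) - y_train)

# Shallow surrogate tree: its leaves partition the feature space into cohorts.
surrogate = DecisionTreeRegressor(max_depth=3, min_samples_leaf=50)
surrogate.fit(X_train, abs_err)

# Mark leaves whose mean error is well above average as high-error cohorts.
leaf_ids = surrogate.apply(X_train)
mean_err = {leaf: float(np.mean(abs_err[leaf_ids == leaf]))
            for leaf in np.unique(leaf_ids)}
bad_leaves = [leaf for leaf, m in mean_err.items()
              if m > 2.0 * float(np.mean(abs_err))]

# New samples landing in a high-error leaf are flagged as unreliable.
unreliable = np.isin(surrogate.apply(X_valid), bad_leaves)
```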
Figure 13. Statistics of RMSE and critical error rate of all predictions and unreliable predictions in high-error data cohorts. (a) Temperature; (b) dew point temperature. Lead day: 0.
Figure 14. Statistics of RMSE and critical error rate of all predictions and unreliable predictions without local fit. (a) Temperature; (b) dew point temperature. Lead day: 0.
Figure 15. Summary of RMSE and critical error rate of all predictions and unreliable predictions due to all possible error types. (a) Temperature; (b) dew point temperature. Lead day: 0.
Figure 16. Exemplar time series of XGBoost predictions compared with IMPROVER grid values and site observations, with unreliable data highlighted in red, for (a) temperature and (b) dew point temperature. Lead time: 0–24 h; site: Geraldton.
Table 1. Change of RMSE (°C) of temperature predictions, averaged over all lead days for each site, when a given pre-processing approach is absent from the data pipeline; negative values mean the predictions were more accurate without that step. Prediction period: 27 October–27 November 2022.

| Site / Pre-Processing Method | Change of RMSE If Feature Selection Is Not Performed | Change of RMSE If Surrounding Points Are Excluded | Change of RMSE If Scaling Is Not Performed |
|---|---|---|---|
| Alice Springs Airport | −0.01 | −0.09 | 0.50 |
| Archerfield Airport | 0.01 | 0.02 | 0.20 |
| Avalon Airport | 0.01 | −0.03 | 0.13 |
| Coffs Harbour Airport | 0.03 | 0.12 | 0.19 |
| Curtin Aero | 0.00 | 0.12 | 0.24 |
| Geraldton Airport | 0.01 | 0.19 | 0.09 |
| Hobart (Ellerslie Road) | −0.01 | 0.03 | 0.25 |
| Mount Isa | 0.01 | 0.04 | 0.68 |
| Tindal RAAF | −0.01 | −0.10 | 0.00 |
| Townsville | 0.02 | 0.09 | 0.28 |
| Woomera Aerodrome | −0.02 | −0.01 | 0.30 |
| SUM | 0.05 | 0.37 | 2.87 |
| AVERAGE | 0.005 | 0.034 | 0.261 |
Table 2. Metrics of each site, calculated on all lead days, temperature forecast. Values in parentheses show the change from IMPROVER.

| Site / Metric | Hourly RMSE, MSB /°C | Daily Maximum RMSE, MSB /°C | Daily Minimum RMSE, MSB /°C | Percentage of Critical Error, MSB /% |
|---|---|---|---|---|
| Alice Springs Airport | 2.36 (−0.38) | 2.14 (−0.13) | 2.11 (−0.64) | 32.94% (−8.98%) |
| Archerfield Airport | 1.37 (−0.23) | 1.51 (−0.71) | 1.39 (−0.43) | 12.39% (−6.59%) |
| Avalon Airport | 1.92 (−0.07) | 2.11 (−0.14) | 1.77 (−0.34) | 23.50% (−1.31%) |
| Coffs Harbour Airport | 1.51 (−0.26) | 1.35 (−0.46) | 1.55 (−0.78) | 16.64% (−5.46%) |
| Curtin Aero | 1.79 (−0.13) | 1.67 (−0.33) | 1.33 (−0.24) | 20.60% (−2.73%) |
| Geraldton Airport | 2.09 (−0.60) | 2.22 (−0.68) | 2.09 (−0.87) | 29.67% (−13.45%) |
| Hobart (Ellerslie Road) | 1.64 (−0.12) | 1.95 (−0.37) | 1.34 (−0.06) | 19.53% (−3.30%) |
| Mount Isa | 2.19 (−0.21) | 1.96 (−0.25) | 2.00 (−0.56) | 31.95% (−2.63%) |
| Tindal RAAF | 1.82 (−0.16) | 1.57 (−0.14) | 1.38 (−0.15) | 22.70% (−5.34%) |
| Townsville | 1.13 (−0.20) | 0.99 (−0.60) | 1.15 (−0.04) | 7.44% (−4.39%) |
| Woomera Aerodrome | 1.93 (−0.11) | 1.83 (−0.04) | 1.75 (+0.07) | 21.44% (−4.28%) |
Table 3. Metrics of each site, calculated on all lead days, dew point temperature forecast. Values in parentheses show the change from IMPROVER.

| Site / Metric | Hourly RMSE, MSB /°C | Daily Maximum RMSE, MSB /°C | Daily Minimum RMSE, MSB /°C | Percentage of Critical Error, MSB /% |
|---|---|---|---|---|
| Alice Springs Airport | 2.67 (−0.46) | 2.40 (−0.88) | 2.66 (−0.30) | 38.55% (−7.40%) |
| Archerfield Airport | 1.73 (−0.41) | 1.46 (+0.13) | 2.02 (−1.29) | 17.25% (−7.96%) |
| Avalon Airport | 1.64 (−0.01) | 1.51 (−0.35) | 1.75 (−0.18) | 19.00% (−1.65%) |
| Coffs Harbour Airport | 1.43 (−0.16) | 1.36 (−0.19) | 1.82 (−0.34) | 11.99% (−2.11%) |
| Curtin Aero | 1.99 (−0.02) | 1.69 (−0.10) | 2.14 (−0.12) | 23.63% (−0.85%) |
| Geraldton Airport | 2.39 (−1.05) | 1.63 (−2.26) | 2.47 (−0.27) | 31.16% (−22.45%) |
| Hobart (Ellerslie Road) | 1.80 (−0.09) | 1.79 (+0.05) | 2.01 (−0.49) | 21.38% (−3.47%) |
| Mount Isa | 2.68 (−0.53) | 2.38 (−0.12) | 2.75 (−1.16) | 38.16% (−5.78%) |
| Tindal RAAF | 1.55 (−0.12) | 1.27 (+0.06) | 1.83 (−0.49) | 14.73% (−2.17%) |
| Townsville | 1.34 (−0.29) | 1.10 (−0.80) | 2.01 (−0.11) | 9.24% (−6.57%) |
| Woomera Aerodrome | 2.49 (−0.35) | 2.17 (−0.11) | 2.30 (−1.07) | 34.54% (−7.15%) |
Table 4. Percentage reduction (%) of hourly RMSE by lead day, temperature forecast.

| Site / Lead Time (Days) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Alice Springs Airport | 15.27 | 13.66 | 14.38 | 14.67 | 13.93 | 13.94 | 13.29 | 12.07 |
| Archerfield Airport | 16.10 | 14.87 | 15.45 | 13.19 | 13.71 | 13.72 | 13.58 | 14.21 |
| Avalon Airport | 3.75 | 3.32 | 1.75 | 2.81 | 6.91 | 5.60 | 4.63 | 5.67 |
| Coffs Harbour Airport | 16.22 | 16.30 | 16.45 | 15.10 | 13.70 | 14.71 | 14.44 | 14.91 |
| Curtin Aero | 1.46 | 1.79 | 3.54 | 5.36 | 6.55 | 8.82 | 10.19 | 13.06 |
| Geraldton Airport | 27.58 | 27.36 | 26.12 | 24.16 | 22.44 | 21.98 | 21.40 | 21.23 |
| Hobart (Ellerslie Road) | 9.86 | 9.88 | 8.59 | 8.32 | 7.01 | 6.29 | 4.59 | 5.22 |
| Mount Isa | 8.78 | 7.69 | 8.68 | 8.76 | 9.28 | 10.22 | 13.64 | 14.14 |
| Tindal RAAF | 7.18 | 6.25 | 6.13 | 8.78 | 9.36 | 7.09 | 8.19 | 8.50 |
| Townsville | 15.54 | 17.45 | 17.29 | 16.64 | 12.91 | 12.52 | 12.37 | 15.18 |
| Woomera Aerodrome | 4.34 | 4.71 | 6.64 | 7.76 | 6.73 | 9.33 | 5.80 | 5.86 |
Table 5. Percentage reduction (%) of hourly RMSE by lead day, dew point temperature forecast.

| Site / Lead Time (Days) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Alice Springs Airport | 10.70 | 11.74 | 15.30 | 17.89 | 18.30 | 14.66 | 12.89 | 13.30 |
| Archerfield Airport | 24.52 | 19.43 | 19.53 | 20.38 | 18.04 | 17.83 | 19.41 | 17.06 |
| Avalon Airport | 0.54 | 2.62 | 3.34 | 3.76 | 2.65 | 0.23 | −1.66 | −1.82 |
| Coffs Harbour Airport | 8.36 | 8.05 | 10.83 | 9.78 | 8.26 | 9.42 | 11.00 | 8.87 |
| Curtin Aero | −6.46 | −1.26 | 2.78 | 4.47 | 4.49 | 4.96 | 3.47 | 2.67 |
| Geraldton Airport | 37.13 | 33.37 | 31.56 | 30.58 | 30.59 | 30.50 | 29.48 | 28.14 |
| Hobart (Ellerslie Road) | 6.04 | 7.02 | 7.41 | 4.28 | 8.43 | 7.05 | 1.91 | −0.90 |
| Mount Isa | 4.62 | 8.64 | 14.55 | 18.51 | 18.44 | 16.60 | 19.67 | 21.49 |
| Tindal RAAF | 3.75 | 6.14 | 8.16 | 8.47 | 7.93 | 9.46 | 6.42 | 7.24 |
| Townsville | 10.61 | 11.57 | 14.16 | 15.77 | 22.43 | 24.06 | 19.77 | 14.35 |
| Woomera Aerodrome | 16.97 | 15.90 | 14.78 | 13.49 | 14.15 | 11.24 | 10.57 | 7.88 |