1. Introduction
Developing and developed countries alike are paying significant attention to the issue of atmospheric air quality and the associated climate change, as this remains one of the most critical concerns worldwide [1], and South Africa is not an exception in this endeavour. The ever-increasing levels of atmospheric air pollutants or emissions have a negative impact on human health and the overall quality of life [2,3]. Numerous epidemiological studies have demonstrated that nitrogen dioxide (NO2) is a significant atmospheric air pollutant associated with adverse health effects in humans, including respiratory and cardiovascular diseases [4]. Nitrogen monoxide (NO) and NO2 are the most relevant of the nitrogen oxide (NOX) emissions to the air, causing significant pollution [5]. About 40% of NOX emissions are produced from point sources of electric power plant boilers [6].
Air pollutants, particularly NO2, are a major environmental concern associated with the production of electricity from coal-fuelled power stations. Eskom, the main energy producer in South Africa, uses coal as the primary source of power in electricity generation at over 13 coal-fired power stations. This method of producing electricity is responsible for the emission of dangerous pollutants, including NO2. To safeguard human health, atmospheric air quality standards for high-priority air pollutants, including NO2, have been imposed by authorities or regulatory bodies. This obligates Eskom to ensure and maintain adequate atmospheric air quality standards by limiting its emissions.
Air quality models are essential in all aspects of atmospheric air pollution control and planning, where prediction forms a significant component [7,8]. Mathematical models are frequently used to predict the temporal and spatial dispersion of air pollutants and to understand the effects of different atmospheric air pollutants [9]. Khare and Sharma [10] highlighted three distinct methods for modelling atmospheric air quality data, namely, deterministic modelling (analytical and numerical models), statistical modelling, and physical modelling. Sophisticated temporal and spatial air quality models have been employed to address the challenges associated with predicting air pollutant concentrations emitted during the production of electricity from thermal power plants, resulting in various management strategies to mitigate these harmful pollutants [11,12]. The authors of [11,12] propose a high-dimensional multi-objective optimal dispatching strategy for managing the generation of electricity with the aim of reducing emissions and emission costs [11]. The VAR-XGBoost model, based on vector autoregression (VAR), the Kriging method, and XGBoost (Extreme Gradient Boosting), has been used to improve pollutant prediction accuracy and obtain its spatial distribution over a continuous period of time [12].
Statistical techniques involve treating the air pollution concentration as a random variable and then identifying and estimating the most suitable probability distribution to forecast the frequency of national ambient air quality standard exceedances [13,14]. The inherent properties of probability density functions make them a popular choice for modelling atmospheric emission as they cater to uncertainty in the pollutant [14].
Several statistical parent probability distributions are used to fit air pollutant concentration data. The parent distribution fitting method involves fitting a probability distribution to emissions data, where the selected distribution serves as the “parent distribution” from which observed emissions are assumed to originate. Parent distributions fit best where most of the data are concentrated, that is, around the mean, mode, or median. Among these distributions are the Weibull, Gamma, Lognormal, and type V Pearson distributions [14,15].
Accurate modelling of NO2 emissions is essential for understanding the impact of this pollutant on air quality and, hence, in developing effective mitigation strategies. Many studies involving emissions data often assume that the data have a Normal distribution and, hence, use Normally distributed regression analysis. However, this is not invariably the case, and models based on the Normal distribution are often indiscriminately applied to data that may be better handled otherwise. This is especially true of atmospheric emissions data.
The distributional assumptions made about the emission variable can have a critical impact on the conclusions drawn. An alternative approach to the Normal distribution based regression model assumption is to use some member of the family of generalised linear models.
This study investigates and compares the performance of Lognormal and Loggamma distributed Generalised Linear Models (GLMs) as potential alternatives to the traditional simple linear regression model in predicting NO2 emissions generated during the production of electricity from Eskom’s 13 coal-fuelled power stations in South Africa. The Lognormal and Loggamma models are particularly suitable when emissions data are skewed, with an increasing variance and a heavy tail [16]. Furthermore, this study aims to identify significant variables and interactions that contribute information for the prediction of NO2 emissions from Eskom’s coal-fuelled power stations.
The use of GLMs in environmental modelling offers several advantages over the Normal distribution based regression model. GLMs are flexible statistical models that can cater to response variables that are not Normally distributed and can capture complex relationships between predictors and the response variable. By incorporating the Lognormal and Loggamma distributions, GLMs are well suited for modelling positive, right-skewed data with a variance that may increase with the mean, making them particularly relevant for NO2 emissions data.
2. Literature Review
The literature on NO2 emissions is very limited, especially in the South African context, but other atmospheric emissions that may be assumed to follow the same statistical distributions are discussed below.
In a study in the United Kingdom, Hadley et al. [17] investigated the distribution of annual mean daily sulfur dioxide (SO2) and concluded that the Lognormal distribution was a better fit for the data compared to the Normal distribution. In another study on the distribution of particulate matter emissions in Washington, Rumburg et al. [18] concluded that the three-parameter Lognormal distribution and a generalised extreme value distribution (GEVD) had the best fit to the data. Kao and Friedlander [19] further observed that the frequency distribution of (non-reactive) aerosol components of particulate matter PM10 (particulate matter of size 10 micrometres or less) and the source contributions, for most sources, followed an approximately Lognormal distribution in the South Coast Air Basin (SoCAB), United States of America.
Three theoretical distributions (Lognormal, Weibull, and type V Pearson) were used to fit PM10 concentrations in the Belgrade urban area for the 2003–2005 period. The type V Pearson distribution was found to be the most appropriate in representing PM10 daily average concentrations [20]. In a similar study by Lu, the Gamma distribution was the best of the three theoretical parent frequency distributions (Lognormal, Weibull, and Gamma) in representing the distribution of high PM10 concentrations in central Taiwan [15]. The measured data on NO2, PM10, PM2.5, and SO2 in Santiago were fitted using the type V Pearson distribution [21]. Zhang et al. [22] investigated the statistical distribution of automobile emissions, namely, carbon monoxide (CO) and hydrocarbons (HCs), from various locations in the United States, and found that these emissions followed a Gamma distribution.
Other techniques to model emission data have been used in the past. Sahsuvaroglu et al. [23] used a land use regression (LUR) model to predict NO2 concentrations in Hamilton, Ontario, Canada. In other studies, such as those by Boznar et al. [24], Gardner and Dorling [25], and Comrie [26], among others, a component of machine learning, namely, the multilayer neural network technique, was used to forecast emissions data.
Atmospheric emission data are rarely modelled using GLMs, and yet such data are not always Normally distributed. The current study fits Lognormal and Loggamma distribution based GLMs to the NO2 emissions data from Eskom’s 13 coal-fuelled power stations and compares them with a Normal distribution based regression analysis.
3. Methodology
This section discusses linear regression and the GLM models. In regression analysis, the data are assumed to be Normally distributed.
3.1. Linear Regression
The following model will be fitted stepwise to the NO2 emissions data initially. Analysis of Covariance (ANCOVA) will be used since the response variable is continuous and the explanatory variables are both continuous and categorical. The stepwise procedure implies that only the significant variables may be used in the final model. Backward stepwise variable selection is used to try and find the best model. This model selection procedure allows one to start with a more complex model that includes all possible or available variables and the interaction terms of these variable terms.
The full model in this case is given as:
where
“” is the response, i.e., NO2 emission in tons from filter in plant at time in month .
is the joint effect on NO2 emission of electricity sent out in Gigawatt-hours (by filter in plant at given age, at time in month ), of the power plant at time t (in years), the plant, filter and the month.
is the joint effect of electricity sent out in Gigawatt-hours (by filter in plant at a given age at time in month s), of the power plant at time t (in years), the plant and the month. It means the difference in effect of electricity sent out in Gigawatt-hours (by filter in plant at given age at in month ) on NO2 emission in tons depending on the of the power plant at time t (in years), the plant and the month.
is the error term.
Similar interpretations can be made for the other interaction parameters. The model includes all the variables recorded in this study, including interaction terms between the variables.
3.1.1. Model Selection
Backward Model Selection Procedure
The backward stepwise variable selection method eliminates, at each step, an explanatory variable that is not statistically significant. Schwarz’s Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), the residuals, and other plots are used to check whether the elimination of an explanatory variable leads to a good and adequate model; hence, they are also used to determine the best-fitting model. The lowest values of BIC and AIC are associated with the best model. The BIC and AIC formulae are given as follows [27]:
$$\mathrm{BIC} = -2\ell + p\ln(n) \tag{2}$$
and
$$\mathrm{AIC} = -2\ell + 2p, \tag{3}$$
where $\ell$ is the log-likelihood, $p$ is the number of parameters, and $n$ is the number of observations.
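As a minimal sketch of how the two criteria rank competing models, the following Python functions assume the standard definitions AIC = -2ℓ + 2p and BIC = -2ℓ + p ln(n). The function names, the two candidate models, and their log-likelihood values are hypothetical and serve only as an illustration; they are not results from this study.

```python
import math


def aic(log_lik: float, n_params: int) -> float:
    """Akaike Information Criterion: -2 * log-likelihood + 2 * p."""
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik: float, n_params: int, n_obs: int) -> float:
    """Schwarz's Bayesian Information Criterion: -2 * log-likelihood + p * ln(n)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)


# Hypothetical fits of two candidate models to the same data set.
candidates = {
    "full": {"log_lik": -1200.0, "n_params": 40},
    "reduced": {"log_lik": -1205.0, "n_params": 25},
}
n_obs = 1000  # hypothetical number of observations

# The preferred model is the one with the lowest criterion value.
best_by_aic = min(
    candidates,
    key=lambda m: aic(candidates[m]["log_lik"], candidates[m]["n_params"]),
)
best_by_bic = min(
    candidates,
    key=lambda m: bic(candidates[m]["log_lik"], candidates[m]["n_params"], n_obs),
)
```

In this made-up comparison, the reduced model loses a little log-likelihood but saves 15 parameters, so both criteria prefer it. BIC penalises extra parameters more heavily than AIC whenever ln(n) exceeds 2.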
The best variable fitting model is found for the Normal, Lognormal, and Loggamma distribution models, all with identity link functions. The elimination process starts with a model containing all the following explanatory variables: power station, electricity sent out in Gigawatt-hours from the power station, age in years of the power station, type of filter used at the power station, month of emission, and all their interaction terms.
Box plots, histograms, the Kolmogorov–Smirnov test, and quantile–quantile (QQ) plots are used to check the Normality assumption of the data and residuals. A symmetric and bell-shaped histogram suggests the data follow a Normal distribution and the model residuals suggest a good fit.
The best model is to be selected between the Normal, Lognormal, and the Loggamma distribution based GLMs, as discussed later.
3.2. Generalised Linear Models
The data used in a classical regression model are assumed to be symmetric and to belong to a Normal distribution with a constant variance,
. However, in practice, it is common to find data in the form of continuous measurements, where Normality, symmetry, and homoscedasticity assumptions do not hold. In such cases, the Lognormal and Loggamma distributions may be better candidates to model the data, especially if the variance of the data is suspected to increase with the mean of the data, that is, for some constant
, if the variance of the data is of the form:
3.2.1. The Logarithmic Transformation
Generally, transforming the data has three main purposes: stabilising the variance in the response variable (homoscedasticity), improving the model fit to the data (additive effects), and making the distribution of the response variable closer to the Normal distribution (thus, approximating symmetry) [
28].
For a random variable with a variance that is proportional to the mean, as in Equation (4), one of the variance-stabilising transformations involves taking the logarithm of the response, , such that:
is a constant [
29].
This transformation also has a normalising effect on a distribution that is positively skewed [
30].
3.2.2. The Lognormal Distribution
Suppose the random variable $X$ follows a Normal distribution with parameters $\mu$ and $\sigma^2$. Consider the transformation $Y = e^{X}$ or, equivalently, $X = \ln Y$; then, $\ln Y \sim N(\mu, \sigma^2)$. That is to say, the logarithm of the random variable $Y$ follows a Normal distribution. Therefore, a random variable $Y$ with positive continuous values follows a Lognormal distribution if it has a probability density function that can be written in the following form [16]:
$$f(y) = \frac{1}{y\,\sigma\sqrt{2\pi}} \exp\left\{-\frac{(\ln y - \mu)^2}{2\sigma^2}\right\}, \quad y > 0, \tag{5}$$
where $1/y$ is the Jacobian of the transformation $X = \ln Y$.
The Lognormal distribution has a constant coefficient of variation, as shown below
$$CV(Y) = \frac{\sqrt{\mathrm{Var}(Y)}}{E(Y)} = \sqrt{e^{\sigma^2} - 1},$$
where $E(Y) = e^{\mu + \sigma^2/2}$ and $\mathrm{Var}(Y) = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}$.
The variance of the Lognormal model increases with the mean in such a way that the coefficient of variation is constant. Lognormal modelling can therefore be used to compensate for this increase, so that the residuals are approximately homoscedastic.
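The constant coefficient of variation can be checked numerically. The sketch below assumes the usual Lognormal moments, E(Y) = exp(μ + σ²/2) and Var(Y) = (exp(σ²) − 1) exp(2μ + σ²); the function names and parameter values are illustrative only.

```python
import math


def lognormal_mean(mu: float, sigma: float) -> float:
    """Mean of Y when ln(Y) ~ N(mu, sigma^2)."""
    return math.exp(mu + 0.5 * sigma**2)


def lognormal_variance(mu: float, sigma: float) -> float:
    """Variance of Y when ln(Y) ~ N(mu, sigma^2)."""
    return (math.exp(sigma**2) - 1.0) * math.exp(2.0 * mu + sigma**2)


def lognormal_cv(mu: float, sigma: float) -> float:
    """Coefficient of variation: standard deviation divided by the mean."""
    return math.sqrt(lognormal_variance(mu, sigma)) / lognormal_mean(mu, sigma)


# Illustrative values: the mean changes with mu, the CV does not.
sigma = 0.5
cvs = [lognormal_cv(mu, sigma) for mu in (0.0, 2.0, 5.0)]
means = [lognormal_mean(mu, sigma) for mu in (0.0, 2.0, 5.0)]
```

Under these assumptions, the three entries of `cvs` agree with sqrt(exp(σ²) − 1) up to rounding, while `means` increases with μ. The standard deviation therefore grows in proportion to the mean.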
3.2.3. The Loggamma Distribution
Let
be a random variable that has a Gamma distribution if it has a probability density function that can be written in the form
with the mean and variance given by:
and
, respectively, such that the coefficient of variation of
is:
since
.
Unlike the Normal distribution with the constant variance assumption, the Gamma model has a variance that increases with the mean. The variance increases with the mean in such a way that the coefficient of variation is again a constant. The Gamma model can be used to compensate for such increases in variance with the mean in modelling.
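A short numerical sketch of this property follows. It assumes the shape and rate parameterisation of the Gamma distribution (mean = shape/rate, variance = shape/rate²), which may differ from the parameterisation used in the density above; the function names and values are hypothetical.

```python
import math


def gamma_mean(shape: float, rate: float) -> float:
    """Mean of a Gamma(shape, rate) random variable."""
    return shape / rate


def gamma_variance(shape: float, rate: float) -> float:
    """Variance of a Gamma(shape, rate) random variable."""
    return shape / rate**2


def gamma_cv(shape: float, rate: float) -> float:
    """Coefficient of variation; it depends on the shape parameter only."""
    return math.sqrt(gamma_variance(shape, rate)) / gamma_mean(shape, rate)


# Illustrative values: fix the shape and vary the rate, so the mean changes.
shape = 4.0
rates = (0.001, 0.01, 1.0)
cvs = [gamma_cv(shape, r) for r in rates]
```

With the shape fixed, the variance equals mean²/shape, so it grows with the square of the mean while the coefficient of variation stays at 1/sqrt(shape).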
Considering the transformation or, equivalently, ; then, .
That is, the logarithm of a random variable follows a Gamma distribution. Taking the logarithmic transform of the data will further tame the variance of the data. Therefore, a random variable
with positive continuous values follows a Loggamma distribution if it has a probability density function that can be written in the following form [
16]:
with mean and variance given by:
, for and , for respectively.
The tail of the distribution is given by
[
31].
The Normal distribution assumes the data are symmetric. However, emission data are not always symmetric and may be skewed. In such cases, the Loggamma distribution provides a good alternative. This distribution can also cater to heavy-tailed data [
32] since it belongs to Pareto-type tail distributions; see Albrecher et al. [
32] for details.
Broadly speaking, under the Loggamma distribution, a unimodal distribution represents mid-sized data, and large observations are represented by a Pareto-like tail [
33].
3.2.4. The Exponential Family Distribution
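Let $Y$ be a random variable with a distribution in the exponential family and a pdf $f(y; \theta)$ in the standard form:
$$f(y; \theta) = \exp\left[a(y)\,b(\theta) + c(\theta) + d(y)\right].$$
The distribution is said to be in canonical form when $a(y) = y$.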
In order to be a GLM, a model must have the three components, namely, an error distribution, a linear predictor, and a link function [
27]. The Lognormal and Loggamma distributions are in the exponential family of distributions for the following reasons:
The Lognormal distribution has independent response variables
with the pdf given in Equation (5) and rewritten as
where
,
,
, and
The Loggamma distribution assumes independent response variables
with the pdf given in Equation (8) and rewritten as
where
is the natural parameter of the distribution,
and
.
- 2.
Linear Predictor.
The linear predictor
is chosen when, for instance, the parameters
and explanatory variable vector
are such that
where
.
- 3.
Link Function.
There exists a monotone link function such that .
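Box and Cox [34] introduced a flexible family of transformations: the power transformations. For a given parameter $\lambda$, the transformation is defined by
$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[2mm] \ln y, & \lambda = 0. \end{cases}$$
To determine the most suitable data transformation, the Box–Cox method is used to estimate the value of $\lambda$.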
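The power transformation itself is simple to write down. The following sketch assumes the standard Box–Cox definition, (y^λ − 1)/λ for λ ≠ 0 and ln(y) for λ = 0; it does not estimate λ, which in practice is done by maximising a profile likelihood. The function name is illustrative.

```python
import math


def box_cox(y: float, lam: float) -> float:
    """Box-Cox power transformation of a single positive value."""
    if y <= 0:
        raise ValueError("The Box-Cox transformation requires y > 0.")
    if lam == 0:
        # Limiting case of (y**lam - 1) / lam as lam tends to 0.
        return math.log(y)
    return (y**lam - 1.0) / lam


# lam = 1 only shifts the data, so it corresponds to "no transformation".
shifted = box_cox(5.0, 1.0)
# lam = 0.5 corresponds to a square-root type transformation.
root_like = box_cox(4.0, 0.5)
# lam = 0 corresponds to the logarithmic transformation.
logged = box_cox(10.0, 0.0)
```

A value of λ close to 1 leaves the shape of the data unchanged, which is why an estimate near 1 is usually read as requiring no transformation.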
Generally, transformations are used for three purposes: stabilising the response variance, making the distribution of the response variable closer to a normal distribution, and improving the fit of the model to the data. The last objective could include model simplification, say, by finding a suitable link function. Sometimes, a transformation will be reasonably effective in simultaneously accomplishing more than one of these objectives.
According to Myers et al. [
28], the natural values for
are as follows:
In the case of the dataset used in this study, $\lambda$ is approximated to 1 ($\lambda \approx 1$). Although the confidence interval [0.6, 0.7] in the Box–Cox plot in Figure 1 does not contain 1, the result is deemed close enough to warrant no transformation of the data. This value of $\lambda$ is obtained by rounding off [35], since the confidence interval also does not contain 0.5. For practical reasons, it makes sense to assume $\lambda = 1$. Therefore, there is no need to transform the data further, especially in the linear regression case.
There exists a monotone link function
such that
, for the NO
2 emissions data. An appropriate link function is chosen based on the nature of the data of interest. With the response variable being continuous and positive, the link function may be chosen from the following examples,
Linear regression is a GLM with an identity link and Normally distributed data. In a generalised linear model, the mean may be transformed by the link function instead of transforming the emission itself.
The two methods of transformation can lead to quite different results; for example, the mean of log-transformed emissions is not the same as the logarithm of the mean emissions. In general, the former cannot easily be transformed into a mean emission. Transforming the mean often allows the results to be more easily interpreted, especially because the mean parameters remain on the same scale as the measured emissions. Since the Loggamma GLM is compared to the linear regression and Lognormal GLM, the identity link function is used for all three models.
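The difference between transforming the response and transforming the mean can be illustrated with a small numerical sketch. The three emission values below are hypothetical and were chosen only to make the arithmetic transparent.

```python
import math

# Hypothetical monthly emissions (in tons), not taken from the study data.
emissions = [100.0, 1000.0, 10000.0]

arithmetic_mean = sum(emissions) / len(emissions)

# Transforming the response: average the logarithms of the emissions.
mean_of_logs = sum(math.log(y) for y in emissions) / len(emissions)

# Transforming the mean: take the logarithm of the average emission.
log_of_mean = math.log(arithmetic_mean)

# Back-transforming the mean of the logs gives the geometric mean,
# which is not the arithmetic mean emission.
geometric_mean = math.exp(mean_of_logs)
```

For these values, the back-transformed mean of the logs is the geometric mean (1000 tons), whereas the arithmetic mean is 3700 tons. By Jensen's inequality, the mean of the logs never exceeds the log of the mean.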
3.2.5. Model Selection
Similar to the linear regression model above, forward selection, Schwarz’s Bayesian Information Criterion (BIC), residual plots, and residuals against predicted value plots are used to select variables with a significant effect in the prediction of NO2 emissions. Backward selection is considered in determining the best-fitting model with interaction variable terms.
For all GLMs, maximum likelihood estimation (MLE) is the main method of estimation used [36]. In the MLE method, a standard assessment involves comparing the fitted model with a full or saturated model [37]. Suppose $\beta_{\max}$ is the parameter vector of the saturated model and $b_{\max}$ is the ML estimator of $\beta_{\max}$. $L(b_{\max}; y)$ denotes the likelihood function of the saturated model evaluated at $b_{\max}$. For the maximum value $L(b; y)$ of the likelihood function of the model of interest, we have $\ell(b_{\max}; y)$ and $\ell(b; y)$ as the associated log-likelihoods [27], such that the deviance $D$ is given by
$$D = 2\left[\ell(b_{\max}; y) - \ell(b; y)\right].$$
The deviance against the degrees of freedom is used to determine if a distribution model is a good fit to the data or not. A model with a deviance that is smaller than the degrees of freedom is considered a good fit.
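This rule of thumb can be sketched in a few lines, assuming the usual definition of the deviance as twice the difference between the saturated and fitted log-likelihoods, and residual degrees of freedom equal to the number of observations minus the number of estimated parameters. The function names and numerical values are hypothetical.

```python
def deviance(loglik_saturated: float, loglik_model: float) -> float:
    """Deviance: twice the gap between saturated and fitted log-likelihoods."""
    return 2.0 * (loglik_saturated - loglik_model)


def deviance_per_df(dev: float, n_obs: int, n_params: int) -> float:
    """Deviance divided by the residual degrees of freedom (n - p)."""
    residual_df = n_obs - n_params
    if residual_df <= 0:
        raise ValueError("The model must have positive residual degrees of freedom.")
    return dev / residual_df


# Hypothetical values for illustration only.
dev = deviance(loglik_saturated=-100.0, loglik_model=-150.0)
ratio = deviance_per_df(dev, n_obs=120, n_params=20)

# Rule of thumb used in the text: a ratio of at most 1 suggests an adequate fit.
adequate = ratio <= 1.0
```

A ratio well above 1 would instead point to lack of fit or overdispersion relative to the assumed distribution.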
4. Results
This section gives the results for all 13 power stations used in modelling NO2 emissions (in tons per month). The data are monthly NO2 emissions per Eskom station, for a maximum period of 108 months (between 2005 and 2014). These data are presented in the Supplementary Materials and include the following variables: NO2 emissions, the power station itself, the amount of electricity generated from the power station, the age in years of the power station, the abatement technology (filter) used at the particular power station, and the month of the year.
Table 1 gives information on the power stations, including the installed/nominal capacity of the power station and the location [38].
The analysis was performed using the SAS Studio application under SAS OnDemand for Academics (SAS 9.4 M8, SAS Institute, Cary, NC, USA).
4.1. Descriptive Statistics
In Table 2, in the given period, the highest amount of monthly NO2 emissions was 13,923.25 tons, which occurred in March 2009 at the Majuba (in Volksrust, Mpumalanga province, South Africa) power station when it was 13.25 years old. On the other hand, the lowest amount of monthly NO2 emissions of 26.42 tons was recorded in September 2009 at the Komati (in Middelburg, Mpumalanga province) power station.
Komati power station emitted the lowest average monthly NO2 of 1422.23 tons, while Majuba power station emitted the highest at 10,433.49 tons per month.
The average monthly electricity sent out, measured in GWhs, was highest at Matimba power station (located in Lephalale, in the Limpopo province, South Africa), while Komati power station had the lowest average monthly electricity sent out.
In 2014, the age of the oldest power station was 44 years, while the youngest was 18 years, with Hendrina (in Mpumalanga province) being the former and Majuba (Volksrust, Mpumalanga) being the latter.
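The efficiency of a power station is measured by calculating the relative NO2 emissions (tons/Gigawatt-hour) rather than just observing the amount of NO2 emissions (in tons), calculated as follows:
$$\text{Relative NO}_2 = \frac{\text{NO}_2 \text{ emissions (tons)}}{\text{Electricity sent out (GWh)}}.$$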
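The ranking by relative emissions can be sketched as follows, assuming that relative NO2 is the ratio of emissions (in tons) to electricity sent out (in GWh). The station labels and figures are hypothetical placeholders, not the values in Table 2.

```python
def relative_emissions(no2_tons: float, electricity_gwh: float) -> float:
    """Tons of NO2 emitted per gigawatt-hour of electricity sent out."""
    if electricity_gwh <= 0:
        raise ValueError("Electricity sent out must be positive.")
    return no2_tons / electricity_gwh


# Hypothetical monthly averages: (NO2 in tons, electricity sent out in GWh).
stations = {
    "Station A": (3000.0, 2000.0),
    "Station B": (1400.0, 400.0),
    "Station C": (9000.0, 3000.0),
}

relative = {
    name: relative_emissions(tons, gwh) for name, (tons, gwh) in stations.items()
}

# Rank 1 is the most efficient station, i.e. the lowest tons per GWh.
ordered = sorted(relative, key=relative.get)
ranks = {name: position + 1 for position, name in enumerate(ordered)}
```

In this made-up example, Station B emits the least in absolute terms but ranks last, because it emits the most per unit of electricity. This is why relative rather than absolute emissions are used to compare stations.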
Matimba power station was determined to be the most efficient of the 13 power stations; hence, it received a rank of 1, based on its lowest average relative NO2 emissions (tons/GWh); see Table 2. This indicates that Matimba produces the most electricity per ton of NO2 emitted. Kriel (in Mpumalanga province) power station was found to be the least efficient, with a rank of 13, and Komati power station had a rank of 12; see Table 2.
In Figure 2, the joint fabric filter and electrostatic precipitators, along with flue gas conditioning, are the abatement technologies associated with the highest efficiency, that is, the lowest values of average relative emissions. The electrostatic precipitators are associated with the lowest efficiency, that is, the highest relative emissions.
Some Eskom power stations, namely, Grootvlei, Camden, and Komati, were mothballed in the late 1980s and 1990s due to excess capacity. These plants were later returned to service in response to high demand for electricity, which Eskom could not meet without load shedding, and to reduce the demand pressure experienced in the already operating power stations. The units in Grootvlei were recommissioned between 2008 and 2011, those in Camden between 2005 and 2008, and those in Komati between 2009 and 2013 [39,40].
4.2. Tests for Autocorrelation and Collinearity
The Durbin–Watson (DW) test statistic and ACF and PACF plots were used to test the presence of autocorrelation in the data.
The DW test was conducted to test the presence of correlation between two successive observations in the data. The computed DW statistic is 1.325, which indicates the presence of autocorrelation in the data. However, after log transforming the response, no autocorrelation was found, since the DW statistic of 1.513 lies between 1.5 and 2.5 [41,42,43,44,45,46].
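For reference, the DW statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below implements this standard definition; the residual sequences are artificial and serve only to show the extreme cases.

```python
from typing import Sequence


def durbin_watson(residuals: Sequence[float]) -> float:
    """Durbin-Watson statistic for a sequence of time-ordered residuals."""
    if len(residuals) < 2:
        raise ValueError("At least two residuals are required.")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Artificial residuals: identical values mimic strong positive autocorrelation.
dw_positive = durbin_watson([1.0, 1.0, 1.0, 1.0])
# Artificial residuals: alternating signs mimic strong negative autocorrelation.
dw_negative = durbin_watson([1.0, -1.0, 1.0, -1.0])
```

The statistic lies between 0 and 4. Values near 2 indicate little first-order autocorrelation, values towards 0 indicate positive autocorrelation, and values towards 4 indicate negative autocorrelation.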
The first results presented in Figure 3 show the ACF and PACF of the linear regression model given in Equation (18), below, fitted before log transformation of the response, i.e., NO2 emissions. The ACF and PACF results for the log-transformed NO2 emissions are given in Figure 4.
Checking for collinearity among paired continuous explanatory variables is crucial prior to fitting a regression model and other models. The presence of such a relationship implies that knowing one variable allows us to predict the other, resulting in both variables attempting to explain the same variability in the response variable.
To detect collinearity, the variance inflation factors
of the explanatory variables are used. If
, it is a cause for concern [
47].
For instance, considering age (in years) and electricity sent out (in GWhs), . This suggests that the two explanatory variables are not significantly dependent on each other.
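With only two continuous explanatory variables, the variance inflation factor reduces to 1/(1 − r²), where r is their Pearson correlation. The sketch below uses this two-variable special case; the age and electricity values are hypothetical and the function names are illustrative.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x: Sequence[float], y: Sequence[float]) -> float:
    """VIF when the model contains exactly two continuous predictors."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical values: station age (years) and electricity sent out (GWh).
age = [10.0, 20.0, 30.0, 40.0]
electricity = [500.0, 300.0, 400.0, 200.0]

r = pearson_r(age, electricity)
vif = vif_two_predictors(age, electricity)
```

A VIF of 1 means the two predictors are uncorrelated, and the value grows without bound as the absolute correlation approaches 1. With more than two predictors, each VIF is obtained from the R² of regressing that predictor on all the others.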
4.3. Model Selection
Since there is no collinearity, one can proceed to select a model that includes only the significant explanatory variables in predicting NO2 emissions (in tons). To determine this, the BIC and AIC are used. Other variables are eliminated at each step from the model in Equation (1) using backward selection.
Backward Selection
The BIC and AIC values are used to determine if a term should be included in the final model or not. The model with the lowest BIC and AIC values is considered the best model.
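The elimination loop can be sketched generically. The code below is a simplified, criterion-driven version of backward selection: at each step it drops the single term whose removal lowers the criterion the most and stops when no removal helps. The scoring function `toy_bic` is a hypothetical stand-in for refitting the model and computing its BIC, and the sketch does not enforce model hierarchy (keeping a main effect whenever its interaction is retained).

```python
from typing import Callable, FrozenSet, Iterable


def backward_select(
    terms: Iterable[str],
    criterion: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Greedy backward elimination that minimises an information criterion."""
    current = frozenset(terms)
    best_score = criterion(current)
    while current:
        # Score every model obtained by dropping exactly one term.
        scored = [(criterion(current - {term}), term) for term in sorted(current)]
        candidate_score, candidate_term = min(scored)
        if candidate_score >= best_score:
            break  # No single removal improves the criterion.
        current = current - {candidate_term}
        best_score = candidate_score
    return current


# Hypothetical criterion: a penalty per term, plus a large cost for every
# informative term that is missing from the model.
INFORMATIVE = frozenset({"electricity", "age", "station"})


def toy_bic(model: FrozenSet[str]) -> float:
    return 100.0 * len(INFORMATIVE - model) + 5.0 * len(model)


selected = backward_select(
    ["electricity", "age", "station", "filter", "month"], toy_bic
)
```

With this toy criterion, the loop removes `filter` and `month` and then stops, because dropping any remaining term would raise the score. In a real analysis, each call to the criterion would involve refitting the model to the data.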
In all three models, namely, the linear regression model assuming Normality of data, the Lognormal distribution model, and the Loggamma distribution model, the final model includes the interaction terms electricity × station and age × station and the explanatory variables electricity sent out (in GWhs), age of the power station (in years), and power station used. All other explanatory variables (i.e., abatement technology and month of emission) and interaction terms from the model in Equation (1) are excluded in the backward selection elimination process. This is performed so that the final model only consists of the significant predictor terms.
The age of the power station is retained as a main effect because the higher-order interaction term age × station is included in the model.
Thus, the final model is given as:
The interaction terms station × filter × month and station × month were excluded from the model since they produced non-convergent results in the software package used. This model gives a vast output, as shown in Table 3. Similar results were produced using regression analysis.
4.4. GLM Parameter Estimation
GLM parameters were estimated using maximum likelihood estimation with Matimba as the basis for comparison since it produced the lowest volumes of average relative NO2 emissions. Hence, it is considered the most efficient.
The parameter estimates for the final Loggamma distribution model in Equation (18) above are given in Table 3. This model includes the interaction terms electricity × station and age × station and the explanatory variables electricity sent out (in GWhs), age of the power station (in years), and power station used.
In Table 3, the MLE coefficient for electricity sent out (in GWhs) is 0.0004, indicating that a 1 Gigawatt-hour increase in electricity sent out will result in a 0.0004 unit increase in log NO2 emissions (in log tons). Similar interpretations can be made for the other estimates, which are also on the log tons scale.
A positive estimate for the plant coefficient indicates that the corresponding power station variable in the model has a greater effect on log NO2 emissions compared to the reference station (Matimba) by the estimated value. Conversely, a negative estimate suggests that the reference station has a greater effect on log NO2 emissions compared to the associated power station by the estimated value. The lowest absolute plant coefficient signifies the least impact on emissions (in log tons of NO2) when considering the other variables in the model, while the highest absolute plant coefficient signifies the greatest impact on log NO2 emissions.
Table 3 presents the power station effect in the Loggamma model in the presence of other variables. Power stations Arnot (Middelburg, Mpumalanga), Grootvlei (Balfour, Mpumalanga), Hendrina, Tutuka (Standerton, Mpumalanga), Camden (Ermelo, Mpumalanga), and Kriel (Kriel, Mpumalanga) had negative parameter estimates, indicating a reducing effect on log emissions compared to Matimba. This happens when interaction effects are allowed. Conversely, power stations like Duvha (Witbank, Mpumalanga), Matla (Kriel, Mpumalanga), Komati, Majuba, Lethabo (in Sasolburg, Free State province), and Kendal (Witbank, Mpumalanga) had positive parameter estimates, suggesting an increasing effect on log emissions compared to the basis station (Matimba). Among them, Arnot had the greatest effect on reducing log emissions (with 1.1679 log tons less than Matimba). On the other hand, Kendal had the largest effect on increasing log emissions (with 1.6175 log tons more than Matimba), followed by Lethabo and then Majuba power stations.
For the interaction term electricity × station, the smallest effect among the 13 power stations comes from electricity × Kendal (only 0.0001 log tons less than Matimba), while the interaction between electricity and the Komati power station has the greatest effect on increasing emissions (0.0057 more log tons when compared to Matimba). Since the interaction between electricity and Komati power station (with Likelihood Ratio 95% Confidence Limits of [0.0049, 0.0064] and a Wald Chi-Square p-value < 0.0001) showed the greatest effect on increasing emissions, the model supports the decision to decommission the power station on the 31st of October 2022 [48]. This power station was found to be the second most inefficient power station, with an efficiency rank of 12 out of 13, after Kriel, based on relative emissions (see Table 2 above).
According to the age × station effect, some power plants, including Komati, Kendal, Camden, Lethabo, and Grootvlei, have an effect on decreasing emissions because their interaction coefficients are negative. Conversely, because their interaction coefficients are positive, power plants such as Tutuka, Arnot, Kriel, Hendrina, Matla, Duvha, and Majuba show an effect on increasing emissions. Of these interactions, Tutuka has the greatest effect on increasing emissions (with 0.0499 more log tons than Matimba) and Komati has the greatest impact on decreasing emissions (with 0.0653 log tons less than Matimba). In general, older plants produce more emissions. However, given its age, Tutuka emits more emissions than expected. Some parameters are not significant, as can be seen in the last column of Table 3, with p-values greater than 0.10.
4.5. Criteria for Assessing Goodness of Fit: Selecting the Best Model
In selecting the best model, plots (histograms, Box plots, Q-Q plots, residuals versus predicted values, and observed values versus predicted values), the deviance (against the degrees of freedom), and the performance indicators (R2 and adjusted R2) are used to identify the best distribution to represent the NO2 emissions data from Eskom’s 13 coal-fuelled power stations.
Equation (18) is the chosen model. The assumption of Normality is now checked to see if it holds for the NO2 emissions data (in tons).
Figure 5 shows some descriptive graphs of the original data, and the last plot shows residuals from ordinary regression analysis of the best model (Equation (18)).
In Figure 5, the histogram of the original data (NO2 emissions) appears to have an asymmetrical shape and is bimodal, indicating a departure from a Normal distribution (Kolmogorov–Smirnov p-value < 0.01). Additionally, the Box plot of the same data reveals some skewness (with a value of −0.11) and kurtosis (with a value of −0.94) in NO2 emissions (in tons).
The Q-Q plot of the original data against Normal distribution quantiles indicates that NO2 emissions (in tons) are not Normally distributed: the plot is not linear, and the data points deviate from the 45° line at the extremities of the graph. The Pearson residuals from the regression also indicate that the model is not a good fit, as the Normality assumption is not met; the Q-Q plot of the residuals against Normal quantiles is not linear. Based on the plots in Figure 5, it can be concluded that NO2 emissions (in tons) do not follow a Normal distribution. Additionally, the plots in the last row of Figure 5 suggest that, under the Normal distribution model (regression analysis) given by Equation (18), the variance increases with the predicted values; hence, the model does not fit well. Consequently, the Lognormal and Loggamma distributions are fitted to try to stabilise the variance.
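The numerical checks behind these plots can be sketched in a few lines. The example below computes the skewness, the excess kurtosis, and a Kolmogorov–Smirnov statistic against a Normal distribution fitted to the sample. It uses a synthetic bimodal sample, not the Eskom data, and the function names are illustrative. The critical value is the usual large-sample approximation, which is conservative when the Normal parameters are estimated from the same data.

```python
import math
import random
from statistics import NormalDist, mean, pstdev


def skewness(x):
    """Sample skewness (third standardised moment)."""
    m, s = mean(x), pstdev(x)
    return sum(((v - m) / s) ** 3 for v in x) / len(x)


def excess_kurtosis(x):
    """Sample kurtosis minus 3, so a Normal sample gives about 0."""
    m, s = mean(x), pstdev(x)
    return sum(((v - m) / s) ** 4 for v in x) / len(x) - 3.0


def ks_statistic_normal(x):
    """Largest gap between the empirical CDF and a Normal CDF fitted to x."""
    fitted = NormalDist(mean(x), pstdev(x))
    xs = sorted(x)
    n = len(xs)
    d = 0.0
    for i, v in enumerate(xs):
        f = fitted.cdf(v)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Synthetic bimodal "emissions" in tons (hypothetical values, not Eskom data).
rng = random.Random(42)
sample = [rng.gauss(3000, 500) for _ in range(1000)]
sample += [rng.gauss(7000, 500) for _ in range(1000)]

d_stat = ks_statistic_normal(sample)
critical_1pct = 1.63 / math.sqrt(len(sample))  # large-sample 1% critical value

print(f"skewness = {skewness(sample):.2f}")
print(f"excess kurtosis = {excess_kurtosis(sample):.2f}")
print(f"KS statistic = {d_stat:.3f}, 1% critical value = {critical_1pct:.3f}")
```

A bimodal sample of this kind can have a skewness close to zero and still fail the Normality test, because the negative excess kurtosis and the large Kolmogorov–Smirnov statistic pick up the two modes.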
Figure 6 below shows the plot of residuals versus predicted values and observed values versus predicted values for both the Lognormal and Loggamma distribution models. The plots are used to compare and assess the goodness of fit of the models.
In the plot of the observed against predicted values in Figure 6, the Lognormal and Loggamma distribution models with the identity link function show a roughly constant variance over the predicted values; hence, the models give a good fit.
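The visual check of constant variance can be complemented by a simple numerical comparison. The sketch below splits the residuals at the median predicted value and compares the residual variance in the upper half with that in the lower half. It is an illustrative diagnostic on simulated data whose spread grows with the mean, not the procedure used in the study, and `variance_ratio` is a hypothetical helper.

```python
import math
import random
from statistics import pvariance


def variance_ratio(predicted, residuals):
    """Residual variance for the upper half of predictions over the lower half."""
    ordered = [r for _, r in sorted(zip(predicted, residuals))]
    half = len(ordered) // 2
    return pvariance(ordered[half:]) / pvariance(ordered[:half])


# Simulated emissions whose standard deviation is proportional to the mean.
rng = random.Random(7)
mu = [rng.uniform(1000, 10000) for _ in range(1000)]
y = [m * math.exp(rng.gauss(0, 0.2)) for m in mu]

# Residuals on the original scale and on the log scale.
raw_ratio = variance_ratio(mu, [yi - mi for yi, mi in zip(y, mu)])
log_ratio = variance_ratio(
    [math.log(m) for m in mu],
    [math.log(yi) - math.log(mi) for yi, mi in zip(y, mu)],
)

print(f"variance ratio, original scale: {raw_ratio:.2f}")
print(f"variance ratio, log scale: {log_ratio:.2f}")
```

A ratio well above 1 on the original scale and close to 1 on the log scale corresponds to the pattern described above: the funnel shape seen under the Normal model disappears once the data are modelled on the log scale.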
To confirm the best link function within each of the distribution models, the deviance is compared to the degrees of freedom (df). Generally, a model with a smaller deviance is preferred [28,29]. A model whose deviance does not exceed its degrees of freedom (deviance/df ≤ 1) suggests a good model fit, while a model with deviance/df > 1 suggests a poor model fit [28,29,49]. For the Normal distribution, however, the deviance cannot be used as a direct goodness-of-fit statistic due to its dependence on the unknown variance parameter [49].
Table 4 shows the model used, the deviance, and the degrees of freedom for both the Lognormal and Loggamma distribution models. Three link functions, namely, the identity, log, and inverse functions, are considered to determine the most suitable one for the data in each of the distribution models.
The Lognormal and Loggamma distribution models both produced good fits to the NO2 emissions data, and they have deviance values that are smaller than their associated degrees of freedom (deviance/df < 1) across the three link functions. The identity link function gave the lowest deviance for the two distribution models in this study and was hence used.
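The comparison of the deviance with the degrees of freedom can be illustrated with a short sketch. It assumes that the Loggamma model is fitted as a Gamma GLM on the log emissions, uses the standard Gamma and Normal deviance formulas, and works on hypothetical fitted values rather than those behind Table 4. The last step shows why the Normal deviance is not directly comparable: it changes with the scale of the response, whereas the Gamma deviance does not.

```python
import math


def gamma_deviance(z, mu):
    """Gamma deviance: 2 * sum((z - mu) / mu - log(z / mu))."""
    return 2.0 * sum((zi - mi) / mi - math.log(zi / mi) for zi, mi in zip(z, mu))


def normal_deviance(z, mu):
    """Normal deviance: the residual sum of squares."""
    return sum((zi - mi) ** 2 for zi, mi in zip(z, mu))


# Hypothetical log emissions and fitted means (not values from Table 4).
log_emissions = [8.1, 8.4, 9.0, 9.3, 9.9, 10.2]
fitted_means = [8.0, 8.5, 9.0, 9.5, 10.0, 10.1]
n_parameters = 2  # assumed number of estimated coefficients
df = len(log_emissions) - n_parameters

gamma_ratio = gamma_deviance(log_emissions, fitted_means) / df

# Multiplying the response and the fitted values by a constant rescales
# the Normal deviance by the square of that constant, not the Gamma one.
k = 1000.0
scaled_z = [k * v for v in log_emissions]
scaled_mu = [k * v for v in fitted_means]
gamma_scaled = gamma_deviance(scaled_z, scaled_mu)
normal_scaled = normal_deviance(scaled_z, scaled_mu)

print(f"Gamma deviance/df = {gamma_ratio:.6f}")
print(f"Gamma deviance after rescaling = {gamma_scaled:.6f}")
print(f"Normal deviance after rescaling = {normal_scaled:.1f}")
```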
To measure the adequacy of the GLM-based Lognormal and Loggamma distribution models, two related performance indicators are used, namely, the variance-function-based R2 and the adjusted R2. The classical R2, which assumes normality of data, is well defined for GLMs with homoscedastic residuals, while the likelihood-based R2, e.g., the R2 by Magee [50], Maddala [51], Cox and Snell [52], and Nagelkerke [53], and the KL-divergence-based R2 by Cameron and Windmeijer [54] require a well-defined likelihood function. However, the variance-function-based R2 and the adjusted R2 by Zhang [55] take into account the relationship between the mean and variance functions and do not require the likelihood function in their computation [55], making them a particularly good choice for the current study.
Table 5 presents the variance-function-based R2 and the adjusted R2 for the Lognormal and Loggamma distributions.
In Table 5, both the Lognormal and Loggamma distributions demonstrate a good fit, with all the performance indicators giving very similar values. For example, both distribution models have an R2 value that is very close to 1, with R2 = 0.9605 for the Lognormal model and R2 = 0.9711 for the Loggamma model. However, it can be noted that the Loggamma distribution model gave better results than the Lognormal distribution model for both measures (R2 and adjusted R2).
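A variance-function-based R2 can be sketched as follows. The example assumes the squared arc-length distance along the variance function, with V(t) = t^2 for the Gamma family, and the usual degrees-of-freedom adjustment for the adjusted R2; both assumptions should be checked against the definitions in [55]. The data are hypothetical, and the function names are illustrative. With a constant variance function the measure reduces to the classical R2, which serves as a sanity check.

```python
import math


def arc_length_gamma(t):
    """Antiderivative of sqrt(1 + V'(t)^2) for the Gamma variance V(t) = t^2."""
    return 0.5 * t * math.sqrt(1.0 + 4.0 * t * t) + 0.25 * math.asinh(2.0 * t)


def arc_length_constant(t):
    """Same quantity for a constant variance function, where V'(t) = 0."""
    return t


def r2_variance_function(y, fitted, arc_length):
    """1 minus the ratio of model distances to null-model distances."""
    y_bar = sum(y) / len(y)
    num = sum((arc_length(a) - arc_length(b)) ** 2 for a, b in zip(y, fitted))
    den = sum((arc_length(a) - arc_length(y_bar)) ** 2 for a in y)
    return 1.0 - num / den


def adjusted_r2(r2, n_obs, n_params):
    """Assumed adjustment: penalise by (n - 1) / (n - p)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_params)


# Hypothetical log emissions and fitted means (not values from Table 5).
y = [8.1, 8.4, 9.0, 9.3, 9.9, 10.2]
fitted = [8.0, 8.5, 9.0, 9.5, 10.0, 10.1]

r2_gamma = r2_variance_function(y, fitted, arc_length_gamma)
r2_classical = r2_variance_function(y, fitted, arc_length_constant)
r2_gamma_adj = adjusted_r2(r2_gamma, n_obs=len(y), n_params=2)

print(f"variance-function-based R2 (Gamma) = {r2_gamma:.4f}")
print(f"adjusted R2 (Gamma) = {r2_gamma_adj:.4f}")
print(f"classical R2 = {r2_classical:.4f}")
```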
5. Discussion
The performance of the GLM-based Loggamma distribution model is compared with that of the traditional Normal-distribution-based simple linear regression model and the GLM-based Lognormal model. The Normal distribution model assumes a constant variance in modelling NO2 emissions from Eskom’s 13 coal-fuelled power stations in South Africa. In essence, this study investigates whether log-transforming the NO2 emissions data and fitting two GLM-based models, i.e., the Normal and the Gamma models, can give a good fit if the assumption of normality of the data does not hold.
The ordinary least squares method is used to estimate the parameters of the regression model. The results of the linear regression model suggest that the NO2 emissions data are not Normally distributed, which is also supported by the histogram, Box plot, and Q-Q plots. The very large deviance relative to the degrees of freedom in the Normal model also suggests that the linear regression model does not give a good fit when modelling the NO2 emissions data. The assumption of Normality of the data does not hold; thus, a linear regression model cannot provide a good fit to the data. This is not surprising, since emissions data, including NO2, commonly violate the Normality assumption [17,18,19,20,21,22,56,57,58], especially when the variance increases with the mean [36].
MLE was used to estimate the parameters of the GLM distribution models. The results indicate that the Lognormal and Loggamma models, both with the identity link function, gave a good fit, with the Loggamma producing the better fit of the two. These findings suggest that the GLM with the Loggamma distribution provides a better framework for predicting NO2 emissions from coal-fuelled power stations at Eskom. Therefore, Loggamma distribution models can cater for a variance that increases with the predicted values (see Figure 6).
Of the two distributions, the Lognormal distribution is common in applications of pollutant emission [59,60,61,62,63], although it is less prevalent in a GLM framework for modelling NO2 emissions from coal-fired power stations. There is little, if any, evidence of the Loggamma distribution’s application in NO2 emission modelling from coal-fired power stations, especially in a GLM setup. Thus, the current study uses these GLM distribution models to close the gap and better explain NO2 emissions data patterns.
The model is able to capture and quantify the behaviour of NO2 emissions, as well as capture the important or significant variables, such as electricity sent out (in GWhs), age of the power station (in years), and power station used, and the interaction terms, such as electricity × station and age × station. High- and low-emission stations are identified, e.g., the power stations Kendal, Komati, and Tutuka are flagged by the parameter estimates for the terms “power station”, “electricity × station”, and “age × station”, respectively, as having the largest effect on increasing emissions, while Arnot, Kendal, and Komati are flagged as having the lowest effect.
Based on mean relative emissions, the abatement technologies associated with the lowest and highest emissions are also identified, with the joint fabric filter and electrostatic precipitators, along with flue gas conditions, exhibiting the highest efficiency, while the electrostatic precipitators demonstrated the highest efficiency.
All the above information can be used to formulate concrete air pollution mitigation strategies, such as identifying which power station should receive a new abatement technology. The information provided by the model can also be used to support regulatory compliance, promoting sustainability in the energy sector.
Therefore, Eskom can use the methods applied in this study to understand other similar power plants and environments not considered herein. The Loggamma modelling approach employed here has not received wide attention in the field of emissions modelling.
Air pollution from industrial areas, including coal-fired power stations, can be managed and regulated. To achieve this, statistical information is needed to inform the decisions and regulations that need to be enforced. In a developing country such as South Africa, economic growth is very important, but this should not come at the expense of our health and the environment. Striking a balance is a challenge. While the country strives to promote eco-friendly power generation, the energy demand keeps rising, compelling Eskom to prolong the operation of aged coal power stations and postpone their permanent decommissioning in order to meet current demand. Statistical models such as the Loggamma model proposed here can help quantify emissions to understand how long the inevitable closure of some of these power plants can be delayed. Stringent policy and management strategies for reducing and mitigating emissions should be informed by models such as the Loggamma. Our results suggest prioritising the electricity sent out, the age of a power station, and/or the least efficient power stations as a starting point while accelerating renewable energy efforts.
Recommendation
The findings of the current study demonstrate the benefits of fitting the Loggamma distribution rather than the Normal distribution in a GLM setup to derive information on emissions when the data are non-Normal and the variance increases with the mean. The Loggamma distribution is a somewhat heavy-tailed distribution, akin to the Lognormal distribution. The Lognormal distribution is in the Gumbel domain of attraction, whereas the Loggamma distribution is of Pareto type [32], with a heavier tail. The findings indicate the potential presence of extremes in the data. As a result, the next stage will be to investigate the tail distribution of the NO2 emissions from Eskom’s coal-fuelled power stations by exploring extreme value theory distributions, namely, the generalised extreme value distribution (GEVD) and the generalised Pareto distribution (GPD). This is crucial for understanding the frequency and intensity of extremely high NO2 emissions events and can provide information for more targeted mitigation strategies for the most severe air pollution scenarios and the power stations responsible for such high emissions. This is especially relevant considering the exacerbated effects of environmental and human exposure to very high NO2 emissions.
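As an indication of what such a tail analysis could involve, the sketch below extracts the excesses over a high threshold and estimates the GPD shape and scale by the method of moments. It is a minimal illustration on simulated data, not a result of this study: the 90th-percentile threshold, the simulated exponential sample, and the function names are all assumptions, and the moment estimator is only valid when the shape parameter is below 0.5.

```python
import random
from statistics import mean, variance


def excesses_over(values, threshold):
    """Amounts by which observations exceed the threshold."""
    return [v - threshold for v in values if v > threshold]


def gpd_method_of_moments(excesses):
    """Moment estimates (shape, scale) of a GPD; valid only when shape < 0.5."""
    m, s2 = mean(excesses), variance(excesses)
    shape = 0.5 * (1.0 - m * m / s2)
    scale = 0.5 * m * (1.0 + m * m / s2)
    return shape, scale


# Simulated exponential data (hypothetical, not Eskom emissions). Excesses of
# an exponential sample follow a GPD with shape 0 and scale equal to the mean.
rng = random.Random(11)
values = [rng.expovariate(1 / 500.0) for _ in range(20000)]
threshold = sorted(values)[int(0.9 * len(values))]  # assumed 90th percentile

shape, scale = gpd_method_of_moments(excesses_over(values, threshold))
print(f"threshold = {threshold:.1f}, shape = {shape:.3f}, scale = {scale:.1f}")
```

A clearly positive shape estimate would point to a Pareto-type tail, whereas a value near zero would point to the Gumbel domain of attraction, which is the distinction drawn above between the Loggamma and Lognormal distributions.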
6. Conclusions
This paper fitted and compared three GLM distribution models, namely, the Normal, Lognormal, and Loggamma distribution models, with the identity link function in the modelling of NO2 emissions at Eskom’s 13 coal-fuelled power stations. Through backward stepwise variable selection, two predictive models were developed, representing the two distributions. The NO2 emissions data have a variance that increases with the mean of the emissions. The residual plots, actual versus predicted plots, and deviance values (against their associated degrees of freedom) confirm that the Loggamma model fits the NO2 emissions data better than both the Normal-distribution-based regression model and the Lognormal GLM.
These results are significant for understanding and predicting NO2 emissions, particularly when considering electricity production from Eskom’s coal-fuelled power stations in South Africa. Using the Loggamma GLM, NO2 emissions can be explained and predicted. This will assist in developing effective strategies to lower air pollution and promote sustainable practices in the energy industry.
The application of these models to other emissions, geographical locations, and power generation facilities can provide valuable information on the generalisability and applicability of the findings. Overall, this study contributes to the enhancement of modelling techniques for NO2 emissions and provides useful information for policymakers and stakeholders involved in air quality management and energy production, especially in the case of Eskom in South Africa.