Statistical Analysis of NO2 Emissions from Eskom’s Majuba Coal-Fired Power Station in Mpumalanga, South Africa

Mamba, Mpendulo Wiseman; Chikobvu, Delson

doi:10.3390/atmos17040415


by Mpendulo Wiseman Mamba ¹,* and Delson Chikobvu ²

¹ Department of Mathematical and Physical Sciences, Central University of Technology, Bloemfontein 9301, South Africa

² Department of Mathematical Statistics and Actuarial Science, University of the Free State, Bloemfontein 9300, South Africa

* Author to whom correspondence should be addressed.

Atmosphere 2026, 17(4), 415; https://doi.org/10.3390/atmos17040415

Submission received: 2 March 2026 / Revised: 11 April 2026 / Accepted: 13 April 2026 / Published: 19 April 2026

(This article belongs to the Special Issue Modeling and Monitoring of Air Quality: From Data to Predictions)


Abstract

Gaseous emissions from coal combustion during electricity generation continue to be a challenge in South Africa. To meet regulatory limits, it is crucial to understand the statistical distribution of such emissions from power generating plants. The current paper characterises the nitrogen dioxide (NO₂) emissions from Eskom’s Majuba coal-fired power station using the quantile–quantile (QQ) plots and derivative plots of three statistical parent distributions, namely, the Weibull, Lognormal, and Pareto distributions. These distributions are fitted and compared according to their tail heaviness, as they cater for data whose tails may be lighter or heavier than that of the Exponential distribution. Of the three distributions evaluated here, the Lognormal gave the best fit for the full body of the data according to the QQ and derivative plots and the goodness-of-fit tools (bootstrap Kolmogorov–Smirnov (KS), Anderson–Darling (AD), Akaike Information Criterion (AIC), Schwarz’s Bayesian Information Criterion (BIC), and the BIC-corrected Vuong test for non-nested distributions). The Lognormal distribution also gave the best fit for the overall upper tail, while a Pareto-type tail was observed for the six largest NO₂ emission observations. The practical implication of a heavy tail like the Pareto is that large NO₂ emissions occur more frequently under it than under lighter tails like those of the Weibull and Lognormal. The methods used in this study give a framework for modelling NO₂ emissions from a coal-fired power station using statistical parent distributions while also taking into account the distribution of the data in the tails, which is mostly ignored when fitting statistical parent distributions. Understanding the distribution of the upper tail is very important, since high, rare emissions are of the most concern and are dangerous to human health and the environment.

Keywords: heavy-tail; Lognormal; nitrogen dioxide (NO₂); derivative plot; quantile–quantile (QQ) plot; Majuba power station; Eskom; coal-fired power plant


1. Introduction

The heavy reliance of modern-day society on electricity implies that any disturbance to its supply severely affects both day-to-day life and the economy [1]. About 40% of electricity around the globe is generated by coal-fired power stations, and in the South African context, this figure is much higher at about 77% [2]. The life span of most coal-fired power stations is around 50 years, depending on how they were designed and constructed. Since 2019, many of these power stations in South Africa (SA) have reached this expected life span [3].

With industrialisation, urbanisation, and population growth, the consumption of energy in SA is rising. Ceteris paribus, this implies an increase in emissions [4,5]. About 50% of Africa’s emissions are produced in South Africa due to its extensive use of coal [6]. On a 20 year projection, emissions are expected to grow by approximately 30%, globally, in the absence of strong mitigation policies [7]. One of the emissions from the burning of coal during power generation is nitrogen dioxide (NO₂). NO₂ emissions are mostly sourced from thermal power stations and automobile exhausts. They are linked to respiratory tract inflammation and can lead to poor quality of health in patients struggling with emphysema, asthma, heart diseases, and other diseases when inhaled [8]. It is for these reasons that understanding and prediction of the behaviour of such emissions from coal-fired power stations is essential for the management and reduction of these emissions, including NO₂. For reaction mechanisms as well as the newest developments in the field, see [9,10].

The application of statistical or probability distributions to predict pollutant concentrations is important for determining the impact of pollutants on human health. These pollutant concentrations are treated as random variables that can be modelled by a positively skewed statistical distribution [11]. No single distribution is designated in advance for modelling a particular pollutant; rather, the choice of a distribution depends on emission levels, meteorological conditions, and geography [12].

In the modelling of parent distribution of air pollutants, Nwaigwe et al. [13] used three distributions, namely, Weibull, Lognormal, and Gamma, to model carbon monoxide (CO) emissions observations in Nigeria for the period of 1996 to 2016. The Weibull outperformed the Lognormal and the Gamma distributions in describing the carbon monoxide data.

A study by Okorie et al. [14] on several sites (Weaverville, California (WVR); Tundra Lab, Niwot Ridge, Colorado (TUN); South Pole, Antarctica (SPO); and Mauna Loa, Hawaii (MLO)) in the United States of America investigated four datasets of surface-level ozone (O₃) and fitted eleven heavy-tailed distributions, namely, the generalized Pareto distribution, generalized extreme value distribution, Pareto type-I distribution, Pareto type-II distribution, Log-Cauchy distribution, Burr distribution, log-logistic distribution, Fréchet distribution, Lognormal distribution, Lévy distribution, and Dagum distribution. The Dagum distribution gave the best fit for Weaverville, the Burr distribution gave the best fit for Niwot Ridge and Mauna Loa, and the Log-Cauchy distribution gave the best fit for the South Pole site.

In another study, on PM₁₀ (particulate matter of size 10 micrometres or less) daily average concentrations collected over an 11-year period in the Guadeloupe archipelago, the Weibull, Lognormal, Burr, and stable distributions were fitted. Mixture models were also considered. In contrast to the studies above, the Burr and Weibull mixture models gave the best fit for both the parent and tail distributions [15].

In another study, daily average PM₂.₅ concentrations collected at Yupparaj Wittayalai school station and City Hall station in Chiang Mai’s Muangwere in Thailand over a period of two years (2016–2018) were analysed by considering the Weibull, Gamma, Lognormal, and Inverse Gaussian distributions. In this study, the Inverse Gaussian distribution gave the best fit for modelling the daily average PM₂.₅ concentrations from both stations [16].

Oguntunde et al. [17] applied three theoretical statistical distributions, namely, the Weibull, Gamma, and Lognormal, to model carbon monoxide (CO) concentrations in their study in Lagos, Nigeria. The Gamma distribution outperformed the Weibull and Lognormal distributions in fitting the data, based on the Anderson–Darling and Kolmogorov–Smirnov tests. The characteristics of the pollutant were determined, and the probabilities of exceeding the set limits were predicted from the best-fitting distribution.

In a study by Giavis et al. [18], the Lognormal, Weibull, and Gamma distributions were fitted to particulate matter with an aerodynamic diameter less than 10 µm (PM₁₀) recorded in Athens, Greece and Manchester, UK. Goodness of fit was assessed for the three distributions with three measures: Mean Bias Error, Root Mean Square Error, and index of agreement. Results showed that, in general, the three distributions can be used to represent the PM₁₀ data. However, the Weibull gave unstable results for the PM₁₀ data, while the most appropriate fit of the data was obtained using the Lognormal distribution.

Another study in Malaysia [19] focused on the distribution of ground level ozone (O₃), one of the major contributors to the Air Pollution Index in Malaysia, by fitting and comparing the two-parameter distributions, the Lognormal and the Gamma, in order to find the best fitting distribution. In this study, the Gamma distribution outperformed the Lognormal in the modelling of the O₃ data.

For high concentrations, however, fitting a distribution to the centre of pollutant emissions data tends to produce a fit that is not representative of the data [20]. In such cases, it is common to employ distributions that capture the behaviour at the extremities. This enhances understanding of the emission pattern, and modelling of high emissions thus becomes very important, since even short exposure to such high emissions can have serious health implications for the population, among other ills [21].

Thus, other studies considered the modelling of air pollutants by fitting multiple distributions that include both parent and extreme distributions. For example, Kan et al. [12] applied the Gamma, Lognormal, Pearson V, and extreme value distributions to daily average concentration data of three pollutants, namely, PM₁₀, SO₂, and NO₂, in Shanghai, China. The Lognormal, Pearson V, and extreme value distributions gave good fits for the PM₁₀, SO₂, and NO₂ data, respectively.

Like the studies above, the current study makes use of statistical parent distributions in the modelling of NO₂ emissions data from Majuba power station. However, unlike the studies mentioned, the current study considers the distribution of the data in the main body as well as in the tails when selecting the best distribution to represent the NO₂ emissions data. Traditionally, parent distributions are used for central fitting of the data, and estimates are based on where the bulk of the data is located, without focusing on the data in the tails or on outliers. The current study aims to use and compare three parent distributions, namely, the Weibull, Lognormal, and Pareto distributions, chosen according to increasing tail heaviness [22], when finding the most representative distribution for the full NO₂ emissions data. Additionally, these distributions are simple, widely utilised, and tail-equivalent (that is, they possess similar asymptotic behaviour) to numerous more complex distributions, which justifies their selection [23]. Albrecher et al. [24] indicate that if the data is suspected to have a heavy upper tail, then the QQ and derivative QQ plots (or derivative plots) of the Weibull, Lognormal, and Pareto distributions offer an alternative for modelling the data, with the Pareto giving a good fit for large claims data, while the Lognormal performs well for medium claims. The Weibull distribution, in turn, is capable of modelling data with tails that are lighter than ($τ > 1$), heavier than ($τ < 1$), or equal to ($τ = 1$) that of an Exponential distribution [22]. The chosen distributions are therefore capable of modelling data with lighter to heavier tails relative to the Exponential distribution. This is important since the Exponential distribution is the basis for the classification of tail heaviness [23,24], and explaining the upper tail distribution of the NO₂ emissions with reference to the Exponential distribution is a good initial step.

Graphical analysis employing multiple plots provides better exploration and analysis of data for a balanced conclusion [24]. QQ and derivative plots of the Weibull, Lognormal, and Pareto distributions are examples of such plots. Some works in the literature have employed at least two of the three distributions for explaining upper tail heaviness of a dataset. For example, Albrecher et al. [24] laid a foundation and introduced the derivative plots of the three distributions with applications to some insurance datasets. Beirlant et al. [25] considered only the Weibull, Lognormal, and Pareto distributions in the tail modelling and classification of a few insurance datasets and highlighted the strength of the derivative plots. Albrecher et al. [26] made use of the derivative plots, among other plots, of the Pareto and Weibull to model the main body and upper tail, respectively, of a few insurance datasets. Jakata et al. [27] compared the Weibull, Lognormal, and Pareto distributions to characterise the tails of the South African Industrial and Financial Indices growth rates, and employed derivative plots in arriving at the most appropriate distribution for their data. In these studies [24,25,26,27], the derivative plots were either used to assist in characterising different datasets or referenced as good diagnostic tools. One benefit of the derivative plot, when partitioning is plausible, is that it allows for piecewise distribution fitting, since it can capture distributional patterns of a dataset across its components. It can also indicate whether a lighter or heavier tailed distribution than the one being investigated would be better suited to the full dataset or its component(s), thus facilitating distribution fitting of a full dataset, including both central and tail fitting. These features can also be very useful in diagnostic checking of a dataset before more sophisticated modelling techniques such as mixture, composite, or extreme value theory distributions are fitted.

In emissions literature, the flexible derivative plot, among the often-utilised QQ, PP, and density plots, is rarely utilised to determine the best suitable distribution for emission concentrations. This paper intends to utilise the derivative plot alongside QQ, PP, and density plots of the Weibull, Lognormal, and Pareto distributions to analyse the full NO₂ emissions data from the Majuba power station, as understanding emissions patterns, in both the bulk and upper tail of the data, can aid in modelling associated environmental risks. Therefore, the study aims to answer the following questions:

Among the Weibull, Lognormal, and Pareto distributions, which distribution(s) best characterises the full/overall and upper tail distribution patterns of NO₂ emissions data from the Majuba power plant?
Can the derivative plot provide additional diagnostic value for distribution selection?

Table 1 summarises the differences in the methods used in past studies and the current one. It mainly highlights improvements in actuarial modelling methods, i.e., derivative plots, with a focus on overall and tail fitting, which the current study proposes for NO₂ emissions dataset from Majuba power station.

Methodological Overview

The paper is organised and follows this order:

(i): First, the data is assessed to check whether the assumptions of independence and stationarity hold before the probability distributions are fitted. If they do not, adjustments are made to the data to satisfy the assumptions.
(ii): The upper tail heaviness is then assessed to determine the appropriateness of the selected distributions for the data, particularly the fitting in the upper tail. The EVI estimates and the generalised QQ plots are used to achieve this.
A good choice of $k$ , the number of exceedances, and, thus, the threshold can be determined by selecting a point or points where two or more of the EVI estimates plots $\hat{γ}$ intersect [24,28]. There is, however, a trade-off between the variance and bias in the selection of $k$ . With higher values of $k$ (lower values of the threshold), bias increases and the variance decreases. Conversely, with lower values of $k$ (higher values of the threshold), bias decreases and the variance increases [29]. As a result, caution should be applied when a $k$ is selected. The purpose of the current step is to only check the suitability of the selected distributions by assessing the upper tail.
(iii): The Weibull, Lognormal, and Pareto distributions are then fitted to the NO₂ emissions data by employing the QQ and corresponding derivative plots of these distributions. For all three distributions, a linear QQ plot and a horizontal derivative plot show that the data belongs to that particular distribution. Convexity in the QQ plot and an increasing derivative plot suggest that a heavier tailed distribution than the one under investigation is a better candidate for that component of the data, while concavity in the QQ plot and a decreasing derivative plot suggest the appropriateness of a lighter tailed distribution than the one investigated for the component. As a result of this flexibility, employing the QQ and derivative plot can allow for piecewise analysis where necessary [24].
(iv): The bootstrap goodness-of-fit tests, cross-validated likelihood, and information criteria (Akaike Information Criterion (AIC) and Schwarz’s Bayesian Information Criterion (BIC)) are used to assess the adequacy of the models for the full NO₂ emissions data, then the BIC-corrected Vuong test for non-nested distributions is used to compare the performance of the distributions used in this study. The BIC-corrected version of the Vuong test is used since it places heavier penalty on model complexity compared to the AIC-corrected and uncorrected versions [30].
(v): In the final step, the BIC and BIC-corrected Vuong test are again used to compare the performance of the three distributions across different values of $k$ to check the stability of the fit in the upper tail of NO₂ emissions data. A consistent distribution across $k$ will indicate stability in distribution choice.

2. Methodology

This section presents the distribution functions of the Weibull, Lognormal, and Pareto distributions, their corresponding quantile–quantile plots, and the derivative plots. Also provided are the estimators of the extreme value index (EVI), $γ$.

2.1. The Shape of the Upper Tail: Extreme Value Index ( $γ$ ) Estimation

The Hill and EPD estimators provide estimates of the extreme value index ($γ$). However, the estimates produced by these methods are limited to $γ > 0$ and do not cater for $γ \leq 0$. As a result, for this study, the generalised Pareto distribution, the generalised Hill, and the Moment estimators are used to determine the shape of the tail before fitting any of the three distributions. These estimators will assist in assessing which distribution is likely to provide a good tail fit. Additionally, to determine a potential candidate distribution for both the central and tail fit, the generalised QQ plot provides a good alternative and will be considered. The estimators are given by the following equations.

Generalised Hill [24]

$\hat{γ}_{k,n}^{GH} = \frac{1}{k} \sum_{j=1}^{k} \log UH_{j,n} - \log UH_{k+1,n} = H_{k+1,n} + \frac{1}{k} \sum_{j=1}^{k} \left(\log H_{j,n} - \log H_{k+1,n}\right),$

(1)

where $UH_{j,n} = X_{n-j,n} H_{j,n}$.
Moment estimator [24,31]

${\hat{γ}}_{k, n}^{M} = H_{k, n} + 1 - \frac{1}{2} {(1 - \frac{H_{k, n}^{2}}{H_{k, n}^{(2)}})}^{- 1},$

(2)

where $H_{k,n}^{(2)} = \frac{1}{k} \sum_{j=1}^{k} \left(\log X_{n-j+1,n} - \log X_{n-k,n}\right)^{2}$ and $H_{k,n}^{2} = \left(H_{k,n}\right)^{2}$.
Generalised QQ plot [24]

To verify the choice of $γ$, the generalised QQ plot, given by the following equation, is used:

$\left(\log \frac{n+1}{k+1}, \log\left(X_{n-k,n} H_{k,n}\right)\right), \quad k = 1, \dots, n-1.$

(3)

The generalised QQ plot allows for any value of $γ$. A horizontal trend suggests that the data belongs to the light-tailed Gumbel domain, with $γ = 0$. A decreasing pattern suggests that the data may have a tail lighter than that of an Exponential distribution, while an increasing pattern indicates a tail heavier than that of the Exponential distribution.
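The moment estimator in Equation (2) can be sketched numerically as follows. This is a minimal illustration in Python (standard library only), not the R routines used in the study; the function names `hill`, `second_log_moment`, and `moment_estimator`, and the synthetic Pareto-quantile sample, are assumptions introduced here for illustration.

```python
import math

def hill(x_sorted, k):
    """H_{k,n}: mean log-excess of the k largest values over X_{n-k,n}.
    x_sorted must be positive and in ascending order, with 1 <= k < n."""
    n = len(x_sorted)
    log_threshold = math.log(x_sorted[n - k - 1])      # log X_{n-k,n}
    top = (math.log(x_sorted[n - j]) for j in range(1, k + 1))  # log X_{n-j+1,n}
    return sum(top) / k - log_threshold

def second_log_moment(x_sorted, k):
    """H^{(2)}_{k,n}: mean squared log-excess over X_{n-k,n}."""
    n = len(x_sorted)
    log_threshold = math.log(x_sorted[n - k - 1])
    return sum((math.log(x_sorted[n - j]) - log_threshold) ** 2
               for j in range(1, k + 1)) / k

def moment_estimator(x_sorted, k):
    """Moment estimator of the EVI, following Equation (2)."""
    h1 = hill(x_sorted, k)
    h2 = second_log_moment(x_sorted, k)
    return h1 + 1.0 - 0.5 / (1.0 - h1 ** 2 / h2)

# Synthetic sample (assumption): exact quantiles of a strict Pareto with
# alpha = 2, so the true extreme value index is gamma = 1 / alpha = 0.5.
n, alpha = 1000, 2.0
sample = [(1 - i / (n + 1)) ** (-1 / alpha) for i in range(1, n + 1)]
gamma_hat = moment_estimator(sample, k=100)
```

Evaluating `moment_estimator` over a range of `k` gives the kind of EVI-versus-`k` curve that is inspected when choosing a threshold.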

2.2. The Exponential Distribution

When modelling the tails of statistical data distributions, the Exponential distribution plays a significant role as the baseline for determining the thickness or thinness of the data’s tail. The Exponential distribution is a special case of the Weibull distribution, obtained when the shape parameter $τ = 1$.

2.3. Weibull Distribution

The Weibull distribution with shape parameter $τ$, scale parameter $λ$, and distribution function

$F(x) = 1 - \exp\left(-λ x^{τ}\right), \quad x > 0,$

(4)

is a first Box–Cox transformation of the Exponential distribution, and for $0 < τ < 1$ it is sub-exponential. When $τ > 1$, the Weibull in Equation (4) is lighter tailed than the Exponential (LTE). Conversely, when $τ < 1$, the Weibull is heavier tailed than the Exponential (HTE). The Weibull distribution has extreme value index $γ = 0$ and thus belongs to the Gumbel domain.

The QQ plot of the Weibull distribution is given as

$\left(\log\left[-\log\left(1 - \frac{i}{n+1}\right)\right], \log X_{i,n}\right), \quad i = 1, \dots, n,$

(5)

and the derivative plot is given as

$\left(\log x_{n-k,n}, \frac{H_{k,n}}{W_{k,n}}\right) \quad \text{or} \quad \left(k, \frac{H_{k,n}}{W_{k,n}}\right),$

(6)

where $W_{k,n} = \frac{1}{k} \sum_{j=1}^{k} \log\log\frac{n+1}{j} - \log\log\frac{n+1}{k+1}$ and $H_{k,n} = \frac{1}{k} \sum_{j=1}^{k} \log X_{n-j+1,n} - \log X_{n-k,n}$. Here $H_{k,n}$ is the Hill estimator of $γ = 1/α$ [32].
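To make the Weibull derivative plot in Equation (6) concrete, the following Python sketch (standard library only) computes $H_{k,n}$, $W_{k,n}$, and their ratio. The function names and the synthetic data are assumptions made for illustration; the study itself used R. For data lying exactly on Weibull quantiles with shape $τ$, the ratio is constant at $1/τ$, which is the horizontal derivative plot described in the text.

```python
import math

def hill(x_sorted, k):
    """H_{k,n} for positive data sorted in ascending order, 1 <= k < n."""
    n = len(x_sorted)
    log_threshold = math.log(x_sorted[n - k - 1])      # log X_{n-k,n}
    return sum(math.log(x_sorted[n - j]) for j in range(1, k + 1)) / k - log_threshold

def weibull_w(n, k):
    """W_{k,n}: the same mean excess, computed on theoretical Weibull QQ abscissae."""
    mean_top = sum(math.log(math.log((n + 1) / j)) for j in range(1, k + 1)) / k
    return mean_top - math.log(math.log((n + 1) / (k + 1)))

def weibull_derivative(x_sorted, k):
    """Ordinate of the Weibull derivative plot at k, as in Equation (6)."""
    return hill(x_sorted, k) / weibull_w(len(x_sorted), k)

# Synthetic sample (assumption): exact Weibull quantiles with tau = 2, lambda = 1.
n, tau = 500, 2.0
sample = [(-math.log(1 - i / (n + 1))) ** (1 / tau) for i in range(1, n + 1)]
ratios = [weibull_derivative(sample, k) for k in (10, 50, 200)]
```

An increasing sequence of ratios would instead point to a heavier tailed candidate than the Weibull, and a decreasing one to a lighter tailed candidate.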

2.4. Lognormal Distribution

The Lognormal distribution is obtained by log-transforming the data and fitting the normal distribution to the transformed data. This is an HTE distribution with parameters $μ$ and $σ$ denoting the mean and standard deviation of the log-transformed data. The distribution function of the Lognormal is

$F(x) = 1 - \frac{1}{σ\sqrt{2π}} \int_{x}^{\infty} \exp\left\{-\frac{(\log u - μ)^{2}}{2σ^{2}}\right\} \frac{du}{u} = Φ\left(\frac{\log x - μ}{σ}\right), \quad μ \in \mathbb{R}, \; σ > 0.$

(7)

The Lognormal distribution has $γ = 0$ and is also heavier tailed than both the LTE and HTE Weibull distributions. The tail of the distribution is given by

$\bar{F}(x) \sim \frac{σ}{\log x \sqrt{2π}} \exp\left\{-\frac{(\log x - μ)^{2}}{2σ^{2}}\right\}, \quad x \to \infty.$

(8)

The Lognormal QQ plot is given as follows:

$\left(Φ^{-1}\left(\frac{i}{n+1}\right), \log X_{i,n}\right), \quad i = 1, \dots, n,$

(9)

where $Φ^{-1}$ denotes the standard normal quantile function. Let $φ$ denote the standard normal density; then the Lognormal derivative plot is given by

$\left(\log x_{n-k,n}, \frac{H_{k,n}}{N_{k,n}}\right) \quad \text{or} \quad \left(k, \frac{H_{k,n}}{N_{k,n}}\right),$

(10)

with

$N_{k,n} = \frac{n+1}{k+1} φ\left(Φ^{-1}\left(1 - \frac{k+1}{n+1}\right)\right) - Φ^{-1}\left(1 - \frac{k+1}{n+1}\right)$

since

$\frac{1}{k} \sum_{j=1}^{k} Φ^{-1}\left(1 - \frac{j}{n+1}\right) - Φ^{-1}\left(1 - \frac{k+1}{n+1}\right) \approx \int_{0}^{1} Φ^{-1}\left(1 - u\frac{k+1}{n+1}\right) du - Φ^{-1}\left(1 - \frac{k+1}{n+1}\right) = N_{k,n}.$

(11)
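The approximation in Equation (11) can be checked numerically. The sketch below, in Python with the standard library's `statistics.NormalDist`, compares the closed form of $N_{k,n}$ with the finite sum it approximates; the function names and the choice $n = 1000$, $k = 100$ are assumptions made for illustration only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def lognormal_n_closed(n, k):
    """N_{k,n} from its closed form: phi(z) / p - z, where p = (k+1)/(n+1)."""
    p = (k + 1) / (n + 1)
    z = STD_NORMAL.inv_cdf(1 - p)
    return STD_NORMAL.pdf(z) / p - z

def lognormal_n_sum(n, k):
    """Left-hand side of Equation (11): average of the top-k normal quantiles
    minus the quantile at the threshold."""
    z = STD_NORMAL.inv_cdf(1 - (k + 1) / (n + 1))
    mean_top = sum(STD_NORMAL.inv_cdf(1 - j / (n + 1)) for j in range(1, k + 1)) / k
    return mean_top - z

n, k = 1000, 100
closed_form = lognormal_n_closed(n, k)
finite_sum = lognormal_n_sum(n, k)
gap = abs(closed_form - finite_sum)  # small when the integral approximation is good
```

The closed form follows from the substitution $v = 1 - u\frac{k+1}{n+1}$, which turns the integral into the upper partial mean of a standard normal variable.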

2.5. Pareto Distribution

In statistics, the Pareto distribution is viewed as the prime example of a heavy-tailed distribution. The distribution function of a strict Pareto with shape and scale parameters $α$ and $β$, respectively, is defined as

$F(x) = 1 - \left(\frac{x}{β}\right)^{-α}, \quad α > 0, \; 0 < β < x,$

(12)

and it is sub-exponential for all values of $α$. Since $\log X$ has an Exponential distribution with rate $λ = α$ (shifted by $\log β$) when $X$ has a strict Pareto($α$) distribution, the Pareto QQ and derivative plots are given by

$\left(-\log\left(1 - \frac{i}{n+1}\right), \log X_{i,n}\right), \quad i = 1, \dots, n,$

(13)

and

$\left(\log x_{n-k,n}, H_{k,n}\right) \quad \text{or} \quad \left(k, H_{k,n}\right),$

(14)

respectively.
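As an illustration of Equation (13), the following Python sketch builds the Pareto QQ plot coordinates and fits a least-squares line through them. For data lying exactly on strict Pareto quantiles, the points fall on a line with slope $1/α$ and intercept $\log β$. The function names and the synthetic sample are assumptions introduced here; plotting is omitted so that only the standard library is needed.

```python
import math

def pareto_qq_points(x_sorted):
    """Pareto QQ coordinates (-log(1 - i/(n+1)), log X_{i,n}) for ascending data."""
    n = len(x_sorted)
    return [(-math.log(1 - i / (n + 1)), math.log(x))
            for i, x in enumerate(x_sorted, start=1)]

def least_squares_line(points):
    """Slope and intercept of the ordinary least-squares line through the points."""
    m = len(points)
    mean_x = sum(p[0] for p in points) / m
    mean_y = sum(p[1] for p in points) / m
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic sample (assumption): exact strict Pareto quantiles, alpha = 2, beta = 3.
n, alpha, beta = 200, 2.0, 3.0
sample = [beta * (1 - i / (n + 1)) ** (-1 / alpha) for i in range(1, n + 1)]
slope, intercept = least_squares_line(pareto_qq_points(sample))
```

With real data, curvature in these points (convex or concave) is what signals a heavier or lighter tailed alternative.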

2.6. Goodness-of-Fit Test

2.6.1. Kolmogorov–Smirnov (KS) and Anderson–Darling (AD) Tests

To assess how well each of the three distributions in this study fits the NO₂ emissions data, the KS and AD tests are used. These tests are used to decide whether or not the data comes from a population with the specified distribution. The goodness-of-fit tests are performed to test the following hypotheses:

H₀: The NO₂ emissions data comes from the specified distribution.

H₁: The NO₂ emissions data does not come from the specified distribution.

If the p-value is smaller than 0.05 (the 5% significance level), then there is strong evidence against the null hypothesis that the NO₂ emissions data comes from the specified distribution. The KS and AD test statistics are defined in [33] by Equations (15) and (16), respectively, as

$D = \max_{1 \leq i \leq N} \left[F(Y_{i}) - \frac{i-1}{N}, \frac{i}{N} - F(Y_{i})\right],$

(15)

and

$A^{2} = -N - \frac{1}{N} \sum_{i=1}^{N} (2i - 1) \left\{\ln[F(Y_{i})] + \ln[1 - F(Y_{N-i+1})]\right\},$

(16)

where $F$ is the theoretical cumulative distribution function of the tested or specified distribution and $Y_{i}$, $i = 1, 2, 3, \dots, N$, are the $N$ ordered NO₂ emissions data points. The AD test is a modification of the KS test that allocates more weight to the tails and is therefore more sensitive to what happens in the tails of the distribution.

The p-values are calculated using the parametric bootstrap because the traditional KS and AD critical values are invalid when the parameters are estimated from the same data [34,35,36,37]. For each of the three distributions (Weibull, Lognormal, and Pareto), 1000 bootstrap samples are generated, the parameters are re-estimated on each sample, and empirical p-values are calculated [35].
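The parametric bootstrap described above can be sketched as follows for the Lognormal case, using the KS statistic of Equation (15). This is a hedged Python illustration with the standard library only, not the R code used in the study. The function names, the synthetic sample, and the `(exceed + 1) / (n_boot + 1)` p-value convention are assumptions; the paper does not state which convention it uses.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

def fit_lognormal(data):
    """Maximum likelihood estimates (mu, sigma) of a Lognormal: moments of log data."""
    logs = [math.log(x) for x in data]
    return fmean(logs), pstdev(logs)

def ks_statistic(data, mu, sigma):
    """KS distance D of Equation (15) against a Lognormal(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, value in enumerate(ordered, start=1):
        f = dist.cdf(math.log(value))
        d = max(d, f - (i - 1) / n, i / n - f)
    return d

def bootstrap_ks_pvalue(data, n_boot=1000, seed=0):
    """Parametric bootstrap p-value: simulate from the fitted model, refit, recompute D."""
    rng = random.Random(seed)
    mu, sigma = fit_lognormal(data)
    d_obs = ks_statistic(data, mu, sigma)
    exceed = 0
    for _ in range(n_boot):
        sim = [rng.lognormvariate(mu, sigma) for _ in data]
        mu_b, sigma_b = fit_lognormal(sim)          # parameters re-estimated each time
        if ks_statistic(sim, mu_b, sigma_b) >= d_obs:
            exceed += 1
    return d_obs, (exceed + 1) / (n_boot + 1)       # assumed p-value convention

# Synthetic data (assumption), sized like the 108 monthly observations.
toy_rng = random.Random(1)
toy = [toy_rng.lognormvariate(8.0, 0.3) for _ in range(108)]
d_obs, p_value = bootstrap_ks_pvalue(toy, n_boot=200)
```

Re-estimating the parameters inside the loop is the step that makes the reference distribution of $D$ valid when the null distribution is fitted rather than fully specified.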

2.6.2. BIC-Corrected Vuong Test

The BIC-corrected Vuong test with the null and alternative hypotheses given as H₀: Both models are equally close to the true distribution against H₁: One model is closer to the true distribution, compares a pair of non-nested distributions. If V, the test statistic, is greater than 0, then Model 1 (the first model) is preferred over Model 2 (the second model) for the data, and if V is less than 0, then Model 2 is preferred [30,38,39,40,41,42]. This test will be used for comparing the three distributions.
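A minimal sketch of how such a statistic could be computed is given below in Python. The paper does not write out the formula, so the form used here is an assumption: the commonly used Schwarz (BIC) correction, in which the log-likelihood ratio is reduced by $\frac{p_1 - p_2}{2}\log n$ and scaled by $\sqrt{n}$ times the standard deviation of the pointwise log-likelihood differences. The function name and the example values are hypothetical.

```python
import math
from statistics import pstdev

def vuong_bic(loglik_1, loglik_2, n_params_1, n_params_2):
    """BIC-corrected Vuong statistic from pointwise log-likelihoods.

    loglik_1, loglik_2: per-observation log-likelihoods under Model 1 and Model 2.
    Positive values favour Model 1, negative values favour Model 2.
    """
    n = len(loglik_1)
    diffs = [a - b for a, b in zip(loglik_1, loglik_2)]
    lr = sum(diffs)                                            # log-likelihood ratio
    correction = 0.5 * (n_params_1 - n_params_2) * math.log(n)  # Schwarz penalty
    omega = pstdev(diffs)                                      # sd of pointwise differences
    return (lr - correction) / (math.sqrt(n) * omega)

# Hypothetical pointwise log-likelihoods for two two-parameter models.
ll_a = [-1.0, -1.2, -0.9, -1.1]
ll_b = [-1.3, -1.1, -1.4, -1.2]
v_ab = vuong_bic(ll_a, ll_b, 2, 2)
v_ba = vuong_bic(ll_b, ll_a, 2, 2)
```

Swapping the two models only changes the sign of the statistic, which is why the direction of the comparison must be stated alongside the value of V.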

2.6.3. Cross-Validated Predictive Likelihood

The K-fold cross-validated likelihood is implemented to assess out-of-sample predictive performance [43,44,45,46,47]. The NO₂ emissions data will be randomly partitioned into $K = 5$ approximately equally sized, mutually exclusive folds. For each fold $i = 1, 2, \dots, 5$:

(1): the remaining $K - 1$ folds (the training set) will be used in the estimation of the maximum likelihood parameters,
(2): the fitted model will then be evaluated on the held-out fold (the testing set),
(3): the cross-validated loglikelihood will be obtained by summing the predictive loglikelihood contributions across all folds.
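The three steps above can be sketched as follows for the Lognormal distribution. This is an illustrative Python version using only the standard library, not the study's R implementation; the function names, the random seed, and the synthetic sample are assumptions.

```python
import math
import random
from statistics import fmean, pstdev

def make_folds(n_obs, n_folds=5, seed=0):
    """Randomly partition indices 0..n_obs-1 into mutually exclusive folds."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def lognormal_logpdf(x, mu, sigma):
    """Log-density of a Lognormal(mu, sigma) at x > 0."""
    z = (math.log(x) - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi) - math.log(x)

def cv_loglik_lognormal(data, n_folds=5, seed=0):
    """Sum of held-out log-likelihoods over all folds."""
    total = 0.0
    for held_out in make_folds(len(data), n_folds, seed):
        held = set(held_out)
        # Step (1): maximum likelihood fit on the other K - 1 folds.
        train_logs = [math.log(x) for i, x in enumerate(data) if i not in held]
        mu, sigma = fmean(train_logs), pstdev(train_logs)
        # Steps (2) and (3): evaluate on the held-out fold and accumulate.
        total += sum(lognormal_logpdf(data[i], mu, sigma) for i in held_out)
    return total

# Synthetic data (assumption), sized like the 108 monthly observations.
toy_rng = random.Random(1)
toy = [toy_rng.lognormvariate(8.0, 0.3) for _ in range(108)]
cv_ll = cv_loglik_lognormal(toy)
```

Repeating the same procedure with the Weibull and Pareto likelihoods, on the same folds, gives directly comparable cross-validated loglikelihoods, with larger values indicating better out-of-sample fit.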

2.6.4. The Akaike Information Criterion (AIC) and Schwarz’s Bayesian Information Criterion (BIC)

The AIC and BIC criteria will be used to rank the performance of the distributions. Lower values of the AIC and BIC are desired. Thus, a distribution with the lowest value will be considered the best fitting distribution model from the three.
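For reference, the two criteria can be computed from a fitted model's maximised loglikelihood as in the short Python sketch below. The definitions are the standard ones; the function names and the numerical inputs are hypothetical and are not results from this study.

```python
import math

def aic(loglik, n_params):
    """Akaike Information Criterion: 2p - 2 * loglikelihood."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Schwarz's Bayesian Information Criterion: p * log(n) - 2 * loglikelihood."""
    return n_params * math.log(n_obs) - 2 * loglik

# Hypothetical maximised loglikelihood for a two-parameter model on 108 observations.
example_aic = aic(-100.0, 2)
example_bic = bic(-100.0, 2, 108)
```

Because $\log n > 2$ for $n \geq 8$, the BIC penalises each additional parameter more heavily than the AIC at the sample size used here.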

3. Results

3.1. Data and Data Decomposition

The current paper uses monthly NO₂ emissions data, given in tons, from Eskom’s Majuba coal-fired power station located in Volksrust, Mpumalanga, South Africa, collected from April 2005 to March 2014. The data used for this study, as shown in Table A1 of Appendix A, consists only of aggregated monthly data. No information is available on the analytical technique (chemiluminescence, FTIR, or NDIR/UV), sampling method (hot–wet or cold–dry), collection frequency (hourly average, daily average, or monthly cumulative), or data quality control processes (internal or external quality assurance/quality control procedures, data calibration and validation, data quality issues or limitations), since the data was obtained as is from the power utility, Eskom. This limitation, which concerns both temporal variability and measurement uncertainty, is acknowledged and should be taken into account in the interpretation of the findings.

The data was analysed in the R statistical software, version 4.4.2 (2024-10-31), with the packages ReIns (1.0.15) and fitdistrplus (1.2-2).

The STL (Seasonal and Trend decomposition using Loess) [48,49] data decomposition is used to explore the presence of trends and seasonality in the NO₂ emissions data, and the results are presented in Figure 1.

Figure 1 shows the STL decomposition of the NO₂ emissions data used in this study. The plot is divided into four components: the actual data, the seasonal component, the trend, and the remainder (irregular component). Let $Y_{t}$ be the raw NO₂ emissions data, $T_{t}$ the trend component, $S_{t}$ the seasonal component, and $R_{t}$ the remainder component; then $Y_{t} = T_{t} + S_{t} + R_{t}$, $t = 1, \dots, n$ [48,50]. The seasonal component is obtained by Loess smoothing of the seasonal sub-series; when a periodic seasonal window is used, this smoothing is effectively replaced by taking the mean of each sub-series. The trend is obtained by removing the seasonal values and smoothing what remains. The overall level is subtracted from the seasonal component and added to the trend component. The remainder component consists of the residuals from the seasonal-plus-trend fit. The advantages of this method can be found in reference [51].

In Figure 1, it can be observed that the data depicts some seasonality (regular patterns). This is supported by the Osborn, Chui, Smith, and Birchenhall (OCSB) test of seasonal unit root (test statistic = −7.0139, 5% critical value = −1.803), which indicates that the data is seasonally stationary [52]. This suggests that the seasonal pattern is likely deterministic, and therefore no differencing is required to de-seasonalise the data. On the other hand, there is a small-scale trend in Figure 1, as evidenced by the Mann–Kendall rank test (p-value = 6.299 × 10⁻⁹ < 0.05), which provides sufficient evidence against the null hypothesis of no trend.

Since the data shows the presence of a trend and seasonality, it is detrended and de-seasonalised to approximate stationarity and weak dependence for the subsequent distributional modelling (Weibull, Lognormal, and Pareto distributions). The resultant dataset is of the form

$Y_{t}^{*} = Y_{t} - T_{t} - S_{t} + M_{t} = R_{t} + M_{t},$

such that

$E(Y_{t}^{*}) = E(Y_{t}) - E(T_{t}) - E(S_{t}) + E(M_{t}) = E(R_{t}) + E(M_{t}) \approx E(M_{t}),$

where $Y_{t} - T_{t} - S_{t} = R_{t}$, and $M_{t}$ is the recentering component, namely, the mean of $Y_{t}$; the last step holds since, in STL, $E(S_{t}) \approx 0$ and $E(R_{t}) \approx 0$. $Y_{t}^{*}$ is thus the detrended, seasonally adjusted NO₂ emissions data [49,53,54].
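The recentering step can be written in a few lines once the trend and seasonal components are available. The Python sketch below assumes those components have already been produced by an STL routine (it does not perform the decomposition itself); the function name and the toy values are hypothetical.

```python
from statistics import fmean

def recentre(y, trend, seasonal):
    """Return Y*_t = Y_t - T_t - S_t + M, with M the mean of the raw series Y_t."""
    m = fmean(y)
    return [yt - tt - st + m for yt, tt, st in zip(y, trend, seasonal)]

# Toy components (assumption): Y_t = T_t + S_t + R_t with a zero-mean remainder.
trend = [1.0, 2.0, 3.0, 4.0]
seasonal = [1.0, -1.0, 1.0, -1.0]
remainder = [0.5, -0.5, -0.5, 0.5]
y = [t + s + r for t, s, r in zip(trend, seasonal, remainder)]
y_star = recentre(y, trend, seasonal)
```

Adding the mean back keeps the adjusted series on the original scale of the emissions, which matters for distributions such as the Weibull, Lognormal, and Pareto that are defined only for positive values.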

3.2. Descriptive Statistics

Table 2 gives the descriptive statistics for the detrended, seasonally adjusted NO₂ emissions data, $Y_{t}^{*}$.

The skewness (0.2359) and kurtosis (3.0447) values suggest a mild positive asymmetry and some departure from normality in the NO₂ emissions data. These observations are confirmed by the QQ plot and histogram in Figure 2. In the upper tail of the QQ plot, the observations lie above the theoretical normality line (red), indicating that the upper tail is indeed heavier than that of a normal distribution. Additionally, the histogram is not symmetric but somewhat right skewed.

3.3. Stationarity and Independence Tests

Table 3 presents the normality, stationarity, and independence tests for the NO₂ emissions data, $Y_{t}^{*}$.

The normality of the NO₂ emissions data is tested by making use of the KS and AD tests in Table 3. Both tests indicate strong evidence against the null hypothesis of normality of the NO₂ emissions data since the p-values are smaller than 0.05.

The two unit root tests employed in this study, namely, the Augmented Dickey–Fuller and Phillips–Perron tests, show strong evidence against the null hypothesis of non-stationarity of the NO₂ emissions data. These are supported by the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test, whose p-value > 0.05 means that the null hypothesis of stationarity of the NO₂ emissions data is not rejected. Similarly, none of the tests in Table 3 used to assess the randomness of the NO₂ emissions data rejects the null hypothesis of independence (randomness), since all have p-value > 0.05 [14].

Given the limited sample size of 108 observations, the stationarity and randomness tests should be read as indicating that the NO₂ emissions data, Y_{t}^{*}, form an approximately stationary time series that may still exhibit residual dependence structures, rather than as definitive proof that the data are independent and identically distributed (iid).
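As an illustration of how a randomness test of this kind works, the sketch below implements the difference-sign test with its usual normal approximation (the count of positive first differences has mean (n - 1)/2 and variance (n + 1)/12 under randomness). Whether this particular test is among those reported in Table 3 is an assumption, and the function name and example series are hypothetical.

```python
import math


def difference_sign_test(x):
    """Difference-sign test of randomness against a monotone trend.

    Returns the number of positive first differences, the z statistic,
    and a two-sided p-value from the normal approximation.
    """
    n = len(x)
    s = sum(1 for a, b in zip(x, x[1:]) if b > a)
    mean = (n - 1) / 2
    var = (n + 1) / 12
    z = (s - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return s, z, p


# Hypothetical example: a strictly increasing series is clearly not random.
s_stat, z_stat, p_value = difference_sign_test(list(range(50)))
print(s_stat, z_stat, p_value)
```

A p-value above 0.05 only means that this test found no evidence of a trend; as noted above, it does not prove independence, and the test is insensitive to other forms of dependence such as regular oscillation.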

3.4. Shape of the Tail

Figure 3 presents the EVI estimates given by the generalised Hill, the moment, and the PoT estimators. These estimators are chosen because, unlike the Hill and EPD estimators, which accommodate only positive gamma values (γ > 0), they can also account for values of gamma smaller than or equal to zero (γ \leq 0), since emission data is not always heavy-tailed. The Hill and EPD estimators are also included for comparison purposes. Also included in Figure 3 is the generalised QQ plot.

In Figure 3, all the EVI estimators (the generalised Hill, the moment and the PoT estimators) give values that are less than or equal to zero (\hat{γ} \leq 0) for all but the smallest values of k, the number of values in the upper tail. However, at the top six largest NO₂ emissions observations, heavy-tailed behaviour such as that of the Pareto is suggested, with the generalised Hill, the moment and the PoT estimates greater than 0 (\hat{γ} > 0) in this region. Additionally, the generalised QQ plot decreases, then is approximately constant, and then increases at the top six largest observations. This supports the observation from the EVI estimates that the tail may be lighter or heavier depending on the choice of k. For example, if the number of values in the upper tail is six or fewer, k \leq 6, then the tail is heavy and of Pareto type; otherwise, if k > 6, a light-to-moderately heavy-tailed distribution, such as one with a Gumbel-type tail, may be appropriate for the upper tail. Thus, the Weibull and Lognormal distributions are good candidates to model the main body of the NO₂ emissions data, whereas the Pareto distribution is a good candidate solely for the very extreme upper tail (k \leq 6), and not for the main body of the data.
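To make the role of k concrete, the following sketch computes two of the estimators named in this section from the k largest observations: the classical Hill estimator and the moment estimator of Dekkers, Einmahl and de Haan. It is a minimal illustration under stated assumptions (strictly positive data without ties among the largest values, and 2 ≤ k < n), not the implementation used to produce Figure 3; the function name and the simulated Pareto sample are hypothetical.

```python
import math
import random


def hill_and_moment(x, k):
    """Hill and moment EVI estimates based on the k largest observations."""
    xs = sorted(x)
    n = len(xs)
    if not 2 <= k < n:
        raise ValueError("k must satisfy 2 <= k < n")
    log_threshold = math.log(xs[n - k - 1])               # log X_{n-k,n}
    log_exc = [math.log(v) - log_threshold for v in xs[n - k:]]
    h1 = sum(log_exc) / k                                 # Hill estimator
    h2 = sum(e * e for e in log_exc) / k                  # second log-moment
    moment = h1 + 1.0 - 0.5 / (1.0 - h1 * h1 / h2)        # moment estimator
    return h1, moment


# Simulated Pareto sample with tail index alpha = 2, so the true EVI is 0.5.
rng = random.Random(1)
alpha = 2.0
sample = [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(5000)]

hill_est, moment_est = hill_and_moment(sample, k=500)
print(hill_est, moment_est)
```

Repeating the call over a range of k values gives the kind of estimate-versus-k curve shown in an EVI plot. The Hill estimator is only meaningful for γ > 0, whereas the moment estimator can also return values at or below zero.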

3.5. Distribution of the Data

Figure 4 illustrates the additional diagnostic value of the derivative plot when assessing the goodness-of-fit of data to statistical distributions. The derivative plot for the Weibull distribution shows an increasing behaviour, indicating that a distribution with a heavier tail than the Weibull is appropriate for modelling the NO₂ emissions data. This is supported by the shape of the Weibull QQ plot, which shows an overall upward curve.

In Figure 5, the histogram, PP, QQ, and derivative plots indicate that the Lognormal is a good fit for most of the NO₂ emissions data. In the upper tail, a small-scale increasing pattern is observed in the derivative plot (at the six largest observations), suggesting a small deviation from the Lognormal towards heavier tailed behaviour. The QQ plot likewise shows a good fit, with only minimal deviations from the straight line in the upper tail. Overall, the Lognormal distribution is a good fit for the data, although the largest six observations are consistent with Pareto-type tail behaviour.

All diagnostic plots in Figure 6 show that the data deviate significantly from the Pareto distribution. The PP plot is not linear. Additionally, the Pareto derivative plot has a decreasing pattern and the QQ plot is concave, implying that a distribution with a lighter tail than the Pareto may be a good fit for the data. This is not surprising since it has already been observed that the Lognormal derivative plot is largely constant. However, at the largest six points, the Pareto derivative plot takes a somewhat constant shape.

In this study, it is important to note that the Pareto distribution is intended mainly as a benchmark for tail behaviour rather than as a genuine full-distribution competitor.
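The QQ-plot diagnostic used in this section can be sketched as follows for the Lognormal case: the ordered log-observations are paired with standard normal quantiles at plotting positions i/(n + 1), and an approximately straight line indicates a good fit. The plotting positions, the use of a correlation coefficient as a crude linearity summary, and the simulated sample are assumptions made for illustration; the paper's own plots and derivative plots were produced with dedicated software.

```python
import math
import random
from statistics import NormalDist, fmean


def lognormal_qq(x):
    """Return (theoretical normal quantiles, ordered log-observations)."""
    xs = sorted(x)
    n = len(xs)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    empirical = [math.log(v) for v in xs]
    return theoretical, empirical


def pearson_r(a, b):
    """Pearson correlation, used here only as a rough linearity summary."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


# Simulated Lognormal sample of the same size as the study data (n = 108);
# the parameters 9.2 and 0.15 are hypothetical.
rng = random.Random(0)
sample = [rng.lognormvariate(9.2, 0.15) for _ in range(108)]

theo, emp = lognormal_qq(sample)
r = pearson_r(theo, emp)
print(r)
```

A correlation close to 1 is consistent with linearity, but it does not replace visual inspection of the upper tail, where the deviations discussed above occur.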

3.6. Goodness-of-Fit Test of the Data

Table 4 presents the bootstrap goodness-of-fit tests (KS and AD) and information criteria (AIC and BIC) for assessing how well a distribution fits the NO₂ emissions data from Eskom’s Majuba power station.

The BIC-corrected Vuong test, with the null hypothesis H₀ that both models are equally close to the true distribution, is used to compare pairs of non-nested distributions in the following order: Lognormal vs. Weibull, Lognormal vs. Pareto, and Weibull vs. Pareto. The former of each pair is considered Model 1, while the latter is Model 2 [30,38,39,40,41,42].
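A minimal sketch of the test statistic follows. It takes the pointwise log-likelihoods of two fitted models, subtracts the Schwarz (BIC) correction (k₁ - k₂)/2 · ln n from the log-likelihood ratio, and standardises by √n times the standard deviation of the pointwise differences. The two-sided normal p-value, the function name, and the toy log-likelihood values are assumptions for illustration and are not taken from Table 5.

```python
import math
from statistics import fmean


def vuong_bic(loglik1, loglik2, k1, k2):
    """BIC-corrected Vuong statistic V and two-sided normal p-value.

    loglik1, loglik2: pointwise log-likelihoods under Model 1 and Model 2.
    k1, k2: number of estimated parameters in each model.
    V > 0 favours Model 1; V < 0 favours Model 2.
    """
    n = len(loglik1)
    m = [a - b for a, b in zip(loglik1, loglik2)]
    m_bar = fmean(m)
    omega = math.sqrt(sum((v - m_bar) ** 2 for v in m) / n)
    lr_adj = sum(m) - 0.5 * (k1 - k2) * math.log(n)
    v = lr_adj / (math.sqrt(n) * omega)
    p = math.erfc(abs(v) / math.sqrt(2))
    return v, p


# Hypothetical pointwise log-likelihoods (100 observations).
ll_model1 = [-1.0, -1.2, -0.9, -1.1] * 25
ll_model2 = [-1.5, -1.3, -1.6, -1.2] * 25

v_stat, p_value = vuong_bic(ll_model1, ll_model2, k1=2, k2=2)
print(v_stat, p_value)
```

When both models have the same number of parameters, as with the two-parameter distributions compared here, the correction term vanishes and the statistic reduces to the uncorrected Vuong statistic.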

In Table 4, both the bootstrap KS and AD tests give p-values greater than 0.05 for the Lognormal distribution, so the null hypothesis that the data follow the Lognormal distribution is not rejected. Among the three distributions, the Lognormal yielded the lowest test statistics and the highest p-values, indicating that it is the most suitable of the three candidates for modelling the NO₂ emissions data. This is consistent with the Lognormal distribution having the lowest AIC (1819.31) and BIC (1824.67) values. The BIC-corrected Vuong test in Table 5 shows that the Lognormal distribution significantly outperforms both the Weibull and Pareto distributions: with the Lognormal as Model 1, V > 0 with a p-value below 0.05 in both comparisons. The Lognormal distribution also achieves the highest cross-validated log-likelihood in Table 4, indicating better predictive performance than the Weibull and Pareto distributions. However, the cross-validated log-likelihood should be interpreted cautiously and only alongside the other criteria.

Table 6 presents the maximum likelihood (ML) parameter estimates of the three distributions. The very large values of the shape and scale parameters for the Pareto distribution, together with the analysis software’s failure to produce standard errors for them, indicate that this distribution is inappropriate for central fitting.
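For the Lognormal distribution, the ML estimates and information criteria have closed forms, which the sketch below illustrates. The sample is the first column of Table A1 (raw emissions in tons), used purely as example input; the paper fits the transformed series Y_{t}^{*}, so the output here is not comparable with Tables 4 and 6. The function name is hypothetical.

```python
import math
from statistics import fmean


def fit_lognormal(x):
    """Closed-form Lognormal ML fit with log-likelihood, AIC and BIC."""
    logs = [math.log(v) for v in x]
    n = len(logs)
    mu = fmean(logs)
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / n)  # ML (divides by n)
    loglik = (-sum(logs) - n * math.log(sigma)
              - 0.5 * n * math.log(2 * math.pi) - 0.5 * n)
    n_params = 2
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n) - 2 * loglik
    return mu, sigma, loglik, aic, bic


# First column of Table A1 (raw data), for illustration only.
sample = [7951, 7955, 8028, 7844, 6369, 7470,
          8538, 7750, 6996, 9076, 8131, 6146]

mu, sigma, loglik, aic, bic = fit_lognormal(sample)
print(mu, sigma, loglik, aic, bic)
```

For any two-parameter model, BIC - AIC = 2(ln n - 2). With n = 108 this is about 5.36, which agrees with the difference between the reported Lognormal values (1824.67 - 1819.31).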

3.7. Tail Selection

It has already been determined that the Lognormal distribution outperformed the Weibull and Pareto distributions in modelling the full body of the data. In this section, the three distributions are compared for fitting in the tail of the NO₂ emissions data. As observed earlier in the EVI estimates and QQ and derivative plots, the largest six values demonstrated a Pareto-type behaviour. However, for reliable results, the BIC-corrected Vuong test is performed for larger values of k (\geq 11). Table 7 presents the results for the test.

The results in Table 7 indicate that, of the three distributions used in this paper, the Lognormal distribution is the most plausible for modelling the upper tail of the NO₂ emissions data: across all k values, its BIC values are the smallest, and V > 0 with p-value < 0.05 for the BIC-corrected Vuong test where the Lognormal is the first model (Model 1). The consistency of the BIC, the statistic V and the p-values in favouring the Lognormal distribution across k signifies stability, which matters because data can show different tail behaviours owing to the effects of sample size and composition [58].

The Pareto-type behaviour, that is, the potential presence of extremes, was suggested by the EVI, QQ and derivative plots for the top six largest observations (k \leq 6), indicating potential extreme upper tail risk. The EVI estimates in this region, however, rest on only a few data points and are very sensitive to them, which leads to uncertainty. As the upper tail sample size k increases (k > 6), the estimates become more stable and the tail behaviour moves closer to that of a Lognormal distribution. This demonstrates the well-known bias–variance trade-off: as the threshold increases, the number of observations is reduced, resulting in more variance and less bias, whereas a lower threshold results in more observations, a drop in variance, and an increase in bias.

The goal is to model the overall, more stable upper tail of the NO₂ emissions data, which is better modelled by the Lognormal distribution and corresponds to typical exceedances of a certain value in risk assessment practice. However, for very extreme NO₂ emissions values, a Pareto-type behaviour is suggested, corresponding to worst-case scenarios in practice. This tail behaviour is common in applications; see Bee et al. [58] and references therein. In theory, the fitted Lognormal or Pareto distributions can be used to estimate the probability of exceeding a regulatory limit, T, such as those set by the World Health Organisation or the South African National Air Quality Standards, by calculating P(X > T). This is, however, beyond the scope of the current paper.
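Although the calculation is outside the paper's scope, its form under a fitted Lognormal distribution is simple: P(X > T) = 1 - Φ((ln T - μ)/σ). The sketch below shows this with hypothetical parameter values and a hypothetical limit; none of these numbers come from the paper, and the limit must be expressed in the same units as the fitted data.

```python
import math
from statistics import NormalDist


def lognormal_exceedance(limit, mu, sigma):
    """P(X > limit) when ln X is normal with mean mu and std. dev. sigma."""
    z = (math.log(limit) - mu) / sigma
    return 1.0 - NormalDist().cdf(z)


# Hypothetical log-scale parameters and limits, for illustration only.
mu_hat, sigma_hat = 9.25, 0.15
median = math.exp(mu_hat)

p_median = lognormal_exceedance(median, mu_hat, sigma_hat)
p_high = lognormal_exceedance(1.3 * median, mu_hat, sigma_hat)
print(p_median, p_high)
```

The exceedance probability at the median is one half by construction and decreases as the limit rises. For limits far into the upper tail, a Lognormal fit may understate the probability if the extremes are in fact Pareto-type, as discussed above.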

4. Discussion

The current paper fitted and compared three statistical parent probability distributions, namely, the Weibull, Lognormal, and Pareto distributions, to the transformed monthly NO₂ emissions data from Majuba power station. These distributions cater for varying tail heaviness, are simple (with only two parameters), and possess similar asymptotic behaviour to many more complex distributions [23]. The aim is to select, from the three evaluated, the distribution(s) most suitable for modelling NO₂ emissions data from Majuba power station across all levels of emission.

The constant-to-increasing shape of the generalised QQ plot and the negative-to-positive EVI estimates suggested the appropriateness of a light- to heavy-tailed distribution depending on the choice of k, the number of exceedances over a selected threshold. The derivative plot for the Weibull shows an overall increasing behaviour, while that for the Pareto distribution shows a decreasing behaviour. This indicates that a distribution with a tail lighter than the Pareto but heavier than the Weibull, in our case the Lognormal, is appropriate for modelling the full body of the NO₂ emissions data. The histogram, PP, and QQ plots show that the Lognormal provided the best fit among the three candidate models considered for the full body of the NO₂ emissions data. This is supported by the bootstrap KS and AD tests, the highest cross-validated log-likelihood values, and the lowest AIC and BIC values, and is confirmed by the BIC-corrected Vuong test.

As in the modelling of the full body of the data, the BIC and BIC-corrected Vuong test were again used to compare the three distributions in the upper tail, and the Lognormal proved to be a consistent fit in the upper tail across different k values. This indicates the stability of the chosen distribution in the upper tail. However, as shown in the EVI estimates and derivative plots, a Pareto behaviour is evident in the top six observations and cannot be ignored. The presence of such extreme emissions requires careful consideration, as extremes are inherently rare yet possess significant implications. A few extremes could determine whether the regulatory limit results in a manageable consequence or one that could financially incapacitate the company.

The use of the Lognormal in modelling emissions is common; see [59,60,61,62,63,64]. However, these studies did not pay particular attention to tail heaviness based on an ordering of the distributions used, and did not apply the very flexible derivative plot to obtain the best fitting distribution. This is the novelty of the current study.

The heavy reliance of modern-day society in SA on electricity implies that any disturbance to its supply severely affects society’s day-to-day life and the economy. Since 2019, many of the power stations in SA have reached the end of their expected life span. However, decommissioning is difficult due to the limited renewable energy alternatives and the ever-increasing electricity demand, implying an increase in emissions. For these reasons, understanding and quantifying the behaviour of emissions from coal-fired power stations is important for their management and reduction. Graphical plotting techniques through the application of QQ and derivative plots of the Weibull, Lognormal, and Pareto distributions presented in this study offer a good option for quantifying emissions. This paper proposes the Lognormal distribution for the full dataset and the upper tail. However, for the very top six largest observations in the upper tail, a Pareto-type tail is evident. This result is common in applications; see Bee et al. [58] and references therein. These findings may assist in understanding the patterns in the distribution of NO₂ emissions. The practical implication of a heavy tail like the Pareto is that it models more frequent large-magnitude NO₂ emissions compared to lighter tails like the Weibull and Lognormal tails [23].

4.1. Limitations

Since the data showed the presence of a trend and seasonality, it was detrended and de-seasonalised based on the results of the STL decomposition. The remainder was then centred to approximate independent and identically distributed (iid) data while maintaining the original scale of the data through M_{t} (the mean of Y_{t}). Nevertheless, the limited sample size may compromise the data’s independence and identical distribution, despite the randomness tests indicating otherwise. The inability to reject the null hypothesis of randomness/independence should not be interpreted as conclusive evidence that the NO₂ emissions data are iid, but rather as support for treating the data as an approximately stationary time series that might display residual dependence structures. Furthermore, the variability arising from the STL decomposition is not considered in distribution fitting, as this decomposition is a smoothing-based technique that does not inherently address uncertainty in the subsequent modelling phases. Therefore, the parametric models employed in this study should not be interpreted as exact representations assuming iid data, but rather as approximations of the marginal distribution of the transformed dataset. However, where weak and short-range dependence is present, the current approach is practical and commonly used in environmental applications. The smoothing process is integrated into the software and operates automatically within the statistics package used.

The data used for this study consists only of aggregated monthly data, and the analysis is thus conducted at this summed level. No details regarding the analytical technique (chemiluminescence, FTIR, or NDIR/UV), sampling method (hot–wet or cold–dry), collection frequency (hourly average, daily average, or monthly cumulative), and data quality control processes (internal or external quality assurance/quality control procedures, data calibration and validation, data quality issues or limitations) are accessible, as the data was sourced from the power utility, Eskom, in its original form. This limitation should be taken into account in the interpretation of the findings regarding time-related changes and measurement uncertainty.

This paper is limited to the statistical characterisation of NO₂ emissions. Its robustness and generalisability may be enhanced in future studies by incorporating (1) dispersion modelling, (2) the calculation of the probability of exceeding regulatory limits, (3) data from other Eskom power stations, and (4) meteorological data, since emissions are strongly related to meteorological conditions, e.g., rainfall events.

Derivative plots have the limitation that they are interpreted qualitatively (based on visual assessment), which introduces subjectivity into their analysis. For example, in larger samples/components, the patterns (increasing, decreasing or constant trend) in the plot are clear, while for smaller samples/components, it may be difficult to decide on the direction of the slope. Therefore, they are used as a qualitative diagnostic in conjunction with quantitative tools, rather than as independent decision rules.

To facilitate their interpretation and ensure objectivity, in this study, they are interpreted together with quantitative tools such as the AIC, BIC, bootstrap goodness-of-fit tests (KS and AD), cross-validated log-likelihood, EVI estimates, and Vuong test. Additionally, various plots, mainly the QQ plot, along with PP and density plots, are used to support the derivative plot and arrive at a balanced conclusion [24]. Subjectivity is further reduced by using the k values in the derivative plots instead of the individual data points.

4.2. Future Studies

Future studies may consider more sophisticated approaches, such as explicit time series analysis models or models that cater for distributional properties and temporal structure (also catering for dependence).

For tail fitting, a more relevant family of distributions from the extreme value framework focusing on only the extreme values (very high or low values) may be considered, namely, the GEVD and GPD models. These distributions are more capable of handling such extremes at the expense of losing information provided by statistical parent distributions such as the Lognormal. However, the methods employed here lay a good foundation before fitting these extreme value distributions.

5. Conclusions

The aim of the current study was to find the most suitable distribution(s) to represent NO₂ emissions data from Majuba power station. This was done by comparing the derivative plots together with the QQ, PP, and density plots of three distributions with varied tail heaviness, namely, (1) the Weibull, which is lighter or heavier tailed than the Exponential distribution if τ > 1 or τ < 1, respectively, and two heavy-tailed distributions, (2) the Lognormal and (3) the Pareto.

The derivative plot offers the benefit of piecewise distribution fitting, while the methodologies utilised in this paper provide a probabilistic framework for central and tail fitting of parent distributions to emissions data, in particular NO₂ emissions from power utilities like Eskom. These techniques can be used to enhance the assessment of emission-related risk and future exceedance probability estimation when jointly used with health thresholds, atmospheric dispersion, or policy scenarios. The study also gives a good foundation before other, more sophisticated methods or distributions are considered, such as fitting the generalised extreme value distribution (GEVD) and/or the generalised Pareto distribution (GPD). These latter methods focus exclusively on the tail distribution of the data.

Author Contributions

Writing the original draft of this manuscript, M.W.M.; review, editing, and supervision, D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Emissions of nitrogen dioxide (NO₂) in tons.

7951	7791	9381	10,449.26779	9284.955938	12,111.88881	11,557.36505	9580	13,005
7955	10,410	11,503	11,408.34552	10,740.62565	12,544.99748	11,846.01536	11,712	13,257
8028	7334	12,106	9668.793759	9779.458636	11,123.66211	11,277.75171	11,112	12,930
7844	8478	11,356	11,638.42621	10,078.26273	11,017.08322	10,416	10,149	12,844
6369	10,547	13,219	11,547.9462	9286.892432	10,370.19177	11,261	12,903	10,808
7470	7999	9837	13,179.70739	9618.821848	9522.406988	12,277	12,829	11,329
8538	8394	10,253	10,427.70967	10,153.32493	10,406.49334	11,459.64895	13,075	10,999
7750	6301	10,221	8904.369469	9943.661032	11,380.86745	10,686.1989	13,739	10,641
6996	5400	11,738	9196	10,864.4793	11,687.01793	13,027.66605	10,412	9670
9076	8822	10,214	9819.359517	11,452.04641	11,776.01469	13,565.87818	10,743	9293
8131	8560	10,240	9582.87134	12,242.48433	10,611.65925	10,540.09094	9301	10,445
6146	8916	12,767	13,923.24532	12,189.72768	10,855.8829	10,706.92656	13,078	11,511

References

1. Pollet, B.G.; Staffell, I.; Adamson, K.-A. Current Energy Landscape in the Republic of South Africa. Int. J. Hydrogen Energy 2015, 40, 16685–16701.
2. Nkambule, N.P.; Blignaut, J.N. Externality Costs of the Coal-Fuel Cycle: The Case of Kusile Power Station. S. Afr. J. Sci. 2017, 113, 9.
3. Nogaya, G.; Nwulu, N.I.; Gbadamosi, S.L. Repurposing South Africa’s Retiring Coal-Fired Power Stations for Renewable Energy Generation: A Techno-Economic Analysis. Energies 2022, 15, 5626.
4. Shikwambana, L.; Mhangara, P.; Mbatha, N. Trend Analysis and First Time Observations of Sulphur Dioxide and Nitrogen Dioxide in South Africa Using TROPOMI/Sentinel-5 P Data. Int. J. Appl. Earth Obs. Geoinf. 2020, 91, 102130.
5. Mukwevho, P.; Retief, F.; Burger, R.; Moolna, A. Identifying Critical Assumptions and Risks in Air Quality Management Planning Using Theory of Change Approach. Clean Air J. 2024, 34, 16571.
6. Boden, T.A.; Marland, G.; Andres, R.J. Global, Regional, and National Fossil-Fuel CO₂ Emissions. In Carbon Dioxide Information Analysis Center (CDIAC) Datasets; U.S. Department of Energy Office of Scientific and Technical Information: Oak Ridge, TN, USA, 2010.
7. Foster, E.; Contestabile, M.; Blazquez, J.; Manzano, B.; Workman, M.; Shah, N. The Unstudied Barriers to Widespread Renewable Energy Deployment: Fossil Fuel Price Responses. Energy Policy 2017, 103, 258–264.
8. Monn, C. Exposure Assessment of Air Pollutants: A Review on Spatial Heterogeneity and Indoor/Outdoor/Personal Exposure to Suspended Particulate Matter, Nitrogen Dioxide and Ozone. Atmos. Environ. 2001, 35, 1–32.
9. Li, Z.; Yu, Y.; Jia, L.; Wu, Y.; Cheng, P.; Zhang, Z.; Li, Z.; Fan, C.; Guo, X. Thermal Characteristic Analysis and Performance Optimization of a Novel Heating Boiler Based on a Porous Media Model. Appl. Therm. Eng. 2026, 289, 130035.
10. Cheng, P.; Li, Z.; Zheng, Y.; Meng, Q.; Yu, Y.; Jin, Y.; Gao, X.; Guo, X.; Jia, L. Study on the Regulation of Performance and Hg0 Removal Mechanism of MIL-101(Fe)-Derived Carbon Materials. Sep. Purif. Technol. 2025, 379, 134939.
11. Marchant, C.; Leiva, V.; Cavieres, M.F.; Sanhueza, A. Air Contaminant Statistical Distributions with Application to PM10 in Santiago, Chile. In Reviews of Environmental Contamination and Toxicology; Springer: New York, NY, USA, 2013; pp. 1–31.
12. Kan, H.-D.; Chen, B.-H. Statistical Distributions of Ambient Air Pollutants in Shanghai, China. Biomed. Environ. Sci. 2004, 17, 366–372.
13. Nwaigwe, C.C.; Ogbonna, C.J.; Achem, O. On the Modeling of Carbon Monoxide Flaring in Nigeria. Int. J. Stat. Probab. 2018, 7, 94.
14. Okorie, I.E.; Akpanta, A.C.; Osu, B.O. Flexible Heavy Tail Distributions for Surface Ozone for Selected Sites in the United States of America. Ozone Sci. Eng. 2019, 41, 473–488.
15. Plocoste, T.; Calif, R.; Euphrasie-Clotilde, L.; Brute, F.-N. The Statistical Behavior of PM10 Events over Guadeloupean Archipelago: Stationarity, Modelling and Extreme Events. Atmos. Res. 2020, 241, 104956.
16. Intarapak, S.; Supapakorn, T. Investigation on the Statistical Distribution of PM2.5 Concentration in Chiang Mai, Thailand. WSEAS Trans. Environ. Dev. 2021, 17, 1219–1227.
17. Oguntunde, P.E.; Odetunmibi, O.A.; Adejumo, A.O. A Study of Probability Models in Monitoring Environmental Pollution in Nigeria. J. Probab. Stat. 2014, 2014, 864965.
18. Giavis, G.M.; Kambezidis, H.D.; Lykoudis, S.P. Frequency Distribution of Particulate Matter (PM10) in Urban Environments. Int. J. Environ. Pollut. 2009, 36, 99.
19. Hamid, H.A.; Jaffar, I.; Raffee, A.F. Two-Parameter Central Fitting Distribution to Predict the Concentration of Ground Level Ozone: Case Study in Industrial Area. AIP Conf. Proc. 2018, 2013, 020055.
20. Lu, H.C.; Fang, G.C. Predicting the Exceedances of a Critical PM10 Concentration—A Case Study in Taiwan. Atmos. Environ. 2003, 37, 3491–3499.
21. Martins, L.D.; Wikuats, C.F.H.; Capucim, M.N.; de Almeida, D.S.; da Costa, S.C.; Albuquerque, T.; Barreto Carvalho, V.S.; de Freitas, E.D.; de Fátima Andrade, M.; Martins, J.A. Extreme Value Analysis of Air Pollution Data and Their Comparison between Two Large Urban Regions of South America. Weather Clim. Extrem. 2017, 18, 44–54.
22. El Adlouni, S.; Bobée, B.; Ouarda, T.B.M.J. On the Tails of Extreme Event Distributions in Hydrology. J. Hydrol. 2008, 355, 16–33.
23. Papalexiou, S.M.; Koutsoyiannis, D.; Makropoulos, C. How Extreme Is Extreme? An Assessment of Daily Rainfall Distribution Tails. Hydrol. Earth Syst. Sci. 2013, 17, 851–862.
24. Albrecher, H.; Beirlant, J.; Teugels, J.L. Reinsurance: Actuarial and Statistical Aspects; John Wiley & Sons: Hoboken, NJ, USA, 2017.
25. Beirlant, J.; Bladt, M. Tail Classification Using Non-Linear Regression on Model Plots. Extremes 2025, 28, 345–369.
26. Albrecher, H.; Araujo-Acuna, J.C.; Beirlant, J. Tempered Pareto-Type Modelling Using Weibull Distributions. ASTIN Bull. 2021, 51, 509–538.
27. Jakata, O.; Chikobvu, D. Estimation of Financial Risk Using the Archimedean Gumbel Copula with Log-Normal Distributed Marginals. J. Stat. Appl. Probab. 2025, 14, 543–560.
28. Reynkens, T. Using the ReIns Package. Available online: https://cran.r-project.org/web/packages/ReIns/vignettes/ReIns.html (accessed on 23 February 2026).
29. Bader, B.; Yan, J.; Zhang, X. Automated Threshold Selection for Extreme Value Analysis via Ordered Goodness-of-Fit Tests with Adjustment for False Discovery Rate. Ann. Appl. Stat. 2018, 12, 310–329.
30. Desmarais, B.A.; Harden, J.J. Testing for Zero Inflation in Count Models: Bias Correction for the Vuong Test. Stata J. Promot. Commun. Stat. Stata 2013, 13, 810–835.
31. Dekkers, A.L.M.; Einmahl, J.H.J.; De Haan, L. A Moment Estimator for the Index of an Extreme-Value Distribution. Ann. Stat. 1989, 17, 1833–1855.
32. Hill, B.M. A Simple General Approach to Inference About the Tail of a Distribution. Ann. Stat. 1975, 3, 1163–1174.
33. de Souza, A.; Aristone, F.; Fernandes, W.A.; Oliveira, A.P.G.; Olaofe, Z.; Abreu, M.C.; de Oliveira, J.F., Jr.; Cavazzana, G.; dos Santos, C.M.; Pobocikova, I. Analysis of Ozone Concentrations Using Probability Distributions. Ozone Sci. Eng. 2020, 42, 539–550.
34. D’Agostino, R.B.; Stephens, M.A. Goodness-of-Fit Techniques; Dekker: New York, NY, USA, 1986.
35. Efron, B. Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 1979, 7, 1–26.
36. Stephens, M.A. EDF Statistics for Goodness of Fit and Some Comparisons. J. Am. Stat. Assoc. 1974, 69, 730–737.
37. MacKinnon, J.G. Bootstrap Inference in Econometrics. Can. J. Econ. Can. D’écon. 2002, 35, 615–645.
38. Vuong, Q.H. Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses. Econometrica 1989, 57, 307.
39. Clarke, K.A. A Simple Distribution-Free Test for Nonnested Model Selection. Polit. Anal. 2007, 15, 347–363.
40. Karaivanov, A. Financial Constraints and Occupational Choice in Thai Villages. J. Dev. Econ. 2012, 97, 201–220.
41. Fafchamps, M. Sequential Labor Decisions Under Uncertainty: An Estimable Household Model of West-African Farmers. Econometrica 1993, 61, 1173.
42. Schneider, L.; Chalmers, R.P.; Debelak, R.; Merkle, E.C. Model Selection of Nested and Non-Nested Item Response Models Using Vuong Tests. Multivar. Behav. Res. 2020, 55, 664–684.
43. Stone, M. Cross-Validatory Choice and Assessment of Statistical Prediction. J. R. Stat. Soc. Ser. B 1974, 36, 111–147.
44. Geisser, S. The Predictive Sample Reuse Method with Applications. J. Am. Stat. Assoc. 1975, 70, 320–328.
45. Gelfand, A.E.; Dey, D.K.; Chang, H. Model Determination Using Predictive Distributions with Implementation via Sampling-Based Methods. In Bayesian Statistics 4; Oxford University Press: Oxford, UK, 1992; pp. 147–167.
46. Vehtari, A.; Gelman, A.; Gabry, J. Practical Bayesian Model Evaluation Using Leave-One-out Cross-Validation and WAIC. Stat. Comput. 2017, 27, 1413–1432.
47. Arlot, S.; Celisse, A. A Survey of Cross-Validation Procedures for Model Selection. Stat. Surv. 2010, 4, 40–79.
48. Rojo, J.; Rivero, R.; Romero-Morte, J.; Fernández-González, F.; Pérez-Badia, R. Modeling Pollen Time Series Using Seasonal-Trend Decomposition Procedure Based on LOESS Smoothing. Int. J. Biometeorol. 2017, 61, 335–348.
49. Cleveland, R.B.; Cleveland, W.S.; McRae, J.E.; Terpenning, I. STL: A Seasonal-Trend Decomposition Procedure Based on Loess. J. Off. Stat. 1990, 6, 3–73.
50. Wang, X.; Smith, K.; Hyndman, R. Characteristic-Based Clustering for Time Series Data. Data Min. Knowl. Discov. 2006, 13, 335–364.
51. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice; OTexts: Melbourne, Australia, 2018.
52. Osborn, D.R.; Chui, A.P.L.; Smith, J.P.; Birchenhall, C.R. Seasonality and the Order of Integration for Consumption. Oxf. Bull. Econ. Stat. 1988, 50, 361–377.
53. Brockwell, P.J.; Davis, A.R. Introduction to Time Series and Forecasting, 2nd ed.; Springer: Cham, Switzerland, 2002.
54. Shumway, R.H.; Stoffer, D.S. Time Series Analysis and Its Applications, 4th ed.; Springer: Cham, Switzerland, 2017.
55. Mateus, A.; Caeiro, F. An R Implementation of Several Randomness Tests. AIP Conf. Proc. 2014, 1618, 531–534.
56. Moore, G.H.; Wallis, W.A. Time Series Significance Tests Based on Signs of Differences. J. Am. Stat. Assoc. 1943, 38, 153–164.
57. Cox, D.R.; Stuart, A. Some Quick Sign Tests for Trend in Location and Dispersion. Biometrika 1955, 42, 80–95.
58. Bee, M.; Riccaboni, M.; Schiavo, S. Pareto versus Lognormal: A Maximum Entropy Test. Phys. Rev. E 2011, 84, 026104.
59. Deepa, A.; Shiva Nagendra, S.M. Statistical Distribution Models for Urban Air Quality Management. In Advances in Geosciences Volume 16: Atmospheric Science (AS); World Scientific: Singapore, 2010; pp. 285–297.
60. Taylor, J.A.; Jakeman, A.J.; Simpson, R.W. Modeling Distributions of Air Pollutant Concentrations—I. Identification of Statistical Models. Atmos. Environ. 1986, 20, 1781–1789.
61. Gulia, S.; Nagendra, S.M.S.; Khare, M. Extreme Events of Reactive Ambient Air Pollutants and Their Distribution Pattern at Urban Hotspots. Aerosol Air Qual. Res. 2017, 17, 394–405.
62. Sharma, S.; Sharma, P.; Khare, M.; Kwatra, S. Statistical Behavior of Ozone in Urban Environment. Sustain. Environ. Res. 2016, 26, 142–148.
63. Aleksandropoulou, V.; Eleftheriadis, K.; Diapouli, E.; Torseth, K.; Lazaridis, M. Assessing PM 10 Source Reduction in Urban Agglomerations for Air Quality Compliance. J. Environ. Monit. 2012, 14, 266–278.
64. Maciejewska, K.; Juda-Rezler, K.; Reizer, M.; Klejnowski, K. Modelling of Black Carbon Statistical Distribution and Return Periods of Extreme Concentrations. Environ. Model. Softw. 2015, 74, 212–226.

Figure 1. The STL decomposition of NO₂ emissions from Majuba power station: actual data, seasonal, trend and remainder components.

Figure 2. The QQ plot (a) and histogram (b) of NO₂ emissions from Majuba power station. In both graphs, the red line denotes the fitted normal distribution (theoretical). The blue dotted line in the QQ plot represents the data points, and the dotted grey line denotes the 95% confidence interval.

Figure 3. The EVI estimates plot (a) and the generalised quantile–quantile (QQ) plot (b). The circles in the generalised QQ plot indicate the NO₂ emissions data points.

Figure 4. The histogram (a), PP plot (b), QQ plot (c) and the derivative plot (d) for the Weibull distribution. The circles in the PP, QQ and derivative plots indicate the NO₂ emissions data points, while the straight line in the PP plot indicates the expected fit if the distribution well represents the data.

Figure 5. The histogram (a), PP plot (b), QQ plot (c), and the derivative plot (d) for the Lognormal distribution. The circles in the PP, QQ and derivative plots indicate the NO₂ emissions data points, while the straight line in the PP plot indicates the expected fit if the distribution well represents the data.

Figure 6. The histogram (a), PP plot (b), QQ plot (c), and the derivative plot (d) for the Pareto distribution. The circles in the PP, QQ and derivative plots indicate the NO₂ emissions data points, while the straight line in the PP plot indicates the expected fit if the distribution well represents the data.

Table 1. Comparison of past studies and the current study.

Study	Type of Data Used (Location)	Methodology Used	Analysis of Tail Distribution?	Main Study Limitations
Kan et al. [12]; Nwaigwe et al. [13]; Intarapak et al. [16]; Oguntunde et al. [17]; Giavis et al. [18]; Hamid et al. [19]	Pollutant data	Parametric modelling, GOF tests	No	Focus on overall distribution; no tail emphasis
Okorie et al. [14]	Surface ozone (USA)	Flexible heavy-tailed distributions	Limited	Tail considered, but no threshold-based EVT framework
Plocoste et al. [15]	PM₁₀ (Guadeloupe)	Parametric modelling, EVT tools, GOF tests	Yes	Focus on overall distribution; tail distribution considered, but not exceedances
Albrecher et al. [24]	Actuarial loss data	Parametric modelling, EVT, QQ plots, derivative plots, tail heaviness ranking	Yes	Focus on overall and tail distribution; introduces derivative QQ plots for tail classification
Beirlant et al. [25]	Theoretical/EVT	Nonlinear regression on model plots with mention of advantages of derivative plots, tail heaviness ranking	Yes	Advanced tail classification methodology
Albrecher et al. [26]	Actuarial loss data	Weibull-tempered Pareto modelling, QQ and derivative diagnostics, tail heaviness ranking	Yes	Focus on overall and tail distribution; uses derivative plots for tail discrimination
Jakata et al. [27]	Financial risk data	Derivative plots for tail characterisation before copula modelling with lognormal marginals, tail heaviness ranking	Yes	Tail dependence modelling; uses derivative plots for tail characterisation
Current study	NO₂ emissions from the Majuba power station (2005–2014)	STL decomposition, seasonal and trend adjustment (see next subsection for details), parametric modelling, GOF tests (bootstrap, Vuong test, cross-validated likelihood, etc.), EVT diagnostics, derivative QQ plots, tail heaviness ranking	Yes (threshold-based)	Focus on overall and tail distribution; uses derivative plots for tail discrimination; small sample, tail inference uncertainty

Table 2. Descriptive statistics for nitrogen dioxide (NO₂), $Y_{t}^{*}$, emissions given in tons.

N	Mean	Median	Standard Deviation	Minimum	Maximum	Skewness	Kurtosis
108	10,445.59	10,311.79	1090.09	7858.59	13,507.59	0.2359	3.0447

Table 3. Tests of normality, stationarity, and independence for the NO₂ emissions (in tons) from Majuba power station.

Property Tested	Test	Statistic	p-Value
Normality	Kolmogorov–Smirnov (KS) test for normality of data	1	2.2 × 10⁻¹⁶
Normality	Anderson–Darling (AD) test for normality of data	Inf	5.556 × 10⁻⁶
Stationarity	Augmented Dickey–Fuller (ADF) test	−8.3474	<0.01
Stationarity	Phillips–Perron unit root test	−69.24	<0.01
Stationarity	Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test for level stationarity	0.014691	>0.1
Independence (Randomness) [55]	Difference-sign test of randomness [53,55,56]	0.8295	0.4068
Independence (Randomness) [55]	Mann–Kendall rank test of randomness [55]	−0.047787	0.9619
Independence (Randomness) [55]	Cox–Stuart test of randomness [57]	29	0.6835
Independence (Randomness) [55]	Wald–Wolfowitz runs test, two-sided [55]	−1.9336	0.05317

Table 4. Bootstrap goodness-of-fit tests, information criteria and cross-validated loglikelihood for the Weibull, Lognormal, and Pareto distributions.

Distribution	AIC	BIC	Cross-Validated Loglikelihood	Bootstrap KS Statistic	Bootstrap KS p-Value	Bootstrap AD Statistic	Bootstrap AD p-Value
Weibull	1833.55	1838.92	−916.2252	0.0901	0.031	1.5409	0.001
Lognormal	1819.31	1824.67	−908.6738	0.0429	0.907	0.2305	0.812
Pareto	2218.85	2224.21	−1107.4337	0.5366	0	40.116	0
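
As a quick consistency check on Table 4, the AIC and BIC columns can be related through the maximised log-likelihood. The sketch below is illustrative only and is not the authors' code. It assumes the standard definitions AIC = 2p − 2 ln L and BIC = p ln(n) − 2 ln L, takes n = 108 from Table 2, and assumes p = 2 free parameters per distribution (as the parameter pairs in Table 6 suggest). The function names `loglik_from_aic`, `bic_from_loglik` and `check_table4` are hypothetical.

```python
import math

N_OBS = 108      # sample size reported in Table 2
N_PARAMS = 2     # assumption: two free parameters per fitted distribution

# (AIC, BIC) pairs transcribed from Table 4.
TABLE4 = {
    "Weibull": (1833.55, 1838.92),
    "Lognormal": (1819.31, 1824.67),
    "Pareto": (2218.85, 2224.21),
}


def loglik_from_aic(aic, p=N_PARAMS):
    """Invert AIC = 2p - 2*logL to recover the maximised log-likelihood."""
    return (2 * p - aic) / 2.0


def bic_from_loglik(loglik, n=N_OBS, p=N_PARAMS):
    """Standard definition BIC = p*ln(n) - 2*logL."""
    return p * math.log(n) - 2.0 * loglik


def check_table4(table=TABLE4):
    """Return {name: (implied logL, implied BIC, reported BIC)}."""
    out = {}
    for name, (aic, bic_reported) in table.items():
        loglik = loglik_from_aic(aic)
        out[name] = (loglik, bic_from_loglik(loglik), bic_reported)
    return out


for name, (loglik, bic_implied, bic_reported) in check_table4().items():
    print(f"{name:10s} logL={loglik:10.3f} "
          f"BIC implied={bic_implied:8.2f} reported={bic_reported:8.2f}")
```

Under these assumptions the implied BIC values agree with the reported ones to within rounding. The implied in-sample log-likelihoods also sit slightly above the cross-validated values in Table 4, which is what one would expect when the likelihood is scored on held-out data.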

Table 5. BIC-corrected Vuong test for the Weibull, Lognormal, and Pareto distributions.

Comparison (Model 1 vs. Model 2)	BIC-Corrected Vuong Statistic (V)	p-Value
Lognormal vs. Weibull	2.084035	0.03716
Lognormal vs. Pareto	26.883949	<0.0001
Weibull vs. Pareto	26.789151	<0.0001
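
The p-values in Table 5 can be reproduced from the V statistics if V is referred to a standard normal distribution with a two-sided alternative. The following is a minimal sketch under that assumption, not the authors' implementation. It further assumes the sign convention that a positive V favours Model 1, which is consistent with the Lognormal being preferred in the first two rows. The names `two_sided_normal_pvalue` and `vuong_decision` are hypothetical.

```python
import math


def two_sided_normal_pvalue(v):
    """Two-sided p-value 2*(1 - Phi(|v|)) for a standard-normal statistic."""
    return math.erfc(abs(v) / math.sqrt(2.0))


def vuong_decision(v, model1, model2, alpha=0.05):
    """Return (preferred model or 'indistinguishable', p-value).

    Assumption: V > 0 favours model1 and V < 0 favours model2.
    """
    p = two_sided_normal_pvalue(v)
    if p >= alpha:
        return "indistinguishable", p
    return (model1 if v > 0 else model2), p


# V statistics transcribed from Table 5.
comparisons = [
    (2.084035, "Lognormal", "Weibull"),
    (26.883949, "Lognormal", "Pareto"),
    (26.789151, "Weibull", "Pareto"),
]

for v, m1, m2 in comparisons:
    winner, p = vuong_decision(v, m1, m2)
    print(f"{m1} vs. {m2}: V={v:.4f}, p={p:.5g}, preferred={winner}")
```

For the Lognormal vs. Weibull row this gives a p-value that matches the reported 0.03716 to the precision shown, so the Lognormal is preferred at the 5% level but the margin over the Weibull is far narrower than over the Pareto.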

Table 6. ML parameter estimates for the Weibull, Lognormal, and Pareto distributions.

Distribution	Parameter	Estimate	Standard Error
Weibull	$τ$	9.8699	0.687
Weibull	$λ$	10,941.9598	113.1302
Lognormal (values are in logscale)	$μ$	9.2485	0.01
Lognormal (values are in logscale)	$σ$	0.104	0.0071
Pareto	$α$	4.19 × 10⁶	–
Pareto	$β$	4.38 × 10¹⁰	–
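
To see what the estimates in Table 6 imply on the original scale (tons), the fitted mean and median can be computed from the closed-form expressions for each distribution and compared with the sample values in Table 2. The sketch below is an illustration under stated assumptions rather than part of the original analysis: it assumes that $τ$ is the Weibull shape and $λ$ the Weibull scale, and that $μ$ and $σ$ are the mean and standard deviation of the log-emissions. The helper names are hypothetical.

```python
import math

# Estimates transcribed from Table 6.
LOGNORMAL_MU, LOGNORMAL_SIGMA = 9.2485, 0.104
WEIBULL_SHAPE, WEIBULL_SCALE = 9.8699, 10941.9598  # assumption: tau=shape, lambda=scale

# Sample values from Table 2, for comparison.
SAMPLE_MEAN, SAMPLE_MEDIAN = 10445.59, 10311.79


def lognormal_mean(mu, sigma):
    """E[Y] = exp(mu + sigma^2 / 2) for a Lognormal variable."""
    return math.exp(mu + sigma ** 2 / 2.0)


def lognormal_median(mu):
    """Median of a Lognormal variable is exp(mu)."""
    return math.exp(mu)


def weibull_mean(shape, scale):
    """E[Y] = scale * Gamma(1 + 1/shape) for a two-parameter Weibull."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def weibull_median(shape, scale):
    """Median = scale * (ln 2)^(1/shape) for a two-parameter Weibull."""
    return scale * math.log(2.0) ** (1.0 / shape)


fitted = {
    "Lognormal": (lognormal_mean(LOGNORMAL_MU, LOGNORMAL_SIGMA),
                  lognormal_median(LOGNORMAL_MU)),
    "Weibull": (weibull_mean(WEIBULL_SHAPE, WEIBULL_SCALE),
                weibull_median(WEIBULL_SHAPE, WEIBULL_SCALE)),
}

print(f"Sample     mean={SAMPLE_MEAN:10.2f} median={SAMPLE_MEDIAN:10.2f}")
for name, (mean, median) in fitted.items():
    print(f"{name:10s} mean={mean:10.2f} median={median:10.2f}")
```

Under these parameterisations the fitted Lognormal mean lands within a few tons of the sample mean in Table 2, which is in line with the Lognormal giving the best overall fit.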

Table 7. BIC-corrected Vuong test for the comparison of the Weibull, Lognormal, and Pareto distributions.

k	BIC (Weibull)	BIC (Lognormal)	BIC (Pareto)	Lognormal vs. Weibull: Statistic (V)	Lognormal vs. Weibull: p-Value	Lognormal vs. Pareto: Statistic (V)	Lognormal vs. Pareto: p-Value	Weibull vs. Pareto: Statistic (V)	Weibull vs. Pareto: p-Value	Best Model
11	174.04	171.78	234.29	2.3488	0.0188	13.4507	0	13.1238	0	Lognormal
12	190.67	188.17	255.2	2.301	0.0214	14.9088	0	13.7076	0	Lognormal
13	207.29	204.51	276.1	2.1646	0.0304	16.1086	0	14.3098	0	Lognormal
14	223.74	220.55	296.98	2.1712	0.0299	17.0232	0	14.9065	0	Lognormal
15	240.05	236.35	317.85	2.2477	0.0246	17.6449	0	15.4743	0	Lognormal
16	256.29	252.05	338.7	2.3428	0.0191	18.0508	0	15.954	0	Lognormal
17	272.45	267.61	359.55	2.4571	0.014	18.2801	0	16.3598	0	Lognormal
18	288.74	283.39	380.38	2.5268	0.0115	18.485	0	16.7224	0	Lognormal
19	304.95	299.04	401.2	2.6222	0.0087	18.6069	0	17.0047	0	Lognormal
20	321.29	314.89	422	2.6761	0.0074	18.7306	0	17.3331	0	Lognormal
21	337.59	330.68	442.79	2.7504	0.006	18.8289	0	17.6024	0	Lognormal
22	354.26	347.13	463.54	2.7238	0.0065	18.959	0	17.8265	0	Lognormal
23	370.92	363.51	484.29	2.738	0.0062	19.1409	0	18.0434	0	Lognormal
24	387.55	379.82	505.03	2.7736	0.0055	19.3557	0	18.3223	0	Lognormal
25	404.2	396.16	525.76	2.8134	0.0049	19.5581	0	18.5777	0	Lognormal
26	420.82	412.4	546.48	2.8719	0.0041	19.7686	0	18.7847	0	Lognormal
27	437.46	428.69	567.19	2.9266	0.0034	19.9751	0	19.016	0	Lognormal
28	454.05	444.86	587.89	2.999	0.0027	20.1686	0	19.243	0	Lognormal
29	470.64	461.05	608.59	3.0678	0.0022	20.3542	0	19.4307	0	Lognormal
30	487.21	477.16	629.28	3.1475	0.0016	20.5208	0	19.6426	0	Lognormal
31	503.77	493.28	649.97	3.2226	0.0013	20.6809	0	19.8358	0	Lognormal
32	520.4	509.5	670.65	3.286	0.001	20.8441	0	20.0033	0	Lognormal
33	537.01	525.66	691.32	3.362	0.0008	20.9974	0	20.2079	0	Lognormal
34	553.64	541.87	711.98	3.4203	0.0006	21.1522	0	20.3653	0	Lognormal
35	570.22	557.99	732.64	3.4944	0.0005	21.2889	0	20.5108	0	Lognormal
36	586.79	574.08	753.29	3.5685	0.0004	21.4181	0	20.6749	0	Lognormal
37	603.37	590.18	773.95	3.6392	0.0003	21.5471	0	20.8398	0	Lognormal
38	619.92	606.24	794.59	3.7169	0.0002	21.6589	0	20.9758	0	Lognormal
39	636.46	622.27	815.23	3.7928	0.0001	21.7655	0	21.1142	0	Lognormal
40	653.01	638.3	835.87	3.8633	0.0001	21.8583	0	21.2393	0	Lognormal

