Article

Statistical Analysis of NO2 Emissions from Eskom’s Majuba Coal-Fired Power Station in Mpumalanga, South Africa

by Mpendulo Wiseman Mamba 1,* and Delson Chikobvu 2
1 Department of Mathematical and Physical Sciences, Central University of Technology, Bloemfontein 9301, South Africa
2 Department of Mathematical Statistics and Actuarial Science, University of the Free State, Bloemfontein 9300, South Africa
* Author to whom correspondence should be addressed.
Atmosphere 2026, 17(4), 415; https://doi.org/10.3390/atmos17040415
Submission received: 2 March 2026 / Revised: 11 April 2026 / Accepted: 13 April 2026 / Published: 19 April 2026
(This article belongs to the Special Issue Modeling and Monitoring of Air Quality: From Data to Predictions)

Abstract

Gaseous emissions from coal combustion during electricity generation continue to be a challenge in South Africa. To meet the regulatory limits, it is crucial to understand the statistical distribution of such emissions from the power generating plants. The current paper characterises the nitrogen dioxide (NO2) emissions from Eskom’s Majuba coal-fired power station by making use of the quantile–quantile (QQ) plots and derivative plots of three statistical parent distributions, namely, the Weibull, Lognormal, and Pareto distributions. These distributions are fitted and compared according to their tail heaviness as they cater for data that may have tails lighter or heavier than that of the Exponential distribution. Of the three distributions evaluated here, the Lognormal gave the best fit for the full body of the data according to the QQ and derivative plots, and the goodness-of-fit tools (bootstrap Kolmogorov–Smirnov (KS), Anderson–Darling (AD), Akaike Information Criterion (AIC), Schwarz’s Bayesian Information Criterion (BIC), and the BIC-corrected Vuong test for non-nested distributions). The Lognormal distribution also gave the best fit for the overall upper tail, while at the very top six largest NO2 emission observations in the upper tail, a Pareto-type tail was observed. The practical implication of a heavy tail like the Pareto is that it models more frequent larger sized NO2 emissions compared to lighter tails like the Weibull and Lognormal tails. The methods used in this study give a framework on how emissions of NO2 from a coal-fired power station can be modelled using statistical parent distributions whilst also taking into account the distribution of the data in the tails which is mostly ignored when fitting statistical parent distributions. Understanding the distribution of the upper tail is very important since higher and rare emissions are of the most concern and are dangerous to human health and the environment.

1. Introduction

Modern society relies so heavily on electricity that any disturbance to its supply severely affects both day-to-day life and the economy [1]. About 40% of electricity around the globe is generated by coal-fired power stations, and in the South African context, this figure is much higher at about 77% [2]. The life span of most coal-fired power stations is around 50 years and depends on how they were designed and constructed. Since 2019, many of these power stations in South Africa (SA) have reached this expected life span [3].
With industrialisation, urbanisation, and population growth, the consumption of energy in SA is rising. Ceteris paribus, this implies an increase in emissions [4,5]. About 50% of Africa’s emissions are produced in South Africa due to its extensive use of coal [6]. Over a 20-year projection, global emissions are expected to grow by approximately 30% in the absence of strong mitigation policies [7]. One of the emissions from the burning of coal during power generation is nitrogen dioxide (NO2). NO2 emissions are mostly sourced from thermal power stations and automobile exhausts. When inhaled, they are linked to respiratory tract inflammation and can worsen the health of patients with emphysema, asthma, heart diseases, and other diseases [8]. For these reasons, understanding and predicting the behaviour of emissions from coal-fired power stations, including NO2, is essential for their management and reduction. For reaction mechanisms as well as the newest developments in the field, see [9,10].
Applying statistical (probability) distributions to predict pollutant concentrations is important for determining the impact of pollutants on human health. These pollutant concentrations are treated as random variables that can be modelled by a positively skewed statistical distribution [11]. No single distribution is designated in advance for modelling a particular pollutant; rather, the choice of distribution depends on emission levels, meteorological conditions, and geography [12].
In the modelling of parent distribution of air pollutants, Nwaigwe et al. [13] used three distributions, namely, Weibull, Lognormal, and Gamma, to model carbon monoxide (CO) emissions observations in Nigeria for the period of 1996 to 2016. The Weibull outperformed the Lognormal and the Gamma distributions in describing the carbon monoxide data.
A study by Okorie et al. [14] on several sites (Weaverville, California (WVR); Tundra Lab, Niwot Ridge, Colorado (TUN); South Pole, Antarctica (SPO); and Mauna Loa, Hawaii (MLO)) in the United States of America investigated four datasets of surface-level ozone (O3) and fitted eleven heavy-tailed distributions, namely, the generalized Pareto distribution, generalized extreme value distribution, Pareto type-I distribution, Pareto type-II distribution, Log-Cauchy distribution, Burr distribution, log-logistic distribution, Fréchet distribution, Lognormal distribution, Lévy distribution, and Dagum distribution. The Dagum distribution gave the best fit for Weaverville, the Burr distribution gave the best fit for Niwot Ridge and Mauna Loa, and the Log-Cauchy distribution gave the best fit for the South Pole site.
In another study, of PM10 (particulate matter of size 10 micrometres or less) daily average concentrations collected over an 11-year period in the Guadeloupe archipelago, the Weibull, Lognormal, Burr, and stable distributions were fitted, and mixture models were also considered. In contrast to the studies above, the Burr and Weibull mixture models gave the best fit for both the parent and tail distributions [15].
In another study, daily average PM2.5 concentrations collected at the Yupparaj Wittayalai school station and the City Hall station in Chiang Mai’s Muangwere in Thailand over a period of two years (2016–2018) were analysed by considering the Weibull, Gamma, Lognormal, and Inverse Gaussian distributions. The Inverse Gaussian distribution gave the best fit for the daily average PM2.5 concentrations from both stations [16].
Oguntunde et al. [17] applied three theoretical statistical distributions, namely, the Weibull, Gamma, and Lognormal, to model carbon monoxide (CO) concentrations in Lagos, Nigeria. The Gamma fitted the data better than the Weibull and Lognormal distributions according to the Anderson–Darling and Kolmogorov–Smirnov tests. The characteristics of the pollutant were determined, and the probabilities of exceeding the set limits were predicted from the best fitting distribution.
In a study by Giavis et al. [18], the Lognormal, Weibull, and Gamma distributions were fitted to particulate matter with an aerodynamic diameter less than 10 µm (PM10) recorded in Athens, Greece and Manchester, UK. Goodness-of-fit was assessed for the three distributions using three measures: Mean Bias Error, Root Mean Square Error, and index of agreement. Results showed that, in general, the three distributions can be used to represent the PM10 data. However, the Weibull gave unstable results for the PM10 data, while the most appropriate fit of the data was obtained using the Lognormal distribution.
Another study in Malaysia [19] focused on the distribution of ground level ozone (O3), one of the major contributors to the Air Pollution Index in Malaysia, by fitting and comparing the two-parameter distributions, the Lognormal and the Gamma, in order to find the best fitting distribution. In this study, the Gamma distribution outperformed the Lognormal in the modelling of the O3 data.
For high concentrations, however, centre fitting of a distribution to pollutant emissions data tends to produce a fit that is not representative of the data [20]. In such cases, it is common to employ distributions that capture the behaviour at the extremities. This enhances understanding of the emission pattern. Modelling high emissions therefore becomes very important, since even short-term exposure to such emissions can have serious health implications for the population, among other ills [21].
Thus, other studies considered the modelling of air pollutants by fitting multiple distributions that include both parent and extreme distributions. For example, Kan et al. [12] applied the Gamma, Lognormal, Pearson V, and extreme value distributions to daily average concentration data of three pollutants, namely, PM10, SO2, and NO2 in Shanghai, China. The Lognormal, Pearson V, and extreme value distributions gave good fits for the PM10, SO2, and NO2 data, respectively.
Like the studies above, the current study makes use of statistical parent distributions in modelling the NO2 emissions data from Majuba power station. However, unlike the studies mentioned, the current study considers the distribution of the data in the main body as well as in the tails when selecting the best distributions to represent the NO2 emissions data. Traditionally, parent distributions are used for central fitting of the data, and estimates are based on where the bulk of the data is located, without focusing on the data in the tails/outliers. The current study aims to use and compare three parent distributions, namely, the Weibull, Lognormal, and Pareto distributions, chosen according to increasing tail heaviness [22], to find the most representative distributions for the full NO2 emissions data. Additionally, these distributions are simple, widely utilised, and tail-equivalent (or possess similar asymptotic behaviour) to numerous more complex distributions, which justifies their selection [23]. Albrecher et al. [24] indicate that if the data is suspected to have a heavy upper tail, then the QQ and derivative QQ plots (or derivative plots) of the Weibull, Lognormal, and Pareto offer an alternative for modelling the data, with the Pareto giving a good fit for large claims data, while the Lognormal performs well for medium claims. The Weibull distribution, in turn, is capable of modelling data with tails that are lighter than ($\tau > 1$), heavier than ($\tau < 1$), or equal to ($\tau = 1$) that of an Exponential distribution [22]. The chosen distributions can therefore model data with tails ranging from lighter to heavier than that of the Exponential distribution. This is important since the Exponential distribution is the basis for classification of tail heaviness [23,24], and explaining the upper tail distribution of the NO2 emissions with reference to the Exponential distribution is a good initial step.
Graphical analysis that employs multiple plots provides better exploration and analysis of data for a balanced conclusion [24]. QQ and derivative plots of the Weibull, Lognormal, and Pareto distributions are examples of such plots. Some works in the literature have employed at least two of the three distributions for explaining the upper tail heaviness of a dataset. For example, Albrecher et al. [24] laid a foundation and introduced the derivative plots of the three distributions with applications to some insurance datasets. Beirlant et al. [25] considered only the Weibull, Lognormal, and Pareto distributions in the tail modelling and classification of a few insurance datasets and highlighted the strength of the derivative plots. Albrecher et al. [26] made use of the derivative plots, among other plots, of the Pareto and Weibull to model the main body and upper tail, respectively, of a few insurance datasets. Jakata et al. [27] compared the Weibull, Lognormal, and Pareto distributions to characterise the tails of the South African Industrial and Financial Indices growth rates, and employed derivative plots in arriving at the most appropriate distribution for their data. In these studies [24,25,26,27], the derivative plots were either used to assist in characterising different datasets or referenced as good diagnostic tools. One benefit of the derivative plot, when partitioning is plausible, is that it allows for piecewise distribution fitting, since it can capture distributional patterns across the components of a dataset. It can also indicate whether a lighter or heavier tailed distribution than the one being investigated would be better suited to the full dataset or its component(s), thus facilitating distribution fitting of a full dataset, including both central and tail fitting. These features can also be very useful in diagnostic checking of a dataset before more sophisticated modelling techniques, such as mixture, composite, or extreme value theory distributions, are fitted.
In the emissions literature, the flexible derivative plot, unlike the often-utilised QQ, PP, and density plots, is rarely used to determine the most suitable distribution for emission concentrations. This paper intends to utilise the derivative plot alongside QQ, PP, and density plots of the Weibull, Lognormal, and Pareto distributions to analyse the full NO2 emissions data from the Majuba power station, as understanding emissions patterns, in both the bulk and upper tail of the data, can aid in modelling associated environmental risks. Therefore, the study aims to answer the following questions:
  • Among the Weibull, Lognormal, and Pareto distributions, which distribution(s) best characterise the full/overall and upper tail distribution patterns of NO2 emissions data from the Majuba power plant?
  • Can the derivative plot provide additional diagnostic value for distribution selection?
Table 1 summarises the differences in the methods used in past studies and the current one. It mainly highlights improvements in actuarial modelling methods, i.e., derivative plots, with a focus on overall and tail fitting, which the current study proposes for the NO2 emissions dataset from Majuba power station.

Methodological Overview

The paper is organised as follows:
(i) First, the data is assessed to check whether the assumptions of independence and stationarity hold before the probability distributions are fitted. If they do not hold, the data is adjusted to satisfy the assumptions.
(ii) The upper tail heaviness is then assessed to determine the appropriateness of the selected distributions for the data, particularly for fitting the upper tail. The EVI estimates and the generalised QQ plots are used to achieve this.
A good choice of $k$, the number of exceedances, and thus of the threshold, can be determined by selecting a point or points where two or more of the plots of the EVI estimates $\hat{\gamma}$ intersect [24,28]. There is, however, a trade-off between variance and bias in the selection of $k$. With higher values of $k$ (lower values of the threshold), bias increases and the variance decreases. Conversely, with lower values of $k$ (higher values of the threshold), bias decreases and the variance increases [29]. As a result, caution should be applied when $k$ is selected. The purpose of the current step is only to check the suitability of the selected distributions by assessing the upper tail.
(iii) The Weibull, Lognormal, and Pareto distributions are then fitted to the NO2 emissions data by employing the QQ and corresponding derivative plots of these distributions. For all three distributions, a linear QQ plot and a horizontal derivative plot show that the data belongs to that particular distribution. Convexity in the QQ plot and an increasing derivative plot suggest that a heavier tailed distribution than the one under investigation is a better candidate for that component of the data, while concavity in the QQ plot and a decreasing derivative plot suggest that a lighter tailed distribution than the one investigated is appropriate for the component. Because of this flexibility, employing the QQ and derivative plots allows for piecewise analysis where necessary [24].
(iv) The bootstrap goodness-of-fit tests, cross-validated likelihood, and information criteria (Akaike Information Criterion (AIC) and Schwarz’s Bayesian Information Criterion (BIC)) are used to assess the adequacy of the models for the full NO2 emissions data. The BIC-corrected Vuong test for non-nested distributions is then used to compare the performance of the distributions used in this study. The BIC-corrected version of the Vuong test is used since it places a heavier penalty on model complexity than the AIC-corrected and uncorrected versions [30].
(v) In the final step, the BIC and the BIC-corrected Vuong test are again used to compare the performance of the three distributions across different values of $k$ to check the stability of the fit in the upper tail of the NO2 emissions data. A distribution that is consistently preferred across $k$ will indicate stability in the choice of distribution.

2. Methodology

This section presents the probability distribution functions (pdf), namely, the Weibull, Lognormal, and Pareto distributions, their corresponding quantile–quantile plots, and the derivative plots. Also provided are the estimators of the extreme value index (EVI), $\gamma$.

2.1. The Shape of the Upper Tail: Extreme Value Index ($\gamma$) Estimation

The Hill and EPD estimators provide estimates of the extreme value index ($\gamma$). However, the estimates produced by these methods are limited to $\gamma > 0$ and do not cater for $\gamma \le 0$. As a result, for this study, the generalised Pareto distribution, the generalised Hill, and the Moment estimators are used to determine the shape of the tail before fitting any of the three distributions. These estimators will assist in assessing which distribution is likely to provide a good tail fit. Additionally, to determine a potential candidate distribution for both the central and tail fit, the generalised QQ plot provides a good alternative and will be considered. The estimators are given by the following equations.
  • Generalised Hill [24]
    $$\hat{\gamma}^{GH}_{k,n} = \frac{1}{k}\sum_{j=1}^{k} \log UH_{j,n} - \log UH_{k+1,n} = H_{k+1,n} + \frac{1}{k}\sum_{j=1}^{k}\left(\log H_{j,n} - \log H_{k+1,n}\right),$$
    where $UH_{j,n} = X_{n-j,n} H_{j,n}$ and $H_{k,n}$ is the Hill statistic given in Section 2.3.
  • Moment estimator [24,31]
    $$\hat{\gamma}^{M}_{k,n} = H_{k,n} + 1 - \frac{1}{2}\left(1 - \frac{H_{k,n}^{2}}{H_{k,n}^{(2)}}\right)^{-1},$$
    where $H_{k,n}^{(2)} = \frac{1}{k}\sum_{j=1}^{k}\left(\log X_{n-j+1,n} - \log X_{n-k,n}\right)^{2}$ and $H_{k,n}^{2} = (H_{k,n})^{2}$.
  • Generalised QQ plot [24]
To verify the choice of $\gamma$, the generalised QQ plot, given by the following equation, is used:
$$\left(\log\frac{n+1}{k+1}, \; \log\left(X_{n-k,n} H_{k,n}\right)\right), \quad k = 1, \ldots, n-1.$$
The generalised QQ plot, therefore, allows for any value of $\gamma$. A horizontal trend suggests the data belongs to the light tailed Gumbel domain, with $\gamma = 0$. A decreasing pattern suggests that the data may have a tail lighter than that of an Exponential distribution, while an increasing pattern indicates a tail heavier than that of the Exponential distribution.
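The estimators above can be sketched in a few lines of Python. This is a minimal illustration and not the ReIns implementation used in the paper: the function names, the simulated strict Pareto sample, and the assumption of positive data without ties are all introduced here for illustration only.

```python
import math
import random


def hill(x_sorted, k):
    """Hill statistic H_{k,n}: mean log-excess of the k largest values over X_{n-k,n}."""
    n = len(x_sorted)
    log_threshold = math.log(x_sorted[n - k - 1])  # X_{n-k,n} (0-based index n-k-1)
    top = x_sorted[n - k:]                         # X_{n-k+1,n}, ..., X_{n,n}
    return sum(math.log(x) for x in top) / k - log_threshold


def moment_estimator(x_sorted, k):
    """Moment estimator; needs k >= 2, otherwise H^(2) equals H^2 and the ratio is undefined."""
    n = len(x_sorted)
    log_threshold = math.log(x_sorted[n - k - 1])
    h1 = hill(x_sorted, k)
    h2 = sum((math.log(x) - log_threshold) ** 2 for x in x_sorted[n - k:]) / k
    return h1 + 1.0 - 0.5 / (1.0 - h1 * h1 / h2)


def log_uh(x_sorted, j):
    """log UH_{j,n} = log(X_{n-j,n} * H_{j,n}); assumes no ties so that H_{j,n} > 0."""
    n = len(x_sorted)
    return math.log(x_sorted[n - j - 1]) + math.log(hill(x_sorted, j))


def generalised_hill(x_sorted, k):
    """Generalised Hill estimator; needs k <= n - 2."""
    mean_log_uh = sum(log_uh(x_sorted, j) for j in range(1, k + 1)) / k
    return mean_log_uh - log_uh(x_sorted, k + 1)


def generalised_qq(x_sorted):
    """Coordinates (log((n+1)/(k+1)), log UH_{k,n}) for k = 1, ..., n-1."""
    n = len(x_sorted)
    return [(math.log((n + 1) / (k + 1)), log_uh(x_sorted, k)) for k in range(1, n)]


def simulate_pareto(n, alpha, seed=1):
    """Hypothetical test data: strict Pareto(alpha) with scale 1, so gamma = 1/alpha."""
    rng = random.Random(seed)
    return sorted((1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n))
```

For a simulated Pareto sample with $\alpha = 2$, all three estimators should fluctuate around $\gamma = 0.5$; plotting them against $k$ gives the kind of EVI plot used to judge tail heaviness.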

2.2. The Exponential Distribution

When modelling the tails of statistical distributions, the Exponential distribution plays a significant role as the baseline for determining how thick or thin the tail of the data is. The Exponential distribution is the special case of the Weibull distribution with shape parameter $\tau = 1$.

2.3. Weibull Distribution

The Weibull distribution with shape parameter $\tau$, scale parameter $\lambda$, and distribution function
$$F(x) = 1 - \exp(-\lambda x^{\tau}), \quad x > 0, \tag{4}$$
is a first Box–Cox transformation of the Exponential distribution, and for $0 < \tau < 1$, it is sub-exponential. When $\tau > 1$, the Weibull in Equation (4) is lighter tailed than the Exponential (LTE). Conversely, when $\tau < 1$, the Weibull is heavier tailed than the Exponential (HTE). The Weibull distribution has extreme value index $\gamma = 0$ and, thus, belongs to the Gumbel domain.
The QQ plot of the Weibull distribution is given as
$$\left(\log\left[-\log\left(1 - \frac{i}{n+1}\right)\right], \; \log X_{i,n}\right), \quad i = 1, \ldots, n,$$
and the derivative plot is given as
$$\left(\log x_{n-k,n}, \; \frac{H_{k,n}}{W_{k,n}}\right) \quad \text{or} \quad \left(k, \; \frac{H_{k,n}}{W_{k,n}}\right),$$
where $W_{k,n} = \frac{1}{k}\sum_{j=1}^{k}\log\log\frac{n+1}{j} - \log\log\frac{n+1}{k+1}$ and $H_{k,n} = \frac{1}{k}\sum_{j=1}^{k}\log X_{n-j+1,n} - \log X_{n-k,n}$. $H_{k,n}$ is the Hill estimator of $\gamma = 1/\alpha$ [32].
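As an illustration of how the Weibull QQ and derivative plot coordinates can be computed, the following sketch evaluates them directly from the formulas above. The function names and the "ideal" sample made of exact Weibull quantiles are assumptions introduced here; the paper itself relies on the ReIns package.

```python
import math


def hill(x_sorted, k):
    """Hill statistic H_{k,n} computed from the k largest order statistics."""
    n = len(x_sorted)
    log_threshold = math.log(x_sorted[n - k - 1])  # X_{n-k,n}
    return sum(math.log(x) for x in x_sorted[n - k:]) / k - log_threshold


def weibull_qq(x_sorted):
    """Weibull QQ plot coordinates (log(-log(1 - i/(n+1))), log X_{i,n})."""
    n = len(x_sorted)
    return [(math.log(-math.log(1.0 - i / (n + 1))), math.log(x))
            for i, x in enumerate(x_sorted, start=1)]


def w_kn(n, k):
    """W_{k,n}: the Weibull analogue of the mean log-excess of the theoretical quantiles."""
    mean_term = sum(math.log(math.log((n + 1) / j)) for j in range(1, k + 1)) / k
    return mean_term - math.log(math.log((n + 1) / (k + 1)))


def weibull_derivative(x_sorted, k):
    """Ordinate H_{k,n} / W_{k,n} of the Weibull derivative plot at a given k."""
    return hill(x_sorted, k) / w_kn(len(x_sorted), k)


def weibull_quantile_sample(n, tau, lam=1.0):
    """Hypothetical 'ideal' sample: exact quantiles of F(x) = 1 - exp(-lam * x**tau)."""
    return [(-math.log(1.0 - i / (n + 1)) / lam) ** (1.0 / tau)
            for i in range(1, n + 1)]
```

For exact Weibull quantiles the QQ plot is a straight line with slope $1/\tau$ and the derivative plot is flat at the same value, which is the "linear QQ plot, horizontal derivative plot" pattern the text describes for data that follow the candidate distribution.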

2.4. Lognormal Distribution

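The Lognormal distribution is obtained by log-transforming the data and fitting the normal distribution to the transformed data. This is a HTE distribution with parameters $\mu$ and $\sigma$ denoting the mean and standard deviation of the log-transformed data. The distribution function of the Lognormal is
$$F(x) = \int_{0}^{x} \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(\log u - \mu)^{2}}{2\sigma^{2}}\right\} \frac{du}{u} = \Phi\left(\frac{\log x - \mu}{\sigma}\right), \quad \mu \in \mathbb{R}, \; \sigma > 0.$$
The Lognormal distribution has $\gamma = 0$ and is also heavier tailed than both the LTE and HTE Weibull distributions. The tail of the distribution is given by
$$\bar{F}(x) \sim \frac{\sigma}{\log x \sqrt{2\pi}} \exp\left\{-\frac{(\log x - \mu)^{2}}{2\sigma^{2}}\right\}, \quad x \to \infty.$$
The Lognormal QQ plot is given as follows
$$\left(\Phi^{-1}\left(\frac{i}{n+1}\right), \; \log X_{i,n}\right), \quad i = 1, \ldots, n,$$
where $\Phi^{-1}$ denotes the standard normal quantile function. Let $\varphi$ denote the standard normal density, then the Lognormal derivative plot is presented by
$$\left(\log x_{n-k,n}, \; \frac{H_{k,n}}{N_{k,n}}\right) \quad \text{or} \quad \left(k, \; \frac{H_{k,n}}{N_{k,n}}\right),$$
with $N_{k,n} = \frac{n+1}{k+1}\,\varphi\left(\Phi^{-1}\left(1 - \frac{k+1}{n+1}\right)\right) - \Phi^{-1}\left(1 - \frac{k+1}{n+1}\right)$ since
$$\frac{1}{k}\sum_{j=1}^{k}\Phi^{-1}\left(1 - \frac{j}{n+1}\right) - \Phi^{-1}\left(1 - \frac{k+1}{n+1}\right) \approx \int_{0}^{1}\Phi^{-1}\left(1 - u\,\frac{k+1}{n+1}\right)du - \Phi^{-1}\left(1 - \frac{k+1}{n+1}\right) = N_{k,n}.$$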
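The Lognormal QQ and derivative plot coordinates can be sketched in the same way, using the standard normal quantile function and density from Python's `statistics.NormalDist`. The function names and the sample of exact Lognormal quantiles are hypothetical and serve only to illustrate the formulas above.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides inv_cdf (Phi^{-1}) and pdf (phi)


def hill(x_sorted, k):
    """Hill statistic H_{k,n} computed from the k largest order statistics."""
    n = len(x_sorted)
    log_threshold = math.log(x_sorted[n - k - 1])  # X_{n-k,n}
    return sum(math.log(x) for x in x_sorted[n - k:]) / k - log_threshold


def lognormal_qq(x_sorted):
    """Lognormal QQ plot coordinates (Phi^{-1}(i/(n+1)), log X_{i,n})."""
    n = len(x_sorted)
    return [(STD_NORMAL.inv_cdf(i / (n + 1)), math.log(x))
            for i, x in enumerate(x_sorted, start=1)]


def n_kn(n, k):
    """N_{k,n} = (n+1)/(k+1) * phi(z) - z with z = Phi^{-1}(1 - (k+1)/(n+1))."""
    z = STD_NORMAL.inv_cdf(1.0 - (k + 1) / (n + 1))
    return (n + 1) / (k + 1) * STD_NORMAL.pdf(z) - z


def lognormal_derivative(x_sorted, k):
    """Ordinate H_{k,n} / N_{k,n} of the Lognormal derivative plot at a given k."""
    return hill(x_sorted, k) / n_kn(len(x_sorted), k)


def lognormal_quantile_sample(n, mu, sigma):
    """Hypothetical 'ideal' sample: exact Lognormal(mu, sigma) quantiles at i/(n+1)."""
    return [math.exp(mu + sigma * STD_NORMAL.inv_cdf(i / (n + 1)))
            for i in range(1, n + 1)]
```

Because $N_{k,n}$ comes from an integral approximation of a sum, the ratio $H_{k,n}/N_{k,n}$ for exact Lognormal quantiles is only approximately equal to $\sigma$, whereas the QQ plot slope equals $\sigma$ up to rounding.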

2.5. Pareto Distribution

In statistics, the Pareto distribution is viewed as the prime example of a heavy-tailed distribution. The distribution function of a strict Pareto with shape and scale parameters given by $\alpha$ and $\beta$, respectively, is defined as
$$F(x) = 1 - \left(\frac{x}{\beta}\right)^{-\alpha}, \quad \alpha > 0, \; 0 < \beta < x,$$
and the distribution is sub-exponential for all values of $\alpha$. Since $\log X$ has an Exponential distribution with $\lambda = \alpha$ when $X$ has a strict Pareto($\alpha$) distribution, the Pareto QQ and derivative plots are presented by the following equations,
$$\left(-\log\left(1 - \frac{i}{n+1}\right), \; \log X_{i,n}\right), \quad i = 1, \ldots, n,$$
and
$$\left(\log x_{n-k,n}, \; H_{k,n}\right) \quad \text{or} \quad \left(k, \; H_{k,n}\right),$$
respectively.
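For the Pareto case the derivative plot is simply the Hill statistic as a function of $k$, so the whole curve can be computed in one pass with a running sum. The sketch below is illustrative only; the function names and the sample of exact Pareto quantiles are assumptions made here.

```python
import math


def pareto_qq(x_sorted):
    """Pareto QQ plot coordinates (-log(1 - i/(n+1)), log X_{i,n})."""
    n = len(x_sorted)
    return [(-math.log(1.0 - i / (n + 1)), math.log(x))
            for i, x in enumerate(x_sorted, start=1)]


def pareto_derivative_curve(x_sorted):
    """Pairs (k, H_{k,n}) for k = 1, ..., n-1, using a running sum of log order statistics."""
    n = len(x_sorted)
    logs = [math.log(x) for x in x_sorted]
    curve = []
    running = 0.0
    for k in range(1, n):
        running += logs[n - k]                 # adds log X_{n-k+1,n}
        curve.append((k, running / k - logs[n - k - 1]))  # subtracts log X_{n-k,n}
    return curve


def pareto_quantile_sample(n, alpha, beta=1.0):
    """Hypothetical 'ideal' sample: exact quantiles of F(x) = 1 - (x/beta)**(-alpha)."""
    return [beta * (1.0 - i / (n + 1)) ** (-1.0 / alpha) for i in range(1, n + 1)]
```

With exact Pareto quantiles the QQ plot has slope $1/\alpha$ and the derivative curve stays close to $1/\alpha$; an upward or downward drift in the curve for real data would point to a heavier or lighter tail than the Pareto, respectively.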

2.6. Goodness-of-Fit Test

2.6.1. Kolmogorov–Smirnov (KS) and Anderson–Darling (AD) Tests

To assess how well each of the three distributions in this study fits the NO2 emissions data, the KS and AD tests are used. These tests are used to decide whether or not the data comes from a population with the specified distribution. They test the following hypotheses:
H0. The NO2 emissions data comes from the specified distribution.
H1. The NO2 emissions data does not come from the specified distribution.
If the p-value is smaller than 0.05 (the 5% significance level), then there is strong evidence against the null hypothesis that the NO2 emissions data comes from the specified distribution. The KS and AD test statistics are defined in [33] by Equations (15) and (16), respectively, as
$$D = \max_{1 \le i \le N}\left[F(Y_i) - \frac{i-1}{N}, \; \frac{i}{N} - F(Y_i)\right], \tag{15}$$
and
$$A^{2} = -N - \frac{1}{N}\sum_{i=1}^{N}(2i-1)\left\{\ln[F(Y_i)] + \ln[1 - F(Y_{N-i+1})]\right\}, \tag{16}$$
where $F$ is the theoretical cumulative distribution function of the tested or specified distribution and $Y_i$, $i = 1, 2, 3, \ldots, N$, are the $N$ ordered NO2 emissions data points. The AD test is a modification of the KS test that allocates more weight to the tails and is therefore more sensitive to what happens at the tails of the distribution.
The p-values are calculated using the parametric bootstrap because the traditional KS and AD critical values are invalid when the distribution parameters are estimated from the data [34,35,36,37]. For each of the three distributions (Weibull, Lognormal, and Pareto), 1000 bootstrap samples are generated from the fitted distribution, the parameters are re-estimated on each sample, and empirical p-values are calculated [35].
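The KS and AD statistics and the parametric bootstrap can be sketched as follows for the Lognormal case. This is a minimal illustration under stated assumptions, not the paper's procedure: the function names are hypothetical, the Lognormal is fitted by maximum likelihood on the logs, the clamp inside `ad_statistic` is a numerical guard added here, and the p-value uses the common $(\text{count} + 1)/(B + 1)$ convention, which the paper does not specify.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ks_statistic(y_sorted, cdf):
    """KS statistic D = max_i max(F(Y_i) - (i-1)/N, i/N - F(Y_i))."""
    n = len(y_sorted)
    return max(max(cdf(y) - (i - 1) / n, i / n - cdf(y))
               for i, y in enumerate(y_sorted, start=1))


def ad_statistic(y_sorted, cdf):
    """AD statistic A^2 = -N - (1/N) sum (2i-1){ln F(Y_i) + ln[1 - F(Y_{N-i+1})]}."""
    n = len(y_sorted)
    # Clamp away from 0 and 1 to avoid log(0); a numerical guard, not part of the formula.
    f = [min(max(cdf(y), 1e-12), 1.0 - 1e-12) for y in y_sorted]
    s = sum((2 * i - 1) * (math.log(f[i - 1]) + math.log(1.0 - f[n - i]))
            for i in range(1, n + 1))
    return -n - s / n


def fit_lognormal(data):
    """Maximum likelihood estimates (mu, sigma) from the log-transformed data."""
    logs = [math.log(x) for x in data]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_cdf(mu, sigma):
    """Return the Lognormal(mu, sigma) distribution function as a callable."""
    return lambda x: STD_NORMAL.cdf((math.log(x) - mu) / sigma)


def bootstrap_ks_pvalue(data, n_boot=1000, seed=0):
    """Parametric bootstrap p-value for the KS test of a fitted Lognormal."""
    rng = random.Random(seed)
    mu, sigma = fit_lognormal(data)
    d_obs = ks_statistic(sorted(data), lognormal_cdf(mu, sigma))
    exceed = 0
    for _ in range(n_boot):
        # Simulate from the fitted model, then re-estimate the parameters on the sample.
        sample = [math.exp(rng.gauss(mu, sigma)) for _ in data]
        m, s = fit_lognormal(sample)
        if ks_statistic(sorted(sample), lognormal_cdf(m, s)) >= d_obs:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)
```

Re-estimating the parameters inside the loop is the step that makes the reference distribution of the statistic valid when the parameters are not known in advance; replacing `ks_statistic` by `ad_statistic` in `bootstrap_ks_pvalue` would give the corresponding AD version.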

2.6.2. BIC-Corrected Vuong Test

The BIC-corrected Vuong test compares a pair of non-nested distributions. Its null hypothesis is H0: both models are equally close to the true distribution, and its alternative is H1: one model is closer to the true distribution. If the test statistic V is greater than 0, then Model 1 (the first model) is preferred over Model 2 (the second model) for the data, and if V is less than 0, then Model 2 is preferred [30,38,39,40,41,42]. This test will be used for comparing the three distributions.
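A minimal sketch of the statistic is given below. It assumes the usual form of Vuong's test with a Schwarz (BIC) correction, in which the summed pointwise log-likelihood ratio is reduced by $(p_1 - p_2)\log(n)/2$ and then standardised; the paper does not spell out the formula, so this form, the function name, and the two-sided normal p-value are assumptions made here.

```python
import math
from statistics import NormalDist


def vuong_bic(loglik1, loglik2, p1, p2):
    """BIC-corrected Vuong statistic from pointwise log-likelihoods of two models.

    loglik1, loglik2: per-observation log-densities under Model 1 and Model 2.
    p1, p2: numbers of estimated parameters in each model.
    Returns (V, two-sided p-value under the standard normal reference).
    """
    n = len(loglik1)
    diffs = [a - b for a, b in zip(loglik1, loglik2)]
    lr = sum(diffs)                                   # log-likelihood ratio
    mean_diff = lr / n
    omega2 = sum((d - mean_diff) ** 2 for d in diffs) / n  # variance of pointwise ratios
    correction = (p1 - p2) * math.log(n) / 2.0        # Schwarz (BIC) penalty difference
    v = (lr - correction) / math.sqrt(n * omega2)     # undefined if all diffs are equal
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(v)))
    return v, p_value
```

When both models have the same number of parameters the correction term vanishes and the statistic reduces to the uncorrected Vuong statistic; a positive V favours Model 1 and a negative V favours Model 2, as described above.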

2.6.3. Cross-Validated Predictive Likelihood

The K-fold cross-validated likelihood is implemented to assess out-of-sample predictive performance [43,44,45,46,47]. The NO2 emissions data will be randomly partitioned into $K = 5$ approximately equally sized, mutually exclusive folds. For each fold $k = 1, 2, \ldots, 5$:
(1) the remaining $K - 1$ folds (the training set) will be used to estimate the parameters by maximum likelihood,
(2) the fitted model will then be evaluated on the held-out fold (the testing set),
(3) the cross-validated loglikelihood will be obtained by summing the predictive loglikelihood contributions across all folds.
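The steps above can be sketched as a small generic routine that takes a fitting function and a log-density. The Lognormal and Exponential helpers, the function names, and the fold construction by striding over a shuffled index list are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def make_folds(n, k_folds=5, seed=0):
    """Randomly partition indices 0..n-1 into k_folds nearly equal, disjoint folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k_folds] for i in range(k_folds)]


def cv_loglik(data, fit, logpdf, k_folds=5, seed=0):
    """Sum of held-out log-likelihoods over all folds."""
    total = 0.0
    for held_out in make_folds(len(data), k_folds, seed):
        held = set(held_out)
        train = [x for i, x in enumerate(data) if i not in held]
        params = fit(train)                       # maximum likelihood on K-1 folds
        total += sum(logpdf(data[i], *params) for i in held_out)
    return total


def fit_lognormal(train):
    """Lognormal maximum likelihood estimates (mu, sigma)."""
    logs = [math.log(x) for x in train]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def lognormal_logpdf(x, mu, sigma):
    """Log-density of the Lognormal(mu, sigma) distribution at x > 0."""
    z = (math.log(x) - mu) / sigma
    return -math.log(x) - math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z


def fit_exponential(train):
    """Exponential maximum likelihood estimate of the rate (returned as a 1-tuple)."""
    return (len(train) / sum(train),)


def exponential_logpdf(x, rate):
    """Log-density of the Exponential(rate) distribution at x >= 0."""
    return math.log(rate) - rate * x
```

A higher cross-validated loglikelihood indicates better out-of-sample predictive performance; for example, `cv_loglik(data, fit_lognormal, lognormal_logpdf)` can be compared with the same call using the Exponential helpers.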

2.6.4. The Akaike Information Criterion (AIC) and Schwarz’s Bayesian Information Criterion (BIC)

The AIC and BIC criteria will be used to rank the performance of the distributions. Lower values of the AIC and BIC are desired. Thus, a distribution with the lowest value will be considered the best fitting distribution model from the three.
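For reference, both criteria follow directly from the maximised loglikelihood, the number of estimated parameters $p$, and the sample size $n$. The helper names and the model labels in the ranking function below are hypothetical.

```python
import math


def aic(loglik, p):
    """Akaike Information Criterion: 2p - 2 * loglik."""
    return 2.0 * p - 2.0 * loglik


def bic(loglik, p, n):
    """Schwarz's Bayesian Information Criterion: p * log(n) - 2 * loglik."""
    return p * math.log(n) - 2.0 * loglik


def rank_by_bic(models, n):
    """Sort (label, loglik, p) triples from lowest (best) to highest BIC."""
    return sorted(models, key=lambda m: bic(m[1], m[2], n))
```

Since $\log(n) > 2$ once $n \ge 8$, the BIC penalises each extra parameter more heavily than the AIC for a sample of the size used in this study.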

3. Results

3.1. Data and Data Decomposition

The current paper uses the monthly NO2 emissions data, given in tons, from Eskom’s Majuba coal-fired power station located in Volksrust, Mpumalanga, South Africa, collected from April 2005 to March 2014. The data used for this study, as shown in Table A1 of Appendix A, consists only of aggregated monthly data. No information on the analytical technique (chemiluminescence, FTIR, or NDIR/UV), sampling method (hot–wet or cold–dry), collection frequency (hourly average, daily average, or monthly cumulative), or data quality control processes (internal or external quality assurance/quality control procedures, data calibration and validation, data quality issues or limitations) is available, since the data was obtained from the power utility, Eskom, as is. This limitation, which affects what can be said about temporal variability and measurement uncertainty, is acknowledged and should be taken into account when interpreting the findings.
The R statistical software, version 4.4.2 (2024-10-31), with the packages ReIns (1.0.15) and fitdistrplus (1.2-2), was used to analyse the data.
The STL (Seasonal and Trend decomposition using Loess) [48,49] data decomposition is used to explore the presence of trends and seasonality in the NO2 emissions data, and the results are presented in Figure 1.
Figure 1 shows the STL decomposition of the NO2 emissions data used in this study. The plot in Figure 1 is divided into four components: the actual data, the seasonal component, the trend structure, and the remainder (irregular). Let $Y_t$ be the raw NO2 emissions data, $T_t$ the trend component, $S_t$ the seasonal component, and $R_t$ the remainder component, then $Y_t = T_t + S_t + R_t$, $t = 1, \ldots, n$ [48,50]. Loess smoothing is applied to the seasonal sub-series to obtain the seasonal component; in the periodic case, this smoothing is effectively replaced by taking the mean of each sub-series. The trend is obtained by removing the seasonal values and smoothing the remainder. The overall level is subtracted from the seasonal component and added to the trend component. The remainder component is obtained as the residuals from the seasonal plus trend fit. The advantages of this method can be found in reference [51].
In Figure 1, it can be observed that the data depicts some seasonality (regular patterns). This is supported by the Osborn, Chui, Smith, and Birchenhall (OCSB) test of seasonal unit root, with test statistic = −7.0139 and 5% critical value = −1.803, indicating that the data is seasonally stationary [52]. This suggests that the seasonal pattern is likely deterministic, and therefore no differencing is required to de-seasonalise it. On the other hand, there is a small-scale trend in Figure 1, as evidenced by the Mann–Kendall rank test (p-value = 6.299 × 10⁻⁹ < 0.05), which provides sufficient evidence against the null hypothesis of no trend.
Since the data shows presence of a trend and seasonality, it is detrended and de-seasonalised to approximate stationarity and weak dependence for the subsequent distributional (Weibull, Lognormal, and Pareto distributions) modelling. The resultant dataset is of the form
$$Y_t^{*} = Y_t - T_t - S_t + M_t = R_t + M_t,$$
such that
$$E(Y_t^{*}) = E(Y_t) - E(T_t) - E(S_t) + E(M_t) = E(R_t) + E(M_t) = E(M_t),$$
where $Y_t - T_t - S_t = R_t$, and $M_t$ is the recentering component, namely, the mean of $Y_t$, since in STL, $E(S_t) \approx 0$ and $E(R_t) \approx 0$. $Y_t^{*}$ is thus the detrended, seasonality adjusted NO2 emissions data [49,53,54].
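The recentering step can be illustrated with a few lines of Python, assuming the trend and seasonal components have already been obtained from an STL fit (the decomposition itself is not implemented here). The function name `recentre` and its argument names are hypothetical.

```python
def recentre(y, trend, seasonal):
    """Return Y*_t = Y_t - T_t - S_t + M, where M is the mean of the raw series Y_t.

    y, trend, seasonal: equal-length sequences holding Y_t, T_t and S_t.
    """
    m = sum(y) / len(y)                                   # recentering component M
    remainder = [yt - tt - st for yt, tt, st in zip(y, trend, seasonal)]  # R_t
    return [r + m for r in remainder]
```

If the remainder averages to approximately zero, the adjusted series has approximately the same mean as the raw series, so the detrended, de-seasonalised data stay on the original scale (tons of NO2), which keeps them positive and suitable for fitting the Weibull, Lognormal, and Pareto distributions.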

3.2. Descriptive Statistics

Table 2 gives the descriptive statistics for the detrended, seasonally adjusted NO2 emissions data, $Y_t^{*}$.
The skewness (=0.2359) and kurtosis (=3.0447) values suggest asymmetry of the NO2 emissions data. These observations are confirmed by the QQ plot and histogram in Figure 2. In the upper tail of the QQ plot, the observations are above the theoretical normality line (red), indicating that the upper tail is indeed heavier than that of a normal distribution. Additionally, the histogram is not symmetric but somewhat right skewed.

3.3. Stationarity and Independence Tests

Table 3 presents the normality, stationarity, and independence tests for the NO2 emissions data, $Y_t^{*}$.
The normality of the NO2 emissions data is tested by making use of the KS and AD tests in Table 3. Both tests indicate strong evidence against the null hypothesis of normality of the NO2 emissions data since the p-values are smaller than 0.05.
The two unit root tests employed in this study, namely, the Augmented Dickey–Fuller and Phillips–Perron tests, show strong evidence against the null hypothesis of non-stationarity of the NO2 emissions data. These are supported by the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test, with p-value > 0.05, which gives insufficient evidence to reject the null hypothesis of stationarity of the NO2 emissions data. In addition, all the tests presented in Table 3 that assess the randomness of the NO2 emissions data, with the null hypothesis of independence (randomness), fail to reject that null hypothesis, since all have p-value > 0.05 [14].
Given the limited sample size of 108 observations, the stationarity and randomness test results should be read as indicating that the NO2 emissions data, $Y_t^{*}$, is an approximately stationary time series that may still exhibit residual dependence structures, rather than as definitive proof that the data is independent and identically distributed (iid).

3.4. Shape of the Tail

Figure 3 presents the EVI estimates given by the generalised Hill, the moment, and the PoT estimators. These estimators are chosen because, unlike the Hill and EPD estimators, which accommodate only positive values of gamma ($\gamma > 0$), they can also account for values of gamma smaller than or equal to zero ($\gamma \le 0$); emission data are not always heavy-tailed. The Hill and EPD estimates are also included for comparison purposes, as is the generalised QQ plot.
In Figure 3, all the EVI estimators (the generalised Hill, the moment and the PoT estimators) gave values that are either equal to or less than zero ($\hat{\gamma} \le 0$) for values of $k$, the number of values in the upper tail, greater than six. However, at the top six largest NO2 emissions observations, a heavy-tailed behaviour such as that of the Pareto is suggested, with the generalised Hill, the moment and the PoT estimates greater than zero ($\hat{\gamma} > 0$) in this region. Additionally, the generalised QQ plot decreases, then remains approximately constant, and then increases at the top six largest observations. This supports the observation from the EVI estimates that the tail appears lighter or heavier depending on the choice of $k$. For example, if the number of values in the upper tail is six or fewer ($k \le 6$), the tail is heavy and of Pareto type, whereas if $k > 6$, a light-to-moderately heavy-tailed distribution, such as one with a Gumbel-type tail, may be appropriate for the upper tail. Thus, the Weibull and Lognormal distributions are good candidates to model the main body of the NO2 emissions data, whereas the Pareto distribution is a good candidate solely for the very extreme upper tail ($k \le 6$), and not for the main body of the data.
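The dependence of the EVI estimate on k can be sketched with two of the estimators named in the references: the Hill estimator [32], which is restricted to positive values, and the moment estimator [31], which can also return zero or negative values. The function name and the toy sample are hypothetical; the generalised Hill and PoT estimators are not shown.

```python
import math


def hill_and_moment(data, k):
    """Hill and moment estimates of the EVI from the k largest values
    of a positive sample (requires 2 <= k < n and no fully tied tail)."""
    x = sorted(data)
    n = len(x)
    if not 2 <= k < n:
        raise ValueError("k must satisfy 2 <= k < n")
    log_threshold = math.log(x[n - k - 1])            # log of the (k+1)-th largest value
    excess = [math.log(v) - log_threshold for v in x[n - k:]]
    h1 = sum(excess) / k                               # Hill estimator
    h2 = sum(e * e for e in excess) / k                # second log-excess moment
    moment = h1 + 1.0 - 0.5 / (1.0 - h1 * h1 / h2)     # moment estimator
    return h1, moment


# Hypothetical toy sample: inspect how the estimates change with k.
toy = [6.1, 7.4, 7.9, 8.5, 9.0, 9.6, 10.2, 10.4, 11.1, 11.9, 12.8, 13.9]
for k in (3, 6, 9):
    hill, moment = hill_and_moment(toy, k)
    print(k, round(hill, 3), round(moment, 3))
```

Plotting such estimates against k gives the kind of display summarised in Figure 3: estimates based on very few upper order statistics are highly variable, which is why a sign change at small k should be read with caution.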

3.5. Distribution of the Data

Figure 4 illustrates the additional diagnostic value of the derivative plot when assessing the goodness-of-fit of data to statistical distributions. The derivative plot for the Weibull distribution shows an increasing behaviour, indicating that a distribution with a heavier tail than the Weibull is appropriate for modelling the NO2 emissions data. This is supported by the shape of the Weibull QQ plot, which shows an overall upward curve.
In Figure 5, the histogram, PP, QQ, and derivative plots indicate that the Lognormal is a good fit for most of the NO2 emissions data. In the upper tail of the derivative plot, a small-scale increasing pattern is observed at the six largest observations, suggesting a small deviation from the Lognormal towards heavier-tailed behaviour. The QQ plot agrees, showing a good fit with only minimal deviations from the straight line in the upper tail. Overall, the Lognormal distribution is a good fit for the data, while the largest six observations are consistent with a Pareto-type tail behaviour.
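The logic of the Lognormal QQ plot can be sketched as follows: the logarithms of the ordered data are plotted against standard normal quantiles, and a straight line indicates a Lognormal fit, with slope and intercept approximating the log-scale standard deviation and mean. The plotting position i/(n + 1) and the function names are assumptions; this is not the derivative plot itself, only the QQ plot on which it is based.

```python
import math
from statistics import NormalDist


def lognormal_qq(data):
    """Lognormal QQ coordinates: (standard normal quantile, log order statistic)."""
    x = sorted(data)
    n = len(x)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf(i / (n + 1)), math.log(v))
            for i, v in enumerate(x, start=1)]


def fit_line(points):
    """Least-squares slope and intercept through the QQ points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy sample, not the Majuba data.
toy = [7900, 8400, 9100, 9600, 10100, 10500, 11000, 11600, 12300, 13400]
print(fit_line(lognormal_qq(toy)))
```

Upper-tail points that rise above the fitted line would signal a tail heavier than the Lognormal, which is the pattern described for the six largest observations.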
All diagnostic plots in Figure 6 show that the data significantly deviates from the Pareto distribution. The PP plot is not linear. Additionally, the Pareto derivative plot has a decreasing pattern and the QQ plot is concave, implying that a lighter tailed distribution than the Pareto distribution may be a good fit for the data. This is not surprising since it has already been observed that the Lognormal derivative plot is constant. However, at the largest six points, a somewhat constant shape is observed.
In this study, it is important to note that the Pareto distribution is intended mainly as a benchmark for tail behaviour rather than as a genuine full-distribution competitor.

3.6. Goodness-of-Fit Test of the Data

Table 4 presents the bootstrap goodness-of-fit tests (KS and AD) and information criteria (AIC and BIC) for assessing how well a distribution fits the NO2 emissions data from Eskom’s Majuba power station.
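A parametric-bootstrap KS test for the Lognormal can be sketched as below. The details (number of resamples, re-estimating the parameters in every resample, and the (count + 1)/(B + 1) p-value) are assumptions about a typical implementation rather than a description of the software used for Table 4.

```python
import math
import random
from statistics import NormalDist


def fit_lognormal(data):
    """ML estimates (mu, sigma) of a Lognormal distribution."""
    logs = [math.log(v) for v in data]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma


def ks_statistic(data, mu, sigma):
    """KS distance between the empirical CDF and Lognormal(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    x = sorted(data)
    n = len(x)
    d = 0.0
    for i, v in enumerate(x, start=1):
        f = dist.cdf(math.log(v))
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def bootstrap_ks_pvalue(data, n_boot=500, seed=1):
    """Return (observed KS statistic, parametric-bootstrap p-value)."""
    rng = random.Random(seed)
    mu, sigma = fit_lognormal(data)
    d_obs = ks_statistic(data, mu, sigma)
    exceed = 0
    for _ in range(n_boot):
        sample = [rng.lognormvariate(mu, sigma) for _ in data]
        # Refit on each resample so that parameter estimation is accounted for.
        d_star = ks_statistic(sample, *fit_lognormal(sample))
        if d_star >= d_obs:
            exceed += 1
    return d_obs, (exceed + 1) / (n_boot + 1)


# Hypothetical toy sample, not the Majuba data.
toy = [7900, 8400, 9100, 9600, 10100, 10500, 11000, 11600, 12300, 13400]
print(bootstrap_ks_pvalue(toy, n_boot=200))
```

The bootstrap is used because the standard KS critical values assume fully specified parameters, whereas here they are estimated from the same data.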
The BIC-corrected Vuong test with the null H0: Both models are equally close to the true distribution is used to compare a pair of non-nested distributions in the following order, Lognormal vs. Weibull, Lognormal vs. Pareto, and Weibull vs. Pareto. The former of a pair is considered Model 1, while the latter is Model 2 [30,38,39,40,41,42].
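The structure of the statistic can be sketched from pointwise log-likelihoods, following the BIC-type correction discussed by Vuong [38] and Desmarais and Harden [30]. The function name and the two-sided p-value are assumptions made for illustration.

```python
import math


def vuong_bic(loglik1, loglik2, n_params1, n_params2):
    """BIC-corrected Vuong statistic V and two-sided normal p-value.

    loglik1 and loglik2 hold the log-density of each observation
    under Model 1 and Model 2, respectively.
    """
    n = len(loglik1)
    diffs = [a - b for a, b in zip(loglik1, loglik2)]
    lr = sum(diffs)                                    # log-likelihood ratio
    mean = lr / n
    omega = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    correction = 0.5 * (n_params1 - n_params2) * math.log(n)
    v = (lr - correction) / (math.sqrt(n) * omega)
    p_value = math.erfc(abs(v) / math.sqrt(2))
    return v, p_value


# Hypothetical pointwise log-likelihoods for two fitted models.
ll_model1 = [-1.0, -1.2, -0.9, -1.1]
ll_model2 = [-1.5, -1.4, -1.6, -1.3]
print(vuong_bic(ll_model1, ll_model2, 2, 2))
```

V > 0 favours Model 1 and V < 0 favours Model 2. Because the three candidate distributions each have two parameters, the correction term is zero in these pairwise comparisons and the statistic coincides with the uncorrected Vuong statistic.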
In Table 4, both the bootstrap KS and AD tests give p-values that are greater than 0.05 for the Lognormal distribution, so the null hypothesis that the data follow the Lognormal distribution is not rejected. Among the three distributions, the Lognormal yielded the lowest test statistic and the highest p-value, making it the best-performing of the three candidates for representing the NO2 emissions data. This is consistent with the lowest values of AIC = 1819.31 and BIC = 1824.67 generated by the Lognormal distribution. The BIC-corrected Vuong test in Table 5 demonstrates that the Lognormal distribution significantly outperforms the Weibull and Pareto distributions, yielding V > 0 with a p-value below 0.05 in both comparisons in which the Lognormal is Model 1. The Lognormal distribution also achieves the highest cross-validated log-likelihood in Table 4, indicating better predictive performance than the Weibull and Pareto distributions. However, the cross-validated log-likelihood should be interpreted cautiously and only alongside the other criteria.
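The information criteria can be sketched directly from the maximised log-likelihood. The snippet below uses the closed-form Lognormal ML fit and the standard definitions AIC = 2p − 2ℓ and BIC = p ln(n) − 2ℓ, with helper names that are illustrative. It also checks that the reported Lognormal values are mutually consistent: for n = 108 and p = 2, BIC − AIC should equal 2(ln 108 − 2).

```python
import math


def lognormal_loglik(data):
    """Maximised Lognormal log-likelihood (ML estimates plugged in)."""
    logs = [math.log(v) for v in data]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    return -sum(logs) - 0.5 * n * (math.log(2 * math.pi * var) + 1)


def aic_bic(loglik, n_params, n_obs):
    """Akaike and Schwarz (Bayesian) information criteria."""
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n_obs) - 2 * loglik
    return aic, bic


# Consistency check on the reported values: BIC - AIC = p * (ln(n) - 2).
expected_gap = 2 * (math.log(108) - 2)
reported_gap = 1824.67 - 1819.31
print(round(expected_gap, 2), round(reported_gap, 2))  # 5.36 5.36
```

The two gaps agree to two decimals, which is what one would expect if both criteria were computed from the same two-parameter fit to 108 observations.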
Table 6 presents the maximum likelihood (ML) parameter estimates of the three distributions. The very large estimates of the Pareto shape and scale parameters, together with the analysis software’s failure to produce standard errors for them, indicate that this distribution is inappropriate for central fitting.

3.7. Tail Selection

It has already been determined that the Lognormal distribution outperformed the Weibull and Pareto distributions in modelling the full body of the data. In this section, the three distributions are compared for fitting in the tail of the NO2 emissions data. As observed earlier in the EVI estimates and QQ and derivative plots, the largest six values demonstrated a Pareto-type behaviour. However, for reliable results, the BIC-corrected Vuong test is performed for larger values of $k$ ($\ge 11$). Table 7 presents the results for the test.
The results in Table 7 indicate that from the three distributions used in this paper, the Lognormal distribution is the most plausible distribution to model the upper tail of the NO2 emissions data since across all k values, values of the BIC are smallest and V > 0 with p-value < 0.05 for the BIC-corrected Vuong test where the Lognormal is the first model (Model 1). The consistency in the BIC, the statistic V and p-values, favouring the Lognormal distribution across k , signifies stability since data can show different tail behaviours due to effects of the sample size and composition [58].
The Pareto-type behaviour, that is, the potential presence of extremes, was suggested by the EVI, QQ and derivative plots for the top six largest observations ($k \le 6$), indicating potential extreme upper tail risk. The EVI estimates in this region are, however, based on only a few data points, to which they are very sensitive, and are therefore uncertain. With an increase in the upper tail sample size $k$ ($> 6$), the estimates become more stable and move closer to a Lognormal distribution. This demonstrates the well-known bias-variance trade-off: as the threshold increases, the number of observations is reduced, resulting in more variance and less bias, whereas a lower threshold results in more observations, a drop in variance, and an increase in bias.
The goal is to model the overall, more stable upper tail of the NO2 emissions data, which is better modelled by the Lognormal distribution and corresponds to typical exceedances of a certain value in risk assessment practice. However, for very extreme NO2 emissions values, a Pareto-type behaviour is suggested, corresponding to the worst-case scenarios in practice. This tail behaviour is common in application; see Bee et al. [58] and references therein. In theory, the fitted Lognormal or Pareto distributions can be used to estimate the probability of exceeding a regulatory limit, $T$, such as those set by the World Health Organisation or the South African National Air Quality Standards, by calculating $P(X > T)$. This is, however, beyond the scope of the current paper.
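For orientation only, the form of such an exceedance calculation can be sketched as follows. The Lognormal expression is the standard survival function; the Pareto-type expression is a common tail approximation anchored at a tail threshold u exceeded by a fraction k/n of the data. All parameter values, names, and the threshold below are hypothetical and are not estimates from this paper.

```python
import math


def lognormal_exceedance(limit, mu, sigma):
    """P(X > T) for X ~ Lognormal(mu, sigma)."""
    z = (math.log(limit) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


def pareto_tail_exceedance(limit, tail_start, gamma, tail_fraction):
    """P(X > T) ~ (k/n) * (T/u)^(-1/gamma) for T >= u and gamma > 0."""
    if limit < tail_start:
        raise ValueError("limit must be at least the tail threshold")
    return tail_fraction * (limit / tail_start) ** (-1.0 / gamma)


# Hypothetical inputs chosen only to show the calls.
print(lognormal_exceedance(13000.0, mu=9.25, sigma=0.17))
print(pareto_tail_exceedance(14000.0, tail_start=13000.0, gamma=0.2,
                             tail_fraction=6 / 108))
```

The two functions answer the same question under different tail assumptions, so comparing them for a given limit shows how sensitive an exceedance estimate is to the choice of tail model.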

4. Discussion

The current paper fitted and compared three statistical parent probability distributions, namely, the Weibull, Lognormal, and the Pareto distributions to the transformed monthly NO2 emissions data from Majuba power station. These distributions cater for varied heaviness of distributions of the tails and are simple (with only two parameters) and possess similar asymptotic behaviour to many more complex distributions [23]. The aim is the selection of the distributions most suitable, from the three evaluated, for the modelling of NO2 emissions data from Majuba power station throughout the levels of the emission.
The constant-to-increasing shape of the generalised QQ plot and the negative-to-positive EVI estimates suggested the appropriateness of a light- to heavy-tailed distribution depending on the choice of $k$, the number of exceedances over a selected threshold. The derivative plots for the Weibull show an overall increasing behaviour, while for the Pareto distribution, a decreasing behaviour is observed, indicating the appropriateness of a distribution with a lighter tail than the Pareto distribution but a heavier tail than the Weibull distribution for the modelling of the full body of the NO2 emissions data, namely, in our case, the Lognormal distribution. The histogram, PP, and QQ plots show that the Lognormal provided the best fit among the three candidate models considered for the full body of the NO2 emissions data. This is supported by the bootstrap KS and AD tests, the highest cross-validated log-likelihood values, and the lowest AIC and BIC values, and is confirmed by the BIC-corrected Vuong test.
As in the modelling of the full body of the data, the BIC and BIC-corrected Vuong test were again used to compare the three distributions in the upper tail and the Lognormal proved to be a consistent fit in the upper tail across different k values. This indicates the stability of the chosen distribution in the upper tail. However, as shown in the EVI estimates and derivative plots, a Pareto behaviour is evident in the top six observations and cannot be ignored. The presence of such extreme emissions requires careful consideration, as extremes are inherently rare yet possess significant implications. A few extremes could determine whether the regulation limit results in a manageable consequence or one that could financially incapacitate the company.
The use of the Lognormal in modelling emissions is common, see references [59,60,61,62,63,64]. However, these studies did not pay particular attention to the tail-heaviness based on ordering of the distributions used, and did not apply the very flexible derivative plot to obtain the best fitting distribution. This is the novelty of the current study.
The heavy reliance of modern-day society in SA on electricity implies that any disturbance to its supply severely affects both day-to-day life and the economy. Since 2019, many power stations in SA have reached the end of their expected life span. However, decommissioning is difficult due to the limited renewable energy alternatives and the ever-increasing electricity demand, implying an increase in emissions. For these reasons, understanding and quantifying the behaviour of emissions from coal-fired power stations is important for the management and reduction of these emissions. Graphical techniques based on the QQ and derivative plots of the Weibull, Lognormal, and Pareto distributions presented in this study offer a good option for characterising emissions. This paper proposes the Lognormal distribution for the full dataset and the upper tail. However, for the very top six largest observations in the upper tail, a Pareto-type tail is evident. This result is common in application; see Bee et al. [58] and references therein. These findings may assist in understanding the patterns in the distribution of NO2 emissions. The practical implication of a heavy tail like the Pareto is that it models large-magnitude NO2 emissions as more frequent than lighter tails like the Weibull and Lognormal tails do [23].

4.1. Limitations

Since the data showed the presence of a trend and seasonality, it was detrended and de-seasonalised based on the results of the STL decomposition. The remainder was then centred to achieve independence and identical distribution (iid) while maintaining the original scale of the data through $M_t$ (the mean of $Y_t$). Nevertheless, the limited sample size may compromise the data’s independence and identical distribution, despite the randomness tests indicating otherwise. The inability to reject the null hypothesis of randomness/independence should not be interpreted as conclusive evidence of the independent and identically distributed (iid) nature of the NO2 emissions data, but rather as a means to derive an approximate stationary time series that might display residual dependence structures. Furthermore, the variability arising from the STL decomposition is not considered in distribution fitting, as this decomposition is a smoothing-based technique that does not inherently address uncertainty in the subsequent modelling phases. Therefore, the parametric models employed in this study should not be interpreted as exact representations assuming independent and identically distributed (iid) data, but rather as approximations of the marginal distribution of the transformed dataset. However, where weak and short-range dependence is present, the current approach is practical and commonly used in environmental applications. The smoothing process is integrated into the software and operates automatically within the statistics package used.
The data used for this study consists only of aggregated monthly data, and the analysis is thus conducted at this summed level. No details regarding the analytical technique (chemiluminescence, FTIR, or NDIR/UV), sampling method (hot–wet or cold–dry), collection frequency (hourly average, daily average, or monthly cumulative), and data quality control processes (internal or external quality assurance/quality control procedures, data calibration and validation, data quality issues or limitations) are accessible, as the data was sourced from the power utility, Eskom, in its original form. This limitation should be taken into account in the interpretation of the findings, particularly regarding time-related changes and measurement uncertainty.
This paper is limited to the statistical characterisation of NO2 emissions. Its robustness and generalisability may be enhanced in future studies by incorporating (1) dispersion modelling, (2) the calculation of the probability of exceeding regulatory limits, (3) data from other Eskom power stations, and (4) meteorological data, since emissions are strongly related to meteorological conditions, e.g., rainfall events.
Derivative plots have the limitation that they are interpreted qualitatively (based on visual assessment), which introduces subjectivity into their analysis. For example, in larger samples/components, the patterns (increasing, decreasing or constant trend) in the plot are clear, while for smaller samples/components, it may be difficult to decide on the direction of the slope. Therefore, they are used as a qualitative diagnostic in conjunction with quantitative tools, rather than as independent decision rules.
To facilitate their interpretation and ensure objectivity, in this study, they are interpreted together with quantitative tools such as the AIC, BIC, bootstrap goodness-of-fit tests (KS and AD), cross-validated loglikelihood, EVI estimates, and Vuong test. Additionally, various plots, mainly the QQ plot, along with PP and density plots, are used to support the derivative plot and arrive at a balanced conclusion [24]. Subjectivity is further reduced by using the k values in the derivative plots instead of the individual data points.

4.2. Future Studies

Future studies may consider more sophisticated approaches, such as explicit time series analysis models or models that cater for distributional properties and temporal structure (also catering for dependence).
For tail fitting, a more relevant family of distributions from the extreme value framework focusing on only the extreme values (very high or low values) may be considered, namely, the GEVD and GPD models. These distributions are more capable of handling such extremes at the expense of losing information provided by statistical parent distributions such as the Lognormal. However, the methods employed here lay a good foundation before fitting these extreme value distributions.

5. Conclusions

The aim of the current study was to find the most suitable distribution(s) to represent NO2 emissions data from Majuba power station. This was done by comparing the derivative plots together with the QQ, PP, and density plots of three distributions of varying tail heaviness, namely, (1) the Weibull, which can be lighter or heavier tailed than the Exponential distribution if $\tau > 1$ or $\tau < 1$, respectively, and two heavy-tailed distributions, (2) the Lognormal and (3) the Pareto distributions.
The derivative plot offers the benefit of piecewise distribution fitting, while the methodologies utilised in this paper provide a probabilistic framework for central and tail fitting of parent distributions to emissions data, in particular NO2 emissions from power utilities like Eskom. These techniques can be used to enhance assessment of emission related risk and future exceedance probability estimation when jointly used with health thresholds, atmospheric dispersion, or policy scenarios. The study also gives a good foundation before other sophisticated methods or distributions such as fitting of the generalised extreme value distribution (GEVD) and/or the generalised Pareto distribution (GPD) can be considered. These methods focus exclusively on the tail distribution of the data.

Author Contributions

Writing the original draft of this manuscript, M.W.M.; review, editing, and supervision, D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Emissions of nitrogen dioxide (NO2) in tons.
7951 | 7791 | 9381 | 10,449.26779 | 9284.955938 | 12,111.88881 | 11,557.36505 | 9580 | 13,005
7955 | 10,410 | 11,503 | 11,408.34552 | 10,740.62565 | 12,544.99748 | 11,846.01536 | 11,712 | 13,257
8028 | 7334 | 12,106 | 9668.793759 | 9779.458636 | 11,123.66211 | 11,277.75171 | 11,112 | 12,930
7844 | 8478 | 11,356 | 11,638.42621 | 10,078.26273 | 11,017.08322 | 10,416 | 10,149 | 12,844
6369 | 10,547 | 13,219 | 11,547.9462 | 9286.892432 | 10,370.19177 | 11,261 | 12,903 | 10,808
7470 | 7999 | 9837 | 13,179.70739 | 9618.821848 | 9522.406988 | 12,277 | 12,829 | 11,329
8538 | 8394 | 10,253 | 10,427.70967 | 10,153.32493 | 10,406.49334 | 11,459.64895 | 13,075 | 10,999
7750 | 6301 | 10,221 | 8904.369469 | 9943.661032 | 11,380.86745 | 10,686.1989 | 13,739 | 10,641
6996 | 5400 | 11,738 | 9196 | 10,864.4793 | 11,687.01793 | 13,027.66605 | 10,412 | 9670
9076 | 8822 | 10,214 | 9819.359517 | 11,452.04641 | 11,776.01469 | 13,565.87818 | 10,743 | 9293
8131 | 8560 | 10,240 | 9582.87134 | 12,242.48433 | 10,611.65925 | 10,540.09094 | 9301 | 10,445
6146 | 8916 | 12,767 | 13,923.24532 | 12,189.72768 | 10,855.8829 | 10,706.92656 | 13,078 | 11,511

References

  1. Pollet, B.G.; Staffell, I.; Adamson, K.-A. Current Energy Landscape in the Republic of South Africa. Int. J. Hydrogen Energy 2015, 40, 16685–16701.
  2. Nkambule, N.P.; Blignaut, J.N. Externality Costs of the Coal-Fuel Cycle: The Case of Kusile Power Station. S. Afr. J. Sci. 2017, 113, 9.
  3. Nogaya, G.; Nwulu, N.I.; Gbadamosi, S.L. Repurposing South Africa’s Retiring Coal-Fired Power Stations for Renewable Energy Generation: A Techno-Economic Analysis. Energies 2022, 15, 5626.
  4. Shikwambana, L.; Mhangara, P.; Mbatha, N. Trend Analysis and First Time Observations of Sulphur Dioxide and Nitrogen Dioxide in South Africa Using TROPOMI/Sentinel-5 P Data. Int. J. Appl. Earth Obs. Geoinf. 2020, 91, 102130.
  5. Mukwevho, P.; Retief, F.; Burger, R.; Moolna, A. Identifying Critical Assumptions and Risks in Air Quality Management Planning Using Theory of Change Approach. Clean Air J. 2024, 34, 16571.
  6. Boden, T.A.; Marland, G.; Andres, R.J. Global, Regional, and National Fossil-Fuel CO2 Emissions. In Carbon Dioxide Information Analysis Center (CDIAC) Datasets; U.S. Department of Energy Office of Scientific and Technical Information: Oak Ridge, TN, USA, 2010.
  7. Foster, E.; Contestabile, M.; Blazquez, J.; Manzano, B.; Workman, M.; Shah, N. The Unstudied Barriers to Widespread Renewable Energy Deployment: Fossil Fuel Price Responses. Energy Policy 2017, 103, 258–264.
  8. Monn, C. Exposure Assessment of Air Pollutants: A Review on Spatial Heterogeneity and Indoor/Outdoor/Personal Exposure to Suspended Particulate Matter, Nitrogen Dioxide and Ozone. Atmos. Environ. 2001, 35, 1–32.
  9. Li, Z.; Yu, Y.; Jia, L.; Wu, Y.; Cheng, P.; Zhang, Z.; Li, Z.; Fan, C.; Guo, X. Thermal Characteristic Analysis and Performance Optimization of a Novel Heating Boiler Based on a Porous Media Model. Appl. Therm. Eng. 2026, 289, 130035.
  10. Cheng, P.; Li, Z.; Zheng, Y.; Meng, Q.; Yu, Y.; Jin, Y.; Gao, X.; Guo, X.; Jia, L. Study on the Regulation of Performance and Hg0 Removal Mechanism of MIL-101(Fe)-Derived Carbon Materials. Sep. Purif. Technol. 2025, 379, 134939.
  11. Marchant, C.; Leiva, V.; Cavieres, M.F.; Sanhueza, A. Air Contaminant Statistical Distributions with Application to PM10 in Santiago, Chile. In Reviews of Environmental Contamination and Toxicology; Springer: New York, NY, USA, 2013; pp. 1–31.
  12. Kan, H.-D.; Chen, B.-H. Statistical Distributions of Ambient Air Pollutants in Shanghai, China. Biomed. Environ. Sci. 2004, 17, 366–372.
  13. Nwaigwe, C.C.; Ogbonna, C.J.; Achem, O. On the Modeling of Carbon Monoxide Flaring in Nigeria. Int. J. Stat. Probab. 2018, 7, 94.
  14. Okorie, I.E.; Akpanta, A.C.; Osu, B.O. Flexible Heavy Tail Distributions for Surface Ozone for Selected Sites in the United States of America. Ozone Sci. Eng. 2019, 41, 473–488.
  15. Plocoste, T.; Calif, R.; Euphrasie-Clotilde, L.; Brute, F.-N. The Statistical Behavior of PM10 Events over Guadeloupean Archipelago: Stationarity, Modelling and Extreme Events. Atmos. Res. 2020, 241, 104956.
  16. Intarapak, S.; Supapakorn, T. Investigation on the Statistical Distribution of PM2.5 Concentration in Chiang Mai, Thailand. WSEAS Trans. Environ. Dev. 2021, 17, 1219–1227.
  17. Oguntunde, P.E.; Odetunmibi, O.A.; Adejumo, A.O. A Study of Probability Models in Monitoring Environmental Pollution in Nigeria. J. Probab. Stat. 2014, 2014, 864965.
  18. Giavis, G.M.; Kambezidis, H.D.; Lykoudis, S.P. Frequency Distribution of Particulate Matter (PM10) in Urban Environments. Int. J. Environ. Pollut. 2009, 36, 99.
  19. Hamid, H.A.; Jaffar, I.; Raffee, A.F. Two-Parameter Central Fitting Distribution to Predict the Concentration of Ground Level Ozone: Case Study in Industrial Area. AIP Conf. Proc. 2018, 2013, 020055.
  20. Lu, H.C.; Fang, G.C. Predicting the Exceedances of a Critical PM10 Concentration—A Case Study in Taiwan. Atmos. Environ. 2003, 37, 3491–3499.
  21. Martins, L.D.; Wikuats, C.F.H.; Capucim, M.N.; de Almeida, D.S.; da Costa, S.C.; Albuquerque, T.; Barreto Carvalho, V.S.; de Freitas, E.D.; de Fátima Andrade, M.; Martins, J.A. Extreme Value Analysis of Air Pollution Data and Their Comparison between Two Large Urban Regions of South America. Weather Clim. Extrem. 2017, 18, 44–54.
  22. El Adlouni, S.; Bobée, B.; Ouarda, T.B.M.J. On the Tails of Extreme Event Distributions in Hydrology. J. Hydrol. 2008, 355, 16–33.
  23. Papalexiou, S.M.; Koutsoyiannis, D.; Makropoulos, C. How Extreme Is Extreme? An Assessment of Daily Rainfall Distribution Tails. Hydrol. Earth Syst. Sci. 2013, 17, 851–862.
  24. Albrecher, H.; Beirlant, J.; Teugels, J.L. Reinsurance: Actuarial and Statistical Aspects; John Wiley & Sons: Hoboken, NJ, USA, 2017.
  25. Beirlant, J.; Bladt, M. Tail Classification Using Non-Linear Regression on Model Plots. Extremes 2025, 28, 345–369.
  26. Albrecher, H.; Araujo-Acuna, J.C.; Beirlant, J. Tempered Pareto-Type Modelling Using Weibull Distributions. ASTIN Bull. 2021, 51, 509–538.
  27. Jakata, O.; Chikobvu, D. Estimation of Financial Risk Using the Archimedean Gumbel Copula with Log-Normal Distributed Marginals. J. Stat. Appl. Probab. 2025, 14, 543–560.
  28. Reynkens, T. Using the ReIns Package. Available online: https://cran.r-project.org/web/packages/ReIns/vignettes/ReIns.html (accessed on 23 February 2026).
  29. Bader, B.; Yan, J.; Zhang, X. Automated Threshold Selection for Extreme Value Analysis via Ordered Goodness-of-Fit Tests with Adjustment for False Discovery Rate. Ann. Appl. Stat. 2018, 12, 310–329.
  30. Desmarais, B.A.; Harden, J.J. Testing for Zero Inflation in Count Models: Bias Correction for the Vuong Test. Stata J. Promot. Commun. Stat. Stata 2013, 13, 810–835.
  31. Dekkers, A.L.M.; Einmahl, J.H.J.; De Haan, L. A Moment Estimator for the Index of an Extreme-Value Distribution. Ann. Stat. 1989, 17, 1833–1855.
  32. Hill, B.M. A Simple General Approach to Inference About the Tail of a Distribution. Ann. Stat. 1975, 3, 1163–1174.
  33. de Souza, A.; Aristone, F.; Fernandes, W.A.; Oliveira, A.P.G.; Olaofe, Z.; Abreu, M.C.; de Oliveira, J.F., Jr.; Cavazzana, G.; dos Santos, C.M.; Pobocikova, I. Analysis of Ozone Concentrations Using Probability Distributions. Ozone Sci. Eng. 2020, 42, 539–550.
  34. D’Agostino, R.B.; Stephens, M.A. Goodness-of-Fit Techniques; Dekker: New York, NY, USA, 1986.
  35. Efron, B. Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 1979, 7, 1–26.
  36. Stephens, M.A. EDF Statistics for Goodness of Fit and Some Comparisons. J. Am. Stat. Assoc. 1974, 69, 730–737.
  37. MacKinnon, J.G. Bootstrap Inference in Econometrics. Can. J. Econ. Can. D’écon. 2002, 35, 615–645.
  38. Vuong, Q.H. Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses. Econometrica 1989, 57, 307.
  39. Clarke, K.A. A Simple Distribution-Free Test for Nonnested Model Selection. Polit. Anal. 2007, 15, 347–363.
  40. Karaivanov, A. Financial Constraints and Occupational Choice in Thai Villages. J. Dev. Econ. 2012, 97, 201–220.
  41. Fafchamps, M. Sequential Labor Decisions Under Uncertainty: An Estimable Household Model of West-African Farmers. Econometrica 1993, 61, 1173.
  42. Schneider, L.; Chalmers, R.P.; Debelak, R.; Merkle, E.C. Model Selection of Nested and Non-Nested Item Response Models Using Vuong Tests. Multivar. Behav. Res. 2020, 55, 664–684.
  43. Stone, M. Cross-Validatory Choice and Assessment of Statistical Prediction. J. R. Stat. Soc. Ser. B 1974, 36, 111–147.
  44. Geisser, S. The Predictive Sample Reuse Method with Applications. J. Am. Stat. Assoc. 1975, 70, 320–328.
  45. Gelfand, A.E.; Dey, D.K.; Chang, H. Model Determination Using Predictive Distributions with Implementation via Sampling-Based Methods. In Bayesian Statistics 4; Oxford University Press: Oxford, UK, 1992; pp. 147–167.
  46. Vehtari, A.; Gelman, A.; Gabry, J. Practical Bayesian Model Evaluation Using Leave-One-out Cross-Validation and WAIC. Stat. Comput. 2017, 27, 1413–1432.
  47. Arlot, S.; Celisse, A. A Survey of Cross-Validation Procedures for Model Selection. Stat. Surv. 2010, 4, 40–79.
  48. Rojo, J.; Rivero, R.; Romero-Morte, J.; Fernández-González, F.; Pérez-Badia, R. Modeling Pollen Time Series Using Seasonal-Trend Decomposition Procedure Based on LOESS Smoothing. Int. J. Biometeorol. 2017, 61, 335–348.
  49. Cleveland, R.B.; Cleveland, W.S.; McRae, J.E.; Terpenning, I. STL: A Seasonal-Trend Decomposition Procedure Based on Loess. J. Off. Stat. 1990, 6, 3–73.
  50. Wang, X.; Smith, K.; Hyndman, R. Characteristic-Based Clustering for Time Series Data. Data Min. Knowl. Discov. 2006, 13, 335–364.
  51. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice; OTexts: Melbourne, Australia, 2018.
  52. Osborn, D.R.; Chui, A.P.L.; Smith, J.P.; Birchenhall, C.R. Seasonality and the Order of Integration for Consumption. Oxf. Bull. Econ. Stat. 1988, 50, 361–377.
  53. Brockwell, P.J.; Davis, A.R. Introduction to Time Series and Forecasting, 2nd ed.; Springer: Cham, Switzerland, 2002.
  54. Shumway, R.H.; Stoffer, D.S. Time Series Analysis and Its Applications, 4th ed.; Springer: Cham, Switzerland, 2017.
  55. Mateus, A.; Caeiro, F. An R Implementation of Several Randomness Tests. AIP Conf. Proc. 2014, 1618, 531–534.
  56. Moore, G.H.; Wallis, W.A. Time Series Significance Tests Based on Signs of Differences. J. Am. Stat. Assoc. 1943, 38, 153–164.
  57. Cox, D.R.; Stuart, A. Some Quick Sign Tests for Trend in Location and Dispersion. Biometrika 1955, 42, 80–95.
  58. Bee, M.; Riccaboni, M.; Schiavo, S. Pareto versus Lognormal: A Maximum Entropy Test. Phys. Rev. E 2011, 84, 026104.
  59. Deepa, A.; Shiva Nagendra, S.M. Statistical Distribution Models for Urban Air Quality Management. In Advances in Geosciences Volume 16: Atmospheric Science (AS); World Scientific: Singapore, 2010; pp. 285–297.
  60. Taylor, J.A.; Jakeman, A.J.; Simpson, R.W. Modeling Distributions of Air Pollutant Concentrations—I. Identification of Statistical Models. Atmos. Environ. 1986, 20, 1781–1789.
  61. Gulia, S.; Nagendra, S.M.S.; Khare, M. Extreme Events of Reactive Ambient Air Pollutants and Their Distribution Pattern at Urban Hotspots. Aerosol Air Qual. Res. 2017, 17, 394–405.
  62. Sharma, S.; Sharma, P.; Khare, M.; Kwatra, S. Statistical Behavior of Ozone in Urban Environment. Sustain. Environ. Res. 2016, 26, 142–148.
  63. Aleksandropoulou, V.; Eleftheriadis, K.; Diapouli, E.; Torseth, K.; Lazaridis, M. Assessing PM10 Source Reduction in Urban Agglomerations for Air Quality Compliance. J. Environ. Monit. 2012, 14, 266–278.
  64. Maciejewska, K.; Juda-Rezler, K.; Reizer, M.; Klejnowski, K. Modelling of Black Carbon Statistical Distribution and Return Periods of Extreme Concentrations. Environ. Model. Softw. 2015, 74, 212–226.
Figure 1. The STL decomposition of NO2 emissions from Majuba power station: actual data, seasonal, trend and remainder components.
Figure 2. The QQ plot (a) and histogram (b) of NO2 emissions from Majuba power station. In both graphs, the red line denotes the fitted normal distribution (theoretical). The blue dotted line in the QQ plot represents the data points, and the dotted grey line denotes the 95% confidence interval.
Figure 3. The EVI estimates plot (a) and the generalised quantile–quantile (QQ) plot (b). The circles in the generalised QQ plot indicate the NO2 emissions data points.
Figure 4. The histogram (a), PP plot (b), QQ plot (c) and the derivative plot (d) for the Weibull distribution. The circles in the PP, QQ and derivative plots indicate the NO2 emissions data points, while the straight line in the PP plot indicates the expected fit if the distribution well represents the data.
Figure 5. The histogram (a), PP plot (b), QQ plot (c), and the derivative plot (d) for the Lognormal distribution. The circles in the PP, QQ and derivative plots indicate the NO2 emissions data points, while the straight line in the PP plot indicates the expected fit if the distribution well represents the data.
Figure 6. The histogram (a), PP plot (b), QQ plot (c), and the derivative plot (d) for the Pareto distribution. The circles in the PP, QQ and derivative plots indicate the NO2 emissions data points, while the straight line in the PP plot indicates the expected fit if the distribution well represents the data.
Table 1. Comparison of past studies and the current study.
| Study | Type of Data Used (Location) | Methodology Used | Analysis of Tail Distribution? | Main Study Limitations |
|---|---|---|---|---|
| Kan et al. [12]; Nwaigwe et al. [13]; Intarapak et al. [16]; Oguntunde et al. [17]; Giavis et al. [18]; Hamid et al. [19] | Pollutant data | Parametric modelling, GOF tests | No | Focus on overall distribution, no tail emphasis |
| Okorie et al. [14] | Surface ozone (USA) | Flexible heavy-tailed distributions | Limited | Tail considered, but no threshold-based EVT framework |
| Plocoste et al. [15] | PM10 (Guadeloupe) | Parametric modelling, EVT tools, GOF tests | Yes | Focus on overall distribution, and tail distribution considered but not exceedances |
| Albrecher et al. [24] | Actuarial loss data | Parametric modelling, EVT, QQ plots, derivative plots, tail heaviness ranking | Yes | Focus on overall and tail distribution; introduces derivative QQ plots for tail classification |
| Beirlant et al. [25] | Theoretical/EVT | Nonlinear regression on model plots with mention of advantages of derivative plots, tail heaviness ranking | Yes | Advanced tail classification methodology |
| Albrecher et al. [26] | Actuarial loss data | Weibull-tempered Pareto modelling, QQ and derivative diagnostics, tail heaviness ranking | Yes | Focus on overall and tail distribution; uses derivative plots for tail discrimination |
| Jakata et al. [27] | Financial risk data | Uses derivative plots for tail characterisation before Copula modelling with lognormal marginals, tail heaviness ranking | Yes | Tail dependence modelling; uses derivative plots for tail characterisation |
| Current study | NO2 emissions from a Majuba power plant (2005–2014) | STL decomposition, seasonal and trend adjustment (see next subsection for details), parametric modelling, GOF tests (bootstrap, Vuong test, cross-validated likelihood, etc.), EVT diagnostics, derivative QQ plots, tail heaviness ranking | Yes (threshold-based) | Focus on overall and tail distribution; uses derivative plots for tail discrimination; small sample, tail inference uncertainty |
Table 2. Descriptive statistics for nitrogen dioxide (NO2) emissions, $Y_t^{*}$, given in tons.
| N | Mean | Median | Standard Deviation | Minimum | Maximum | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|
| 108 | 10,445.59 | 10,311.79 | 1090.09 | 7858.59 | 13,507.59 | 0.2359 | 3.0447 |
Table 3. Tests of normality, stationarity, and independence for the NO2 emissions (in tons) from Majuba power station.
| Category | Test | Statistic | p-Value |
|---|---|---|---|
| Normality | Kolmogorov–Smirnov (KS) test for normality of data | 1 | $2.2 \times 10^{-16}$ |
| Normality | Anderson–Darling (AD) test for normality of data | Inf | $5.556 \times 10^{-6}$ |
| Stationarity | Augmented Dickey–Fuller (ADF) test | −8.3474 | <0.01 |
| Stationarity | Phillips–Perron Unit Root Test | −69.24 | <0.01 |
| Stationarity | Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test for Level Stationarity | 0.014691 | >0.1 |
| Independence (Randomness) [55] | Difference-sign test of randomness [53,55,56] | 0.8295 | 0.4068 |
| Independence (Randomness) [55] | Mann–Kendall rank test of randomness [55] | −0.047787 | 0.9619 |
| Independence (Randomness) [55] | Cox–Stuart test of randomness [57] | 29 | 0.6835 |
| Independence (Randomness) [55] | Wald–Wolfowitz Runs Test, two-sided [55] | −1.9336 | 0.05317 |
Table 4. Bootstrap goodness-of-fit tests, information criteria and cross-validated loglikelihood for the Weibull, Lognormal, and Pareto distributions.
| Distribution | AIC | BIC | Cross-Validated Loglikelihood | Bootstrap KS Statistic | Bootstrap KS p-Value | Bootstrap AD Statistic | Bootstrap AD p-Value |
|---|---|---|---|---|---|---|---|
| Weibull | 1833.55 | 1838.92 | −916.2252 | 0.0901 | 0.031 | 1.5409 | 0.001 |
| Lognormal | 1819.31 | 1824.67 | −908.6738 | 0.0429 | 0.907 | 0.2305 | 0.812 |
| Pareto | 2218.85 | 2224.21 | −1107.4337 | 0.5366 | 0 | 40.116 | 0 |
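The AIC and BIC columns of Table 4 can be cross-checked against each other through the maximised log-likelihood. The following is a minimal sketch, assuming two fitted parameters per distribution (as the parameter table suggests) and n = 108 observations (Table 2); the function and variable names are illustrative and are not taken from the paper.

```python
import math

N_OBS = 108      # sample size reported in Table 2
N_PARAMS = 2     # assumption: two fitted parameters per distribution

# AIC and BIC values copied from Table 4
AIC = {"Weibull": 1833.55, "Lognormal": 1819.31, "Pareto": 2218.85}
BIC = {"Weibull": 1838.92, "Lognormal": 1824.67, "Pareto": 2224.21}


def loglik_from_aic(aic, k=N_PARAMS):
    """Invert AIC = 2k - 2*loglik to recover the maximised log-likelihood."""
    return (2 * k - aic) / 2


def bic_from_loglik(loglik, k=N_PARAMS, n=N_OBS):
    """BIC = k*ln(n) - 2*loglik."""
    return k * math.log(n) - 2 * loglik


for name in AIC:
    ll = loglik_from_aic(AIC[name])
    implied = bic_from_loglik(ll)
    print(f"{name}: loglik = {ll:.3f}, implied BIC = {implied:.2f}, "
          f"reported BIC = {BIC[name]:.2f}")

# Rank the models by each criterion (smaller is better)
print(sorted(AIC, key=AIC.get))
print(sorted(BIC, key=BIC.get))
```

Under these assumptions the implied BIC values agree with the reported ones to within about 0.01, and both criteria order the models as Lognormal, then Weibull, then Pareto.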
Table 5. BIC-corrected Vuong test for the Weibull, Lognormal, and Pareto distributions.
| Comparison (Model 1 vs. Model 2) | BIC-Corrected Vuong Test Statistic (V) | p-Value |
|---|---|---|
| Lognormal vs. Weibull | 2.084035 | 0.03716 |
| Lognormal vs. Pareto | 26.883949 | <0.0001 |
| Weibull vs. Pareto | 26.789151 | <0.0001 |
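The p-values in Table 5 appear consistent with referring the statistic V to a standard normal distribution and taking a two-sided tail probability. The sketch below assumes that convention, which is not stated in this section, and uses an illustrative helper name, `two_sided_normal_p`.

```python
import math


def two_sided_normal_p(v):
    """Two-sided p-value 2*(1 - Phi(|v|)) for a standard normal statistic."""
    return math.erfc(abs(v) / math.sqrt(2))


# Statistics copied from Table 5
vuong = {
    "Lognormal vs. Weibull": 2.084035,
    "Lognormal vs. Pareto": 26.883949,
    "Weibull vs. Pareto": 26.789151,
}

for pair, v in vuong.items():
    print(f"{pair}: V = {v:.4f}, p = {two_sided_normal_p(v):.4g}")
```

With this convention, the Lognormal vs. Weibull comparison gives a p-value close to the tabulated 0.03716, and the two comparisons against the Pareto give p-values far below 0.0001.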
Table 6. ML parameter estimates for the Weibull, Lognormal, and Pareto distributions.
| Distribution | Parameter | Estimate | Standard Error |
|---|---|---|---|
| Weibull | τ | 9.8699 | 0.687 |
| Weibull | λ | 10,941.9598 | 113.1302 |
| Lognormal (values are in log scale) | μ | 9.2485 | 0.01 |
| Lognormal (values are in log scale) | σ | 0.104 | 0.0071 |
| Pareto | α | $4.19 \times 10^{6}$ | |
| Pareto | β | $4.38 \times 10^{10}$ | |
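One way to read the estimates in Table 6 is to compare the mean each fitted distribution implies with the sample mean in Table 2. This is a rough sketch under stated assumptions: the Weibull is taken in shape-scale form, F(x) = 1 − exp(−(x/λ)^τ), which the magnitude of λ suggests but this section does not confirm, and μ and σ are taken as the mean and standard deviation of log emissions. The function names are illustrative.

```python
import math

# ML estimates copied from Table 6
WEIBULL_SHAPE, WEIBULL_SCALE = 9.8699, 10941.9598   # tau, lambda (assumed shape-scale form)
LOGNORM_MU, LOGNORM_SIGMA = 9.2485, 0.104           # mu, sigma on the log scale

SAMPLE_MEAN = 10445.59     # from Table 2
SAMPLE_MEDIAN = 10311.79   # from Table 2


def weibull_mean(shape, scale):
    """Mean of a shape-scale Weibull: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1 + 1 / shape)


def lognormal_mean(mu, sigma):
    """Mean of a Lognormal: exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2)


def lognormal_median(mu):
    """Median of a Lognormal: exp(mu)."""
    return math.exp(mu)


print(f"Weibull mean:     {weibull_mean(WEIBULL_SHAPE, WEIBULL_SCALE):.1f}")
print(f"Lognormal mean:   {lognormal_mean(LOGNORM_MU, LOGNORM_SIGMA):.1f}")
print(f"Lognormal median: {lognormal_median(LOGNORM_MU):.1f}")
print(f"Sample mean / median: {SAMPLE_MEAN} / {SAMPLE_MEDIAN}")
```

Under these assumptions both implied means fall within about 1% of the sample mean, which is a sanity check on the parameter reading rather than a goodness-of-fit test.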
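Table 7. BIC-corrected Vuong test for the comparison of the Weibull, Lognormal, and Pareto distributions.
| k | BIC (Weibull) | BIC (Lognormal) | BIC (Pareto) | Lognormal vs. Weibull: Statistic (V) | Lognormal vs. Weibull: p-Value | Lognormal vs. Pareto: Statistic (V) | Lognormal vs. Pareto: p-Value | Weibull vs. Pareto: Statistic (V) | Weibull vs. Pareto: p-Value | Best Model |
|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 174.04 | 171.78 | 234.29 | 2.3488 | 0.0188 | 13.4507 | 0 | 13.1238 | 0 | Lognormal |
| 12 | 190.67 | 188.17 | 255.2 | 2.301 | 0.0214 | 14.9088 | 0 | 13.7076 | 0 | Lognormal |
| 13 | 207.29 | 204.51 | 276.1 | 2.1646 | 0.0304 | 16.1086 | 0 | 14.3098 | 0 | Lognormal |
| 14 | 223.74 | 220.55 | 296.98 | 2.1712 | 0.0299 | 17.0232 | 0 | 14.9065 | 0 | Lognormal |
| 15 | 240.05 | 236.35 | 317.85 | 2.2477 | 0.0246 | 17.6449 | 0 | 15.4743 | 0 | Lognormal |
| 16 | 256.29 | 252.05 | 338.7 | 2.3428 | 0.0191 | 18.0508 | 0 | 15.954 | 0 | Lognormal |
| 17 | 272.45 | 267.61 | 359.55 | 2.4571 | 0.014 | 18.2801 | 0 | 16.3598 | 0 | Lognormal |
| 18 | 288.74 | 283.39 | 380.38 | 2.5268 | 0.0115 | 18.485 | 0 | 16.7224 | 0 | Lognormal |
| 19 | 304.95 | 299.04 | 401.2 | 2.6222 | 0.0087 | 18.6069 | 0 | 17.0047 | 0 | Lognormal |
| 20 | 321.29 | 314.89 | 422 | 2.6761 | 0.0074 | 18.7306 | 0 | 17.3331 | 0 | Lognormal |
| 21 | 337.59 | 330.68 | 442.79 | 2.7504 | 0.006 | 18.8289 | 0 | 17.6024 | 0 | Lognormal |
| 22 | 354.26 | 347.13 | 463.54 | 2.7238 | 0.0065 | 18.959 | 0 | 17.8265 | 0 | Lognormal |
| 23 | 370.92 | 363.51 | 484.29 | 2.738 | 0.0062 | 19.1409 | 0 | 18.0434 | 0 | Lognormal |
| 24 | 387.55 | 379.82 | 505.03 | 2.7736 | 0.0055 | 19.3557 | 0 | 18.3223 | 0 | Lognormal |
| 25 | 404.2 | 396.16 | 525.76 | 2.8134 | 0.0049 | 19.5581 | 0 | 18.5777 | 0 | Lognormal |
| 26 | 420.82 | 412.4 | 546.48 | 2.8719 | 0.0041 | 19.7686 | 0 | 18.7847 | 0 | Lognormal |
| 27 | 437.46 | 428.69 | 567.19 | 2.9266 | 0.0034 | 19.9751 | 0 | 19.016 | 0 | Lognormal |
| 28 | 454.05 | 444.86 | 587.89 | 2.999 | 0.0027 | 20.1686 | 0 | 19.243 | 0 | Lognormal |
| 29 | 470.64 | 461.05 | 608.59 | 3.0678 | 0.0022 | 20.3542 | 0 | 19.4307 | 0 | Lognormal |
| 30 | 487.21 | 477.16 | 629.28 | 3.1475 | 0.0016 | 20.5208 | 0 | 19.6426 | 0 | Lognormal |
| 31 | 503.77 | 493.28 | 649.97 | 3.2226 | 0.0013 | 20.6809 | 0 | 19.8358 | 0 | Lognormal |
| 32 | 520.4 | 509.5 | 670.65 | 3.286 | 0.001 | 20.8441 | 0 | 20.0033 | 0 | Lognormal |
| 33 | 537.01 | 525.66 | 691.32 | 3.362 | 0.0008 | 20.9974 | 0 | 20.2079 | 0 | Lognormal |
| 34 | 553.64 | 541.87 | 711.98 | 3.4203 | 0.0006 | 21.1522 | 0 | 20.3653 | 0 | Lognormal |
| 35 | 570.22 | 557.99 | 732.64 | 3.4944 | 0.0005 | 21.2889 | 0 | 20.5108 | 0 | Lognormal |
| 36 | 586.79 | 574.08 | 753.29 | 3.5685 | 0.0004 | 21.4181 | 0 | 20.6749 | 0 | Lognormal |
| 37 | 603.37 | 590.18 | 773.95 | 3.6392 | 0.0003 | 21.5471 | 0 | 20.8398 | 0 | Lognormal |
| 38 | 619.92 | 606.24 | 794.59 | 3.7169 | 0.0002 | 21.6589 | 0 | 20.9758 | 0 | Lognormal |
| 39 | 636.46 | 622.27 | 815.23 | 3.7928 | 0.0001 | 21.7655 | 0 | 21.1142 | 0 | Lognormal |
| 40 | 653.01 | 638.3 | 835.87 | 3.8633 | 0.0001 | 21.8583 | 0 | 21.2393 | 0 | Lognormal |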
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mamba, M.W.; Chikobvu, D. Statistical Analysis of NO2 Emissions from Eskom’s Majuba Coal-Fired Power Station in Mpumalanga, South Africa. Atmosphere 2026, 17, 415. https://doi.org/10.3390/atmos17040415