2.3. Feature Selection
The dependent variable in this study was the total CO2 emission, measured in MT, and the independent variable was the year. The data were divided into training and testing sets at a train–test ratio of 8:2. The training data were used to fit the CO2 emission model, while the testing data were used to assess the accuracy of the predicted CO2 emissions. For the pre-COVID-19 model, data from December 1751 to December 1994 served as training data and data from December 1995 to December 2018 served as testing data. A similar train–test split was applied to the three remaining datasets by stepping one year forward for each of them.
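The chronological hold-out described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_by_year`, the `(year, value)` record layout and the synthetic emission values are assumptions. Only the cutoff year (1994 for the pre-COVID-19 dataset) and the date range come from the text.

```python
from typing import List, Tuple

Record = Tuple[int, float]  # (year, total CO2 emission); this layout is an assumption


def split_by_year(records: List[Record], last_train_year: int) -> Tuple[List[Record], List[Record]]:
    """Chronological hold-out: earlier years train the model, later years test it."""
    ordered = sorted(records)
    train = [r for r in ordered if r[0] <= last_train_year]
    test = [r for r in ordered if r[0] > last_train_year]
    return train, test


# Synthetic placeholder series covering 1751-2018; the values are NOT the study's data.
pre_covid = [(year, float(year - 1750)) for year in range(1751, 2019)]

train, test = split_by_year(pre_covid, last_train_year=1994)
print(train[0][0], train[-1][0], test[0][0], test[-1][0])  # 1751 1994 1995 2018
```

A time series is split by date rather than at random so that the model is always evaluated on years later than those it was fitted on.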
To augment the data for the years 2020 to 2023, the authors used the data from the source in [1]. The changes for the other years were estimated by assuming rates similar to those observed from 2020 to 2021. The data trend is visualized in Figure 2. In Figure 1, actual and augmented data are clearly visible. Figure 2a,b present data from before the COVID-19 period. Figure 2c shows that CO2 emissions decreased sharply during the pandemic. Similar behavior can be observed in Figure 2d for the remaining augmented years, such as 2021 and 2022. This data augmentation supports the development of an emission model that can trace future emission behavior. Moreover, augmenting the near-term CO2 emission data helps to reduce modeling errors and therefore to build realistic and suitable models. The best-performing model can support robust future policies against CO2 emission problems across the world. For example, a model built only on the data in Figure 2a could predict wrong values, because it may not explain current or future emission behavior during the changing environment of the pandemic. More generally, a forecasting model based only on data from 1751 to 2018 is likely to be inaccurate, so seasonal data from the COVID-19 period should be included. The data values for the different datasets are presented in Table 3.
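One possible reading of the augmentation step, in which later years are filled in using a rate similar to the one observed between the last two known years, is sketched below. The exact procedure is not specified in the text, so the function `extend_with_last_rate`, the use of a constant ratio and all numeric values are assumptions made purely for illustration.

```python
from typing import Dict


def extend_with_last_rate(series: Dict[int, float], until_year: int) -> Dict[int, float]:
    """Extend a yearly series by repeating its most recent year-over-year ratio."""
    years = sorted(series)
    last, prev = years[-1], years[-2]
    rate = series[last] / series[prev]  # e.g. the 2020 -> 2021 ratio
    extended = dict(series)
    for year in range(last + 1, until_year + 1):
        extended[year] = extended[year - 1] * rate
    return extended


# Hypothetical values chosen for easy arithmetic; they are not real emission data.
known = {2019: 105.0, 2020: 100.0, 2021: 110.0}
augmented = extend_with_last_rate(known, until_year=2023)
print(sorted(augmented))  # [2019, 2020, 2021, 2022, 2023]
```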
2.4. Data Modeling
To model CO2 emissions, the authors used two time-series ML techniques: the Autoregressive Integrated Moving Average (ARIMA) model and the Seasonal Autoregressive Integrated Moving Average (SARIMA) model. SARIMA is ARIMA with a seasonal component added. Both are regarded as robust models and can represent both stationary and non-stationary time series. Forecasting a time series involves three steps: (a) tentative identification, (b) parameter estimation and (c) diagnostic checking. Auto-regressive-type models can represent several kinds of time series: (a) auto-regressive (AR), (b) moving average (MA), (c) auto-regressive moving average (ARMA) and (d) ARIMA. The ARIMA model is based on the Box–Jenkins method [37]. ARIMA is written as ARIMA(p, d, q), where the seasonal parameter is absent, and SARIMA is written as SARIMA(p, d, q)(P, D, Q)_S, where S is the seasonal parameter. During the ARIMA model optimization process, S = 19 was found to be the best seasonality parameter; the ARIMA model thus becomes a SARIMA model and is presented as SARIMAX.
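The SARIMA model can be written as:

$$\phi_p(B)\,\Phi_P(B^S)\,(1 - B)^d\,(1 - B^S)^D\, y_t = \theta_q(B)\,\Theta_Q(B^S)\,\varepsilon_t$$

where:

$$\phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p$$

$$\Phi_P(B^S) = 1 - \Phi_1 B^S - \Phi_2 B^{2S} - \dots - \Phi_P B^{PS}$$

$$\theta_q(B) = 1 - \theta_1 B - \theta_2 B^2 - \dots - \theta_q B^q$$

$$\Theta_Q(B^S) = 1 - \Theta_1 B^S - \Theta_2 B^{2S} - \dots - \Theta_Q B^{QS}$$

In the equations above, $y_t$ is the observed value at time t; B is the backshift operator defined by $B^k y_t = y_{t-k}$; $\phi_p(B)$ is the regular (non-seasonal) autoregressive operator of order p; $\Phi_P(B^S)$ is the seasonal autoregressive operator of order P; $\theta_q(B)$ is the regular moving average operator of order q; $\Theta_Q(B^S)$ is the seasonal moving average operator of order Q; $\varepsilon_t$ is independently and identically distributed as a normal random variable with mean zero, variance $\sigma^2$ and $\mathrm{cov}(\varepsilon_t, \varepsilon_{t-k}) = 0$ for all $k \neq 0$; p is the order of the auto-regressive term; q is the order of the moving average term; P is the order of the seasonal auto-regressive term for seasonal period length S; Q is the order of the seasonal moving average term for seasonal period length S; D is the order of seasonal differencing; and d is the order of ordinary differencing [38].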
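To make the role of the backshift operator B and the differencing orders d and D more concrete, the sketch below applies the operators (1 - B) and (1 - B^S) to a small synthetic series, using the standard definition that B shifts a series back by one time step. The function names and the toy series (a linear trend plus a pattern repeating every S = 4 steps) are assumptions for illustration and are unrelated to the study's data or its seasonal period.

```python
from typing import List


def difference(series: List[float], lag: int = 1) -> List[float]:
    """Apply (1 - B^lag): subtract from each value the value `lag` steps earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_difference(series: List[float], d: int, D: int, S: int) -> List[float]:
    """Apply (1 - B)^d (1 - B^S)^D to a series."""
    out = list(series)
    for _ in range(D):  # seasonal differencing, D times
        out = difference(out, lag=S)
    for _ in range(d):  # ordinary differencing, d times
        out = difference(out, lag=1)
    return out


# Synthetic series: linear trend plus a pattern that repeats every S = 4 steps.
S = 4
pattern = [0.0, 3.0, -1.0, 2.0]
y = [2.0 * t + pattern[t % S] for t in range(16)]

w = sarima_difference(y, d=1, D=1, S=S)
print(w)  # eleven zeros: both the seasonal pattern and the trend are removed
```

Seasonal differencing removes the repeating pattern and leaves a constant, and the ordinary difference then removes that constant, which is why the differenced toy series is identically zero. The AR and MA operators are fitted to the differenced series.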
When fitting data to a SARIMA model, the differencing orders d and D are estimated first; this gives good results when seasonality is present. The remaining orders p, q, P and Q are then chosen using the auto-correlation function (ACF) and the partial auto-correlation function (PACF). The ACF and PACF were calculated automatically by the program developed for data modeling. To control overfitting in the models, hold-out validation (train–test split), feature selection and data augmentation were used.
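As an illustration of how the ACF informs order selection, the sketch below computes the sample autocorrelation of a series and flags the lags that fall outside the approximate 95% white-noise band of ±1.96/√n. It is a minimal stand-in for the authors' program, which is not described in detail: the function name `sample_acf` and the synthetic alternating series are assumptions, and the PACF is not shown.

```python
import math
from typing import List


def sample_acf(series: List[float], max_lag: int) -> List[float]:
    """Return the sample autocorrelations r_0..r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(v * v for v in dev)  # lag-0 sum of squared deviations
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


# Synthetic alternating series, chosen only so the result is easy to check by hand.
series = [1.0, -1.0] * 10
acf = sample_acf(series, max_lag=3)

# Approximate 95% band for white noise; lags outside it are candidate model terms.
band = 1.96 / math.sqrt(len(series))
significant_lags = [k for k, r in enumerate(acf) if k > 0 and abs(r) > band]
print(acf, significant_lags)
```

For this toy series the autocorrelations at lags 1 to 3 are -0.95, 0.90 and -0.85, all well outside the band of about ±0.44.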
To evaluate the models, several prediction metrics were used, namely mean absolute percentage error (MAPE) [39], mean squared error (MSE) [40], root mean squared error (RMSE) [39] and mean absolute deviation (MAD) [39]. For simplicity and consistency, only MAPE scores are presented in this paper for comparing model accuracy.
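The four metrics can be sketched with their usual textbook definitions, as below. Two assumptions are made here: MAPE is expressed as a fraction (so 0.09 corresponds to 9%), and MAD is taken as the mean absolute deviation of the forecast from the actual value. The sample numbers are placeholders, not results from the study.

```python
import math
from typing import Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute percentage error, as a fraction of the actual value."""
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mse(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Root mean squared error, in the same units as the data."""
    return math.sqrt(mse(actual, forecast))


def mad(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute deviation of the forecast from the actual value."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


# Placeholder values for illustration only.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
print(mape(actual, forecast), mse(actual, forecast),
      rmse(actual, forecast), mad(actual, forecast))
```

MAPE is scale-free, which makes it convenient for comparing models fitted to datasets of different lengths, whereas MSE, RMSE and MAD depend on the units of the data.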
To build the different models, the ARIMA algorithm was executed repeatedly under an optimization algorithm developed by the authors, and after efficiency checks the most efficient model for each dataset was retained. The selected models were as follows: for the pre-COVID-19 period, the ARIMA(2,1,2)(0,1,1)[19] (SARIMAX(2, 1, 2)x(0, 1, 1, 19)) model with ACF = 0.88 and MAPE = 0.32; for the start-COVID-19 period, the ARIMA(1,1,2)(0,1,1)[19] (SARIMAX(1, 1, 2)x(0, 1, 1, 19)) model with ACF = 0.93 and MAPE = 0.28; for the trans-COVID-19 period, the ARIMA(0,2,1)(1,1,1)[19] (SARIMAX(0, 2, 1)x(1, 1, 1, 19)) model with ACF = 0.90 and MAPE = 0.19; and for the post-COVID-19 period, the ARIMA(0,2,1)(1,1,1)[19] (SARIMAX(0, 2, 1)x(1, 1, 1, 19)) model with ACF = 0.88 and MAPE = 0.09.
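For concreteness, the post-COVID-19 specification, with orders (0, 2, 1)(1, 1, 1) and seasonal period 19, can be written out in backshift notation, where $y_t$ denotes the emission series, $B$ the backshift operator and $\varepsilon_t$ the noise term. This is a sketch under the common Box–Jenkins sign convention, in which moving average terms enter with minus signs; that convention is an assumption here, and some software packages use plus signs instead, so fitted coefficients may differ in sign. The symbols $\Phi_1$, $\theta_1$ and $\Theta_1$ are generic coefficients, not fitted values from the study.

```latex
(1 - \Phi_1 B^{19})\,(1 - B)^{2}\,(1 - B^{19})\, y_t
  = (1 - \theta_1 B)\,(1 - \Theta_1 B^{19})\,\varepsilon_t
```

Because p = 0, there is no non-seasonal autoregressive factor on the left-hand side. The series is differenced twice at lag 1 and once at lag 19 before the single seasonal AR term and the two MA terms are fitted.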
Here, the authors introduced three metrics to quantify the difference in forecast values between models and thereby show how error propagates from one model to another. Each metric is the difference between the values predicted by the pre-COVID-19 (pre-COV) model and those predicted by one of the models covering the COVID-19 surge (start-COVID-19, trans-COVID-19 or post-COVID-19). This means that, to forecast actual CO2 emissions during and after the COVID-19 period, one needs to select a model other than the pre-COVID-19 model; otherwise, a substantial error will propagate through the forecast values over time.
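The comparison described above, in which the pre-COVID-19 forecast is set against each of the other three forecasts, can be sketched as a per-year difference. The function name `forecast_gap`, the dictionary keys, the sign convention (pre-COVID-19 forecast minus the other forecast) and all numeric values are assumptions for illustration only.

```python
from typing import Dict, List


def forecast_gap(baseline: List[float], other: List[float]) -> List[float]:
    """Per-year difference between a baseline forecast and another model's forecast."""
    if len(baseline) != len(other):
        raise ValueError("forecasts must cover the same years")
    return [b - o for b, o in zip(baseline, other)]


# Hypothetical forecasts for the same three future years (not the study's results).
forecasts: Dict[str, List[float]] = {
    "pre": [36.0, 37.0, 38.0],
    "start": [35.5, 36.0, 36.5],
    "trans": [34.0, 34.5, 35.0],
    "post": [34.5, 35.5, 36.5],
}

gaps = {
    name: forecast_gap(forecasts["pre"], values)
    for name, values in forecasts.items()
    if name != "pre"
}
print(gaps["start"])  # [0.5, 1.0, 1.5]
```

A gap that widens with the forecast horizon, as in the hypothetical "start" row, is the kind of error propagation the text warns about when the pre-COVID-19 model is used for later periods.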
Once the models were developed, their outcomes were visualized. Figure 3 and Figure 4 present the outcomes of the models built beforehand. Figure 3 presents the internal forecasting results for the time period covered by each dataset (ending in 2018, 2019, 2020 or 2021), and Figure 4 presents the external, or future, forecasting behavior (time periods beyond the datasets, that is, 2022, 2023, etc.).
During data modeling, four options are available. One can build a model: (1) based on the base data from before the COVID-19 pandemic (pre-COVID-19 data, as shown in Figure 2a); (2) based on the data from when the COVID-19 pandemic starts surging (start-COVID-19 data, as shown in Figure 2b); (3) based on the data from while the COVID-19 pandemic is spreading globally (trans-COVID-19 data, as shown in Figure 2c); (4) based on the data from after the COVID-19 pandemic is over (post-COVID-19 data, as shown in Figure 2d).
If one builds a CO2 emission model with option (1), the forecasting results will probably not represent the real emissions observed during the COVID-19 pandemic. Option (2) is not justified for the same reason, because in that period the COVID-19 pandemic was only beginning to spread across China. Option (3) is a viable option for building a model to forecast CO2 emissions, because in this period the COVID-19 pandemic had spread across the world and massive lockdowns had already shut down a huge number of CO2 emission sources across the world. Option (4) is a supplementary option if COVID-19 finishes its surge across the globe in this period. The time span of 2022 to 2023 can be considered the period in which the COVID-19 pandemic finishes its surge; if it does not, this period can be considered part of the extended transmission period. As seen from the global situation, the COVID-19 surge ended in 2021, so option (4) can only represent a post-COVID-19 situation.
To forecast the behavior of CO2 emissions well, the authors suggest building all four models (or at least the three models from options (2) to (4)), so that the full range of emission behavior is covered, since no single model can forecast CO2 emissions exactly. In the end, the authors chose a model that reflects CO2 emissions in the near and far future.
2.5. Model Validation
To validate the models presented in this paper, Figure 3 provides sufficient evidence. Figure 3 shows the forecasting behavior of emissions for the current (1751–2021) and future (2022 and beyond) years. The pink shaded region in Figure 3 is the confidence interval (the upper and lower bounds of the forecast). Furthermore, a number of performance parameters are presented here to better understand the forecast values. Table 4 presents the modeling error and accuracy parameters found during model development.
As seen from Table 5, the error scores are 32%, 28%, 19% and 9% for the pre-, start-, trans- and post-COVID-19 models, respectively. In accordance with Table 4 [41], the accuracy levels of the respective models are named Reasonable, Reasonably Better, Accurate and Highly Accurate.
Hence, the best models are those that use data from during the COVID-19 pandemic surge. The outcomes of these models closely resemble the reality of CO2 emissions across the globe. As the COVID-19 pandemic reaches its mild stage across the globe and lockdowns end, the situation can be treated as the post-COVID-19 period. As a result, the post-COVID-19 model is the one that reflects the current real-world situation concerning CO2 emissions. We predicted CO2 emission values for a few years and compared them with real emission data [42] as well as a benchmark IPCC model [27]. The comparison results show that the model performs well against both the real-world data and the benchmark IPCC model. All the results are measured in ppm and gigatons (GT). They are presented in Table 6.
As seen from Table 6, the forecasting models are justified and accurate enough to represent real CO2 emission behavior for the current period and the near future. It is inferred that far-future predictions would be justified too. For further forecasting, the most accurate model (the post-COVID-19 model) was chosen to present the different forecasting scenarios.