Article

A Multi-Stage Feature Selection and Explainable Machine Learning Framework for Forecasting Transportation CO2 Emissions

by
Mohammad Ali Sahraei
1,*,†,
Keren Li
2 and
Qingyao Qiao
3,*,†
1
Department of Civil Engineering, College of Engineering, University of Buraimi, Al Buraimi 512, Oman
2
School of Aerospace Engineering, Beijing Institute of Technology, Beijing 100811, China
3
Guangzhou Institute of Energy Conversion, Chinese Academy of Sciences, Guangzhou 510640, China
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Energies 2025, 18(15), 4184; https://doi.org/10.3390/en18154184
Submission received: 9 June 2025 / Revised: 16 July 2025 / Accepted: 31 July 2025 / Published: 7 August 2025

Abstract

The transportation sector is a major consumer of primary energy and a significant contributor to greenhouse gas emissions. Sustainable transportation requires identifying and quantifying the factors that influence transport-related CO2 emissions. This research aims to establish an adaptable, precise, and transparent forecasting framework for transport CO2 emissions in the United States. To this end, we propose a multi-stage method that combines explainable Machine Learning (ML) with Feature Selection (FS) and offers greater interpretability than conventional black-box models. Because of high multicollinearity among the 24 initial variables, hierarchical feature clustering and multi-step FS were applied, resulting in five key predictors: Total Primary Energy Imports (TPEI), Total Fossil Fuels Consumed (FFT), Annual Vehicle Miles Traveled (AVMT), Air Passengers-Domestic and International (APDI), and Unemployment Rate (UR). Four ML methods (Support Vector Regression, eXtreme Gradient Boosting, ElasticNet, and Multilayer Perceptron) were employed, with ElasticNet outperforming the others with RMSE = 45.53, MAE = 30.6, and MAPE = 0.016. SHAP analysis revealed AVMT, FFT, and APDI as the top contributors to CO2 emissions. This framework can help policymakers make informed decisions and target investments precisely.

1. Introduction

1.1. Background

Currently, the transportation sector accounts for 21% of global carbon emissions. According to the International Energy Agency (IEA), transport demand is anticipated to double by 2070, including a 60% rise in the car usage rate and a threefold increase in freight aviation [1]. Road travel accounts for approximately 75% of transport emissions, or 15% of global carbon emissions: 45.1% of transport emissions are attributed to passenger automobiles and 29.4% to lorries engaged in freight transportation. Aviation, shipping, rail, and other systems contribute 11.6%, 10.6%, 1%, and 2.2%, respectively [2].
Figure 1 displays the historical data of United States (U.S.) CO2 emissions since 1990. Transport was the second-largest emitting sector until 2016, when it surpassed electric power and became the largest source of CO2 emissions [3]. Containing CO2 emissions in the transport field is crucial for mitigating climate change and attaining sustainability. Accurately forecasting future CO2 emissions is a necessary first step, as dependable forecasts allow stakeholders and policymakers to establish effective strategies, apply cleaner technologies, and encourage sustainable transport options.
The development of big data and artificial intelligence (AI) has triggered an astonishing revolution in many research fields. A range of Machine Learning (ML) Methods were used to forecast the level of CO2 emission around the world, such as Artificial Neural Networks (ANN) [4,5,6], Support Vector Machine (SVM) [5], regression methods [4,6,7,8,9,10,11], time series analysis [12,13,14], Deep Learning (DL) [5], decision trees [4,15], and hybrid methods [16,17,18].
Although several researchers have used different AI and empirical techniques to predict CO2 emissions, limited attention has been paid to the transportation field in the USA. Furthermore, prior research on CO2 emission prediction has either focused on a limited number of preselected features or relied on short data records. For example, Chukwunonso et al. [19] developed four ML algorithms to evaluate total CO2 emissions in the United States with data from 1973 to 2022 and eight indicators. Ahmed et al. [20] explored the energy usage and greenhouse gas (GHG) emissions of the United States, India, China, and Russia using ML methods based on features including gas, petroleum liquids, coal, and renewable power usage. Mishra et al. [21] implemented the wavelet transform method to examine the relationship among economic development, the transport sector, tourism development, and CO2 emissions over a period of 16 years after 2001. Table 1 summarizes some of the latest research on forecasting CO2 emissions inside and outside the United States. Feature selection (FS) has received little attention in these studies, although it is crucial because ML forecasts depend on, and are sensitive to, the features selected [22,23,24]. Additionally, a proper interpretation of feature importance is vital for decision and policy making. In general, precise forecasting of CO2 emissions based on the right features is fundamental to supporting policymaking.

1.2. Research Novelty

The novelty of this study can be summarized in the following points:
  • This research proposes a multi-stage ML framework for accurate forecasting of transportation CO2 emissions in the U.S. Potential factors were selected based on a comprehensive literature review and include energy usage as well as economic and demographic parameters such as population, unemployment rate, Gross Domestic Product (GDP), urban population, and total fossil fuels consumed. This broad and varied information allows the model to capture the drivers of CO2 emissions in the transport field and is more comprehensive than conventional methods that concentrate exclusively on energy metrics.
  • With highly multicollinear features such as total primary energy production, GDP, fossil fuel consumption, population, air transport freight, and road mileage, conventional forecasting usually suffers from decreased precision. To address this, a main focus of the study was a two-stage FS procedure: time series clustering first determines groups of correlated factors, and an FS voting mechanism based on Boruta FS and Spearman correlation analysis then identifies the most appropriate feature in each group. This structured method supports the robustness and reliability of the forecasting results.
  • Four ML models, namely SVR, eXtreme Gradient Boosting (Xgboost), ElasticNet, and Multilayer Perceptron (MLP), were employed to improve the accuracy of CO2 emissions forecasting in the transportation field. The SHAP method was then applied to open the black box of ML by identifying the global and local relationships between input features and CO2 emissions. The explainable ML methods offer insights that can help policymakers design efficient measures for decreasing CO2 emissions in the USA. Figure 2 depicts the conceptual framework for forecasting transportation CO2 emissions based on the above procedure. All machine learning code was implemented with Python 3.11.9, scikit-learn 1.5.1, xgboost 2.1.0, and shap 0.47.0.
The rest of the paper is arranged as follows: Section 2 reviews the related literature on transportation CO2 emission forecasting, and Section 3 describes the data collected and the methodology. Section 4 presents the results and discussion, Section 5 provides policy recommendations and limitations, and Section 6 concludes and outlines future research.

2. Literature Review

Much research on CO2 emission in the transportation sector has been developed utilizing ML algorithms and statistical methods. An overview of the recent research concerning the abovementioned topic with a variety of methods is provided as an initial step in the review process.

2.1. Machine Learning for CO2 Prediction

Qiao et al. [22] used ML techniques to predict CO2 emissions and energy usage in the United Kingdom's transport field. The experimental outcomes revealed that road carbon intensity was the most influential factor for CO2 emissions and energy usage, while GDP per capita and population were less crucial. Chukwunonso et al. [19] modeled CO2 emissions in the USA from 1973 to 2022 with four ML algorithms, of which the Recurrent Neural Network (RNN) outperformed all others. Nevertheless, RNNs are generally recognized to struggle with long-term dependencies because of vanishing gradients, which makes them less powerful in capturing long-range temporal behavior. Additionally, the importance of features was not discussed in their research. To predict transportation-related GHG emissions in China, Yin et al. [4] introduced a new method that combined information extraction and supervised ML techniques. The results indicated that ANN achieved higher prediction precision than the other models. Fu et al. [23] suggested integrating ML, localized emissions information, and satellite imaging to create an accurate and applicable system for monitoring GHG emissions along roads. The outcomes demonstrated that combining ML techniques with satellite imagery can effectively monitor GHG emissions. Similarly, Javanmard et al. [16] used hybrid multi-objective ML models to forecast transport emissions and energy needs, which indicated increases in CO2 emissions and energy demand of 0.02% and 50.02%, respectively, from 2019 to 2048 in the Canadian transport field. Ahmed et al. [20] reviewed the energy usage of the USA, India, China, and Russia, as well as the corresponding pattern in GHG emissions, using ML methods. The outcomes predicted with the long short-term memory (LSTM) technique showed a rise in CO2, methane, and nitrous oxide (N2O) emissions for India and China and a slowdown for the USA and Russia. Whereas Ahmed et al. [20] used ML methods to forecast GHG emissions, Ulussever et al. [24] analyzed sector-based energy usage indicators from 1973 to 2021 as independent features to estimate CO2 emissions in the USA. The empirical results show that ML outperformed the time series econometric models. Li et al. [25] applied three ML algorithms to predict transport-based CO2 emissions for the top 30 CO2-emitting nations from 1960 to 2020. The results showed that the gradient boosting regression model with variables mixing transport and socioeconomic elements was the most effective for transport CO2 emission prediction. Although Li et al. [25] integrated both socioeconomic and transportation variables, their research had a worldwide scope and did not specifically evaluate the USA transport field. Ağbulut [5] used three ML algorithms to predict transport CO2 emissions and energy needs in Turkey. CO2 emissions and transport energy demand were predicted to grow at annual rates of 3.65% and 3.7%, respectively, reaching almost 3.4 times today's levels by 2050. Although Ağbulut [5] effectively predicted upcoming trends, the study concentrated on Turkey's transport field and used a restricted set of input parameters. To predict the peak of transportation CO2 emissions in China, Wang and Wang [26] suggested a new bio-inspired forecasting model, the Manta Rays Foraging Optimization-Extreme Learning Machine (MRFO-ELM), with the mean impact value technique used to examine the significance of 13 important variables. The outcomes show that the suggested MRFO-ELM performs well in both optimization search speed and forecasting precision. Alfaseeh et al. [27] employed an LSTM method to forecast the GHG Emission Rate (ER) from the most important features, for example, velocity, density, and the GHG ER of prior time steps. Compared with clustering and the autoregressive integrated moving average (ARIMA) method, the LSTM model using velocity, density, and GHG ER from the three preceding minutes performed best. Qin and Gong [28] used RF and decision trees to identify the variables influencing CO2 emissions in the economically more developed eastern areas of China; for example, GDP, foreign investment, and general fiscal budget revenue can affect CO2 emissions. The results show that locations with severe CO2 emissions, like Chongqing, Tianjin, and Shanghai, ought to be prioritized for low-carbon economic development.
The studies above mainly applied existing ML approaches without integrating enhanced FS methods or considering a broad variety of influencing variables. Moreover, they offer limited interpretability, which is a central concern of the current research because it determines how applicable a model is to real-world policy decisions.

2.2. Feature Selection for the CO2 Prediction

FS technique can be utilized in data pre-processing to obtain effective data reduction while maintaining the informative features. A suitable FS process is essential in forecasting CO2 emission. We provide an overview of several of the latest research that considers FS techniques within the fields of energy and CO2 emission.
The effects of FS techniques on ML models for prediction of CO2 emission and energy consumption throughout the UK were evaluated by Qiao et al. [22]. A novel voting strategy for FS was introduced, including both integrated and filter methodologies. Wang et al. [29] introduced a hybrid model merging scenario analysis with the ML method. Prior to the training models, FS (i.e., stepwise regression) was a substantial step to simplify the model and prevent overfitting. Ma et al. [30] suggested a multivariate evaluation of the most essential elements for the national level of air quality from 171 features varying in economic, energy, environmental, meteorological, and demographical aspects. To tackle huge information, the FS method and Extreme Gradient Boosting were used to model the connection and evaluate the feature significance. Li and Sun [31] utilized a set of open access information and ML algorithms to forecast CO2 emissions throughout China. Eighteen out of thirty-one factors were chosen based on the FS technologies to create forecasting models of carbon dioxide emissions. They discovered that the statistical indicators of city environment pollution were generally the most essential features in terms of the urban-level carbon dioxide emissions within China.
Van Zyl et al. [32] evaluated the effectiveness of explainable AI techniques, particularly Shapley Additive Explanations and Gradient-weighted Class Activation Mapping, as FS methods for national energy demand prediction. Amiri et al. [33] used FS for forecasting household transport energy use. Karasu et al. [34] designed an energy prediction model based on the SVR algorithm with a wrapper FS method driven by multi-objective optimization. Tang et al. [35] developed a DL model for predicting NOx emissions based on a combination of FS and the JAYA optimization algorithm. Although these studies (Van Zyl et al. [32], Amiri et al. [33], Karasu et al. [34], and Tang et al. [35]) have made significant contributions to FS, they generally ignore the challenge of multicollinearity, which can skew model forecasts and decrease the performance of FS techniques.
Following the above review, Table 1 summarizes studies from around the world related to CO2 evaluation in the field of transportation. Transport emissions are affected by many technological, environmental, and socioeconomic variables, so the scarcity of studies that include FS represents a significant gap. Existing research mostly concentrated on traditional forecasting methods without methodically optimizing the input parameters, possibly resulting in decreased predictive precision and model interpretability. This highlights the need for more extensive research on FS techniques tailored to CO2 emission prediction that provide strong and reliable predictive frameworks. On this basis, we propose a comprehensive feature selection framework that addresses the multicollinearity issue while retaining the informative features. It balances precision, robustness, and interpretability, which makes it well suited to predicting CO2 emissions and offers more dependable inputs for policymaking.
Table 1. Summary of the literature review in predicting CO2 emissions.
| Article | Country | Time Period | Feature Selection | Machine Learning | Applied Models: ML | Applied Models: Hybrid | Applied Models: Others |
|---|---|---|---|---|---|---|---|
| Qiao et al. [22] | UK | 1990–2019 | Yes | Yes | LSTM | SVR-RBF | --- |
| Javanmard et al. [16] | Canada | | No | Yes | --- | multi-objective mathematical model with data-driven ML | --- |
| Ahmed et al. [20] | China, India, the USA, and Russia | 1992–2018 | No | Yes | SVM, ANN, LSTM | --- | --- |
| Qin and Gong [28] | China | 2000–2019 | No | Yes | RF, decision tree | --- | --- |
| Ulussever et al. [24] | USA | 1973–2021 | No | Yes | MLP, SVM, RF, MARS, k-NN | --- | Time Series Econometric |
| Li et al. [25] | 30 Countries | 1960–2020 | No | Yes | OLS, SVM, GBR | --- | --- |
| Ağbulut [5] | Turkey | 1970–2016 | No | Yes | DL, SVM, ANN | --- | --- |
| Li and Sun [31] | China | 2010 | Yes | Yes | GBM, SVM, RF, and XGBoost | --- | --- |
| Chukwunonso et al. [19] | USA | 1973–2022 | No | Yes | RNN, FFNN, CNN | --- | --- |
| Wang and Jixian [26] | China | 2000–2012 | No | Yes | --- | MRFO-ELM | --- |
| Alfaseeh et al. [27] | Canada | N/A | No | Yes | LSTM | --- | ARIMA |
| Yin et al. [4] | China | N/A | No | Yes | Decision Tree, MLR, ANN | --- | MLR |
| Fu et al. [23] | worldwide | 2021 | No | Yes | GNN, CNN | --- | --- |
| Wang et al. [29] | China | 1980–2014 | Yes | Yes | SVM, GPR, BPNN | PSO-SVM | --- |
| Peng et al. [36] | China | N/A | No | Yes | --- | GAPSO-SVR | --- |
| Chadha et al. [37] | India | N/A | No | Yes | XGBoost, (SVR), RF, and Ridge Regression | --- | --- |
| Anonna et al. [38] | USA | 1970–2020 | No | Yes | RF, Support Vector Classifier, and Logistic Regression | --- | --- |
| Jha et al. [39] | USA | 2004–2023 | No | Yes | 11 AI algorithms | --- | --- |
| Li et al. [40] | USA | N/A | No | Yes | DL-LSTM | --- | --- |
| Tian et al. [41] | USA | 1973–2022 | No | Yes | k-nearest neighbors (KNN), RF, MLR, gradient boosting, decision tree and SVR | --- | --- |
| Ajala et al. [42] | China, India, the USA, and the EU27&UK | 2022–2023 | No | Yes | 14 AI algorithms | --- | --- |
| Current Research | USA | 1990–2023 | Yes | Yes | XGBoost, MLP, SVR, Elastic Net | --- | --- |
Note: Deep Learning (DL), Feed-Forward Neural Network (FFNN), Convolutional neural network (CNN), Multinomial Logistic Regression (MLR), Generalized Neural Network (GNN).

3. Materials and Methods

Figure 2 outlines the proposed multi-step FS-based explainable ML approach for forecasting the total CO2 emissions of transportation, TCT (t + 1). The study involved four steps. Step 1 was data identification and collection: features deemed crucial in the literature review were collected from multiple official websites and merged into CSV files. Because the variables collected in step 1 are likely to include redundant and irrelevant ones, step 2 established a comprehensive multi-step FS method based on hierarchical feature clustering, Boruta FS, and Spearman correlation to mitigate data irrelevancy and redundancy. Inspired by the autoregressive model, the study also employed autocorrelation analysis to determine how many historical TCT values were most relevant to TCT (t + 1) and added them as additional independent variables. Step 3 involved ML forecasting under three feature set scenarios: the feature subset produced by the proposed FS method, the original feature set, and a feature subset generated by Recursive Feature Elimination-Random Forest (RFE-RF) [38]. Data was split into training and testing datasets at a ratio of 80:20 without shuffling. The 80:20 ratio was chosen not only because it is typical for ML applications but also because of the limited sample size; assigning 80% of the data to training gives the ML models as much data to learn from as possible. Ten-Fold Time Series Split Grid Search Cross Validation (TSGSCV) by Liu and Zhou [39] was employed to tune the hyperparameters of each ML model under each feature set scenario. Three evaluation metrics, namely Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), were used to validate the performance of the trained ML models on the remaining 20% of the data (the testing dataset). In step 4, the feature subset and ML model with the best performance were selected for Shapley analysis to quantify the contribution of the most relevant and important variables to TCT (t + 1). The source code and the data of the research can be found in the GitHub repository (https://github.com/qiaoqingyao/CANADACO2/tree/master (accessed on 14 July 2025)).
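The split, tuning, and evaluation stage described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the five columns merely stand in for the selected predictors, and the hyperparameter grid, the scaler, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 34 yearly rows, 5 predictors.
rng = np.random.default_rng(0)
n_years = 34
X = rng.normal(size=(n_years, 5))
y = 1800 + X @ np.array([60.0, 40.0, 30.0, -20.0, 10.0]) \
    + rng.normal(scale=5.0, size=n_years)

# Chronological 80:20 split without shuffling.
split = int(n_years * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Grid search with a ten-fold time series split on the training part only.
# The grid values are illustrative, not those of the paper.
param_grid = {
    "elasticnet__alpha": [0.01, 0.1, 1.0],
    "elasticnet__l1_ratio": [0.2, 0.5, 0.8],
}
search = GridSearchCV(
    make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000)),
    param_grid,
    cv=TimeSeriesSplit(n_splits=10),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# Evaluate the tuned model on the held-out final 20%.
y_pred = search.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
mae = float(mean_absolute_error(y_test, y_pred))
mape = float(mean_absolute_percentage_error(y_test, y_pred))
print(search.best_params_, rmse, mae, mape)
```

Each fold of `TimeSeriesSplit` trains on earlier rows and validates on later ones, so no future observation leaks into hyperparameter tuning.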

3.1. Data Collection

Figure 3 shows the record of CO2 emissions from the transportation sector in the USA from 1990 to 2023. Emissions rose overall from almost 1600 Mt in 1990 to 2026 Mt in 2007 and then decreased moderately until 2012, probably because of financial factors. Emissions then increased again from 2012 (1776 Mt) to 2019 (1924 Mt). In 2020, there was a sharp decrease to 1633 Mt, which is lower than the 1994 level, probably because of reduced transportation activity during the COVID-19 pandemic. After 2020, emissions rebounded but remained below the pre-pandemic level through 2023 (1858 Mt).
In this research, 25 variables, including 24 independent and one dependent variable, were selected. Table 2 lists these variables along with their abbreviations and statistical descriptions. Generally, we divide all the input features into four categories: transportation-based, socioeconomic, environment-based, and energy-based features. Transportation-based features comprise air passengers—domestic and international—air transport freight, railway passengers, railway goods transported, number of motor vehicles registered, annual vehicle miles traveled, road mileage, and motor vehicle licensed drivers. Socioeconomic features include population, urban population rate, GDP, unemployment rate, gasoline prices, crude oil first purchase price, and annual percent change. Environmental factors include average annual temperature. Energy-based features consist of fossil fuels consumed, biomass energy consumed, electricity sales to ultimate customers, primary energy production, primary energy consumption, petroleum (excluding biofuel imports), and primary energy imports.
In the current study, all data was extracted from the World Bank Group, https://data.worldbank.org/ (accessed on 26 July 2023), International Energy Agency, https://www.iea.org/data-and-statistics/, (accessed on 14 July 2025), U.S. Energy Information Administration, https://www.eia.gov/ (accessed on 14 July 2025), U.S. Department of Energy, https://afdc.energy.gov/data (accessed on 1 May 2023), U.S. Department of Transportation, https://www.bts.gov/, (accessed on 14 July 2025), National Weather Service, https://www.weather.gov/ (accessed on 1 January 2025), Federal Reserve Bank of Minneapolis, https://www.minneapolisfed.org/ (accessed on 1 January 2025).

3.2. Feature Selection Methods

3.2.1. Hierarchical Clustering

In order to deal with multicollinearity of independent variables, a hierarchical clustering was first implemented, which is a clustering technique in data analysis that aims to construct a tree-like structure of clusters [43]. This method involves grouping data points into a hierarchy of clusters, where each cluster is formed by combining smaller clusters or individual data points. There are two main approaches to hierarchical clustering: agglomerative (bottom-up), which involves merging data points into larger clusters, and divisive (top-down), which entails splitting a cluster into smaller sub-clusters. Agglomerative clustering approach was employed in this study. The key task of the agglomerative method is an effective cluster combination, which in general is achieved by comparing the dissimilarity between observation pairs based on some distance metrics; for instance, in this study, Ward’s Minimum variance method:
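$$\frac{|A| \cdot |B|}{|A| + |B|} \left\| \mu_A - \mu_B \right\|^2 = \sum_{x \in A \cup B} \left\| x - \mu_{A \cup B} \right\|^2 - \sum_{x \in A} \left\| x - \mu_A \right\|^2 - \sum_{x \in B} \left\| x - \mu_B \right\|^2 \tag{1}$$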
Once the cluster tree is established, the next step in hierarchical clustering is determining the optimal number of clusters. The Silhouette score was employed to decide the best number of clusters.
$$S = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{2}$$
where $a_i$ is the average distance from point i to the other data points within the same cluster and $b_i$ is the average distance from point i to the points of the nearest other cluster.
The number of clusters with the largest average Silhouette score is regarded as optimal, and the independent variables are clustered accordingly.
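A minimal sketch of this clustering step is given below, using SciPy's Ward linkage and scikit-learn's silhouette score. The data is synthetic, and treating each standardized feature column as one observation with Euclidean distance is an assumption made for illustration; the paper's exact distance setup may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): three latent drivers, each observed through
# three noisy copies, giving nine features in three collinear groups.
rng = np.random.default_rng(42)
n_years = 34
base = rng.normal(size=(n_years, 3))
features = np.hstack(
    [base[:, [k]] + 0.05 * rng.normal(size=(n_years, 3)) for k in range(3)]
)

# Standardize each feature, then cluster the features (columns), so every
# row of `profiles` is one feature's standardized time series.
standardized = (features - features.mean(axis=0)) / features.std(axis=0)
profiles = standardized.T

# Agglomerative clustering with Ward's minimum variance criterion.
tree = linkage(profiles, method="ward")

# Cut the tree at several cluster counts and keep the count with the
# highest average silhouette score.
scores = {}
for k in range(2, 7):
    cut = fcluster(tree, t=k, criterion="maxclust")
    scores[k] = silhouette_score(profiles, cut)
best_k = max(scores, key=scores.get)
labels = fcluster(tree, t=best_k, criterion="maxclust")
print(best_k, labels)
```

Features that receive the same label form one multicollinear group, from which a single candidate is chosen in the next step.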

3.2.2. Boruta Feature Selection (BFS) and Spearman Correlation

The independent variables within each cluster are strongly multicollinear; in other words, they carry similar information. In this step, the task is to determine whether each cluster is relevant to TCT (t + 1) and, at the same time, to choose the best candidate variable from each cluster. BFS and Spearman correlation were employed for this task.
The BFS algorithm is a wrapper method built around the RF. It measures the importance of each variable by the average loss of accuracy across the trees of the forest, expressed as a Z score (the mean decrease in accuracy). Before selection, a shuffled shadow copy of every variable is generated; shuffling removes the correlation between the shadow variable and the forecast target, and comparing the real variables against these shadows makes the method better suited to data with strongly correlated variables [44]. The BFS procedure is outlined below:
  • Add randomness by generating shuffled copies (shadow variables) of all variables, and then combine the shadow variables with the original variables to create the extended variable set.
  • Fit an RF on the extended variable set and calculate the importance of each variable (the Z value of the mean decrease in accuracy). The greater the Z value, the more significant the variable; the greatest Z value among the shadow variables is denoted Zmax.
  • In every iteration, a variable whose Z value is greater than Zmax is regarded as significant and is retained. Otherwise, the variable is regarded as insignificant and is eliminated from the variable set.
  • The procedure stops when all variables have been either rejected or confirmed, or when BFS reaches the maximum number of iterations [42].
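The shadow-variable idea in the steps above can be sketched as follows. This is a deliberately simplified illustration, not the Boruta algorithm itself: it substitutes the forest's impurity-based `feature_importances_` for the mean-decrease-accuracy Z score, and it replaces Boruta's statistical test with a simple majority count of "hits". The function name and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def shadow_feature_hits(X, y, n_iter=10, seed=0):
    """Count how often each real feature beats the best shadow feature."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for it in range(n_iter):
        # Shadow variables: each column shuffled independently, which
        # breaks any relationship with the target.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        extended = np.hstack([X, shadows])
        forest = RandomForestRegressor(n_estimators=100, random_state=seed + it)
        forest.fit(extended, y)
        importance = forest.feature_importances_
        # Best shadow importance plays the role of Zmax in this sketch.
        shadow_max = importance[n_features:].max()
        hits += importance[:n_features] > shadow_max
    return hits


# Synthetic example (assumption): only the first two columns drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

n_iter = 10
hits = shadow_feature_hits(X, y, n_iter=n_iter)
keep = hits > n_iter // 2  # retained if it wins in most iterations
print(hits, keep)
```

A full implementation would additionally remove rejected variables between iterations and stop early once every variable is decided.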
Spearman correlation is a nonparametric statistical technique that evaluates the rank correlation between two variables by assessing the extent to which their rankings are related through a monotonic function [43]. It provides insights into the strength and direction of the association between variables based on their ordinal rankings. Because it only requires a monotonic relationship, Spearman correlation can also measure nonlinear relationships between variable pairs.
For each cluster, a conservative FS mechanism based on BFS and Spearman correlation analysis was proposed as follows:
  • If multiple variables are tested to be relevant in a cluster, the variable that has the highest correlation score with statistical significance is determined as the best candidate variable for this cluster. In case no variable indicates statistical significance, a compromise is that the variable that has the highest correlation score is still regarded as the best one.
  • If only one variable is either tested to be relevant or indicates the highest correlation score with statistical significance in a cluster, then this variable is the chosen variable for this cluster.
  • If variables in a cluster show neither relevancy nor statistical significance in Spearman correlation, then this cluster is discarded and no variable will be selected for the next step of ML.
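One possible reading of these three rules is sketched below with `scipy.stats.spearmanr`. The function name, the use of the absolute correlation as the "correlation score", and the 0.05 significance level are assumptions for illustration; the Boruta result is passed in as a precomputed set of relevant variable names.

```python
from scipy.stats import spearmanr


def pick_cluster_candidate(columns, target, relevant, alpha=0.05):
    """Return the chosen variable name for one cluster, or None to discard.

    columns:  dict mapping variable name -> sequence of values
    target:   sequence of target values, e.g. TCT (t + 1)
    relevant: set of names confirmed as relevant by Boruta
    """
    stats = {}
    for name, values in columns.items():
        rho, p_value = spearmanr(values, target)
        stats[name] = (abs(rho), p_value)  # assumption: rank by |rho|

    def score(name):
        return stats[name][0]

    significant = [n for n in columns if stats[n][1] < alpha]
    confirmed = [n for n in columns if n in relevant]

    if len(confirmed) > 1:
        # Rule 1: prefer significant variables; fall back to all confirmed.
        pool = [n for n in confirmed if n in significant] or confirmed
        return max(pool, key=score)
    if len(confirmed) == 1:
        # Rule 2: a single relevant variable is chosen directly.
        return confirmed[0]
    if significant:
        # Rule 2 (other branch): no relevant variable, but a significant one.
        return max(significant, key=score)
    # Rule 3: neither relevant nor significant, so the cluster is discarded.
    return None


target = list(range(1, 11))
cluster = {
    "a": [v ** 2 for v in target],           # perfectly monotonic
    "b": [2, 1, 3, 4, 5, 6, 7, 8, 10, 9],    # nearly monotonic
    "c": [5, 1, 9, 3, 7, 2, 10, 4, 8, 6],    # weakly related
}
print(pick_cluster_candidate(cluster, target, relevant={"a", "b"}))
```

In this toy cluster, "a" and "b" are both marked relevant and both significant, so the variable with the higher absolute correlation is selected.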

3.3. Machine Learning Models

3.3.1. eXtreme Gradient Boosting (Xgboost)

Xgboost is an ensemble technique that combines weak learners into a considerably more robust model through an iterative process. At every iteration, the residual of the previous estimate is used to update and reduce the loss function. The approach typically uses a binary decision tree as the base learner and integrates regularization into the loss function to enhance performance and reduce overfitting, as seen in Equation (3). Here, the loss function l measures the difference between the actual value $y_i$ and the forecast $\hat{y}_i$ for every sample i.
$$L = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega(k) \tag{3}$$
For a regression problem such as this one, the Mean Squared Error (MSE) can be used as the loss function. The term Ω is a regularization term that penalizes the complexity of every tree. The complexity of a decision tree k is determined by its structure and may be expressed as Equation (4). Here, $w_j$ indicates the score of leaf j and T denotes the number of leaves in the decision tree.
$$\Omega(k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{4}$$
Both γ and λ serve as factors that scale the penalty. Let $\hat{y}_i^{(t)}$ represent the prediction for the i-th sample at the t-th iteration, which can be written as Equation (5), with $q_t$ denoting the tree added at the t-th iteration. With this substitution, the objective function becomes Equation (6). Subsequently, to assist the optimization procedure, a second-order Taylor expansion is used to approximate the objective function, as seen in Equation (7).
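$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + q_t(x_i) \tag{5}$$
$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + q_t(x_i)\right) + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{6}$$
$$L^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i q_t(x_i) + \frac{1}{2} h_i q_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{7}$$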
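Here, $g_i$ and $h_i$ denote the first- and second-order derivatives of the loss with respect to $\hat{y}_i^{(t-1)}$. Given that the previous iterations (up to t − 1) have been completed, the first term $l(y_i, \hat{y}_i^{(t-1)})$ is effectively constant and is excluded from the optimization process. Moreover, each sample finally resides in a single leaf node, where $q_t(x_i)$ equals the leaf score $w_j$; hence, the objective can be expressed as Equation (8).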
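$$\hat{L}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T \tag{8}$$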
where I j represents the sample sets that correspond to leaf node j [45].
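Because Equation (8) is quadratic in each leaf score, it has a closed-form minimizer, $w_j^{*} = -G_j / (H_j + \lambda)$ with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$; this standard result is not stated in the text above and is added here for illustration. The sketch below checks it numerically on a toy example, assuming the loss $\frac{1}{2}(y_i - \hat{y}_i)^2$ (so that $g_i = \hat{y}_i - y_i$ and $h_i = 1$) and a hypothetical two-leaf assignment.

```python
import numpy as np


def leaf_objective(w, G, H, lam, gamma):
    """Equation (8) for leaf scores w, given per-leaf sums G and H."""
    T = len(w)
    return float(np.sum(G * w + 0.5 * (H + lam) * w ** 2) + gamma * T)


# Toy data (assumption): four samples, previous prediction is constant.
y = np.array([3.0, 5.0, 10.0, 12.0])
prev_pred = np.full(4, 7.5)

# Derivatives of the assumed loss 0.5 * (y - y_hat)^2.
g = prev_pred - y          # first-order derivative
h = np.ones_like(y)        # second-order derivative

# Hypothetical tree structure: samples 0-1 in leaf 0, samples 2-3 in leaf 1.
leaf_of = np.array([0, 0, 1, 1])
T = 2
lam, gamma = 1.0, 0.1

G = np.array([g[leaf_of == j].sum() for j in range(T)])
H = np.array([h[leaf_of == j].sum() for j in range(T)])

# Closed-form minimizer of the quadratic in each leaf score.
w_star = -G / (H + lam)
best = leaf_objective(w_star, G, H, lam, gamma)
print(w_star, best)
```

Larger values of λ shrink the leaf scores toward zero, while γ adds a fixed cost per leaf and therefore discourages additional splits.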

3.3.2. Multilayer Perceptron (MLP)

The MLP consists of several layers: an input layer, an output layer, and intermediate layers known as hidden layers. The common structure of an MLP is shown in Figure 4. Neurons in adjacent layers are linked, and every link has a specific weight. Every node in a hidden layer carries out two basic operations, summation and activation [46,47]. The summation combines the weighted inputs and the bias, as presented in Equation (9).
$$S_j = \sum_{i=1}^{n} w_{ij} I_i + \beta_j \tag{9}$$
Here, n is the number of inputs, $I_i$ is the i-th input value, $w_{ij}$ is the connection weight, and $\beta_j$ is the bias. The performance of the ANN is generally assessed with a loss function, and a common choice is the MSE [46,47]. The MSE is the mean of the squared differences between the actual and predicted values, as given in Equation (10):
$$MSE = \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n} \tag{10}$$
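The summation step of Equation (9) and the loss of Equation (10) can be sketched in a few lines. This is only an illustration with hypothetical inputs and weights; the activation function is omitted because the text does not specify which one was used.

```python
def neuron_sum(inputs, weights, bias):
    """Weighted sum S_j = sum_i w_ij * I_i + beta_j for one hidden node."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n


# Hypothetical values for a node with three inputs.
s = neuron_sum([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], bias=0.25)
err = mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
print(s, err)
```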

3.3.3. Support Vector Regression (SVR)

SVR is a supervised learning method based on the statistical learning theory presented by Vapnik [48]. The essential concept of SVR is to map the input features into a higher-dimensional feature space using a nonlinear mapping, and then to fit a linear regression in that space [49,50]. Consider a training set $\{(x_1, y_1), \dots, (x_l, y_l)\}$ drawn from the true model, where $x_i \in \mathbb{R}^n$ is the feature vector and $y_i \in \mathbb{R}^1$ is the output. Given the parameters C > 0 and ε > 0, the standard form of support vector regression is given in Equation (11):
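$$\min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i + C \sum_{i=1}^{l} \xi_i^* \tag{11}$$
subject to the following constraints:
$$\begin{aligned} w^T \varphi(x_i) + b - y_i &\le \varepsilon + \xi_i, \\ y_i - w^T \varphi(x_i) - b &\le \varepsilon + \xi_i^*, \\ \xi_i,\ \xi_i^* &\ge 0, \quad i = 1, \dots, l \end{aligned} \tag{12}$$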
where w is the weight vector, C controls the trade-off between tube violations and regularization, and $\xi_i^*$ and $\xi_i$ denote the lower and upper tube violations. ε is the vertical tube width, b is the bias (intercept) of the linear model, and $\varphi(x_i)$ maps $x_i$ into a higher-dimensional space. The dual problem obtained through optimization (Equation (13)) is given as:
$$\min_{\alpha,\,\alpha^*} \; \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*) \tag{13}$$
with the constraints (α and α* are the dual variables associated with every data point):
$$e^T (\alpha - \alpha^*) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^* \le C, \quad i = 1, \dots, l \tag{14}$$
Once the dual problem is solved, the SVR function can be written as
$$f(x) = \sum_{i=1}^{l} (-\alpha_i + \alpha_i^*) K(x_i, x) + b \tag{15}$$
where $K(x_i, x)$ is the kernel function. In this study, the Sigmoid kernel was used, which can be written as
$$K(x_i, x) = \tanh(a\, x_i^T x + r) \tag{16}$$
where r is the shifting parameter that controls the threshold of the mapping and a is the scaling parameter of the input data. In the training of SVR, the penalty parameter and the kernel function parameters were optimized through cross-validation [50].
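To make the roles of the kernel parameters concrete, the sketch below evaluates the Sigmoid kernel and the resulting SVR function for already-solved dual variables. It assumes the common LIBSVM sign convention (−α_i + α_i*) for the dual coefficients; the support vectors, dual values, and parameter settings are hypothetical and are not taken from the study.

```python
import math


def sigmoid_kernel(xi, x, a, r):
    """K(x_i, x) = tanh(a * <x_i, x> + r), with scaling a and shift r."""
    dot = sum(u * v for u, v in zip(xi, x))
    return math.tanh(a * dot + r)


def svr_predict(x, support_vectors, alpha, alpha_star, b, a, r):
    """f(x) = sum_i (-alpha_i + alpha_i^*) * K(x_i, x) + b."""
    return b + sum(
        (-ai + ai_s) * sigmoid_kernel(xi, x, a, r)
        for xi, ai, ai_s in zip(support_vectors, alpha, alpha_star)
    )


# Hypothetical solved dual problem with two support vectors.
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
alpha = [0.5, 0.0]
alpha_star = [0.0, 0.25]
pred = svr_predict([1.0, 1.0], support_vectors, alpha, alpha_star, b=0.1, a=1.0, r=0.0)
print(pred)
```

Because tanh saturates at ±1, large values of a push the kernel toward a step-like response, which is one reason the kernel parameters are usually tuned by cross-validation.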

3.3.4. Elastic Net

Elastic Net is a statistical and ML approach that combines the advantages of Lasso (L1) and Ridge (L2) regularization to enhance model interpretability and forecasting performance. It is particularly well suited to data with highly correlated variables or to cases where the number of predictors exceeds the number of observations. Before applying Elastic Net, it is important to understand the two elements it integrates:
  • L2 (Ridge regression) shrinks coefficients toward zero by penalizing their squared magnitude. It performs well when variables are highly correlated; however, it does not carry out FS.
  • L1 (Lasso) promotes sparsity by penalizing the absolute magnitudes of coefficients, frequently setting some coefficients to zero and thereby carrying out FS. Nevertheless, it struggles when predictors are highly correlated, as it tends to arbitrarily select one and neglect the others.
Elastic Net addresses the restrictions of Lasso and Ridge by merging their penalties (Equation (17)):
$$\text{Minimize:}\quad \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 \tag{17}$$
where β denotes the coefficients to be estimated, X the independent features (predictors), y the dependent feature (target), $\lambda_2$ controls the L2 penalty (Ridge-like behavior), and $\lambda_1$ controls the L1 penalty (Lasso-like behavior).
Elastic Net introduces an additional parameter, α, that sets the balance between the L1 and L2 penalties:
$$\text{Penalty} = \alpha \lVert \beta \rVert_1 + (1 - \alpha) \lVert \beta \rVert_2^2 \tag{18}$$
where α = 0 is equivalent to Ridge regression, α = 1 is equivalent to Lasso regression, and 0 < α < 1 gives a combination of both, which provides Elastic Net its flexibility [51].
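The effect of the mixing parameter α can be checked numerically. The sketch below computes the mixed penalty for a hypothetical coefficient vector at α = 0 (pure Ridge term), α = 1 (pure Lasso term), and an intermediate value; the function name and the numbers are illustrative assumptions.

```python
def elastic_net_penalty(beta, alpha):
    """alpha * ||beta||_1 + (1 - alpha) * ||beta||_2^2 for a coefficient vector."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(b) for b in beta)     # sum of absolute coefficients
    l2_sq = sum(b * b for b in beta)   # sum of squared coefficients
    return alpha * l1 + (1.0 - alpha) * l2_sq


beta = [0.5, -2.0, 0.0]  # hypothetical coefficients
ridge_only = elastic_net_penalty(beta, 0.0)
lasso_only = elastic_net_penalty(beta, 1.0)
mixed = elastic_net_penalty(beta, 0.9)
print(ridge_only, lasso_only, mixed)
```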
The hyperparameter search space of the above models, i.e., Xgboost, MLP, SVR and Elastic Net, is detailed in Table 3.

3.4. Model Evaluation

To evaluate the performance of the ML methods, statistical indices, namely the Mean Error (ME) and the Standard Deviation (StD) of the error, the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE), and the Mean Absolute Percentage Error (MAPE) (Equations (19)–(21)), were used, where $\hat{y}_i$ is the forecasted CO2 emission, $y_i$ is the actual value, and n is the sample size.
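$$ME = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)$$
$$StD = \sqrt{\frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i - ME \right)^2}{n}}$$
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{n}}$$
$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$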
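A minimal sketch of these five indices is given below, using only the Python standard library. MAPE is returned as a fraction rather than a percentage, which matches the scale of the values reported in this paper (e.g., 0.016); the function name and the sample values are hypothetical.

```python
import math


def forecast_metrics(y_true, y_pred):
    """Return ME, StD, MAE, RMSE and MAPE for paired actual and forecasted values."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    me = sum(errors) / n
    std = math.sqrt(sum((e - me) ** 2 for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = sum(abs(e / y) for e, y in zip(errors, y_true)) / n  # requires y != 0
    return {"ME": me, "StD": std, "MAE": mae, "RMSE": rmse, "MAPE": mape}


# Hypothetical actual and forecasted values.
metrics = forecast_metrics([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
print(metrics)
```

Note that ME can be zero even when individual errors are large, because positive and negative errors cancel, which is why it is reported together with StD, MAE, and RMSE.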

3.5. Shapley Analysis

The best-performing ML model was interpreted using the SHAP method. This method can be used in ML to evaluate the contribution of each variable to the model forecast [52]. The Shapley value of variable $X_j$ within a model is given by:
$$\text{Shapley}(X_j) = \sum_{S \subseteq N \setminus \{j\}} \frac{k!\,(p - k - 1)!}{p!} \left( f(S \cup \{j\}) - f(S) \right)$$
where p is the total number of variables, N\{j} is the set of all variables excluding $X_j$, S is a subset of N\{j} containing k variables, f(S) is the model forecast with the variables in S, and f(S∪{j}) is the model forecast with the variables in S plus variable $X_j$.
The interpretation of Equation (20) is that the Shapley value of a particular variable is its marginal contribution to the model forecast, averaged over all feasible models with distinct combinations of variables. The Shapley value has a number of beneficial properties, namely symmetry, efficiency, additivity, and dummy [53,54]. Symmetry means that two variables have the same Shapley value if they play an equal role in the model. Efficiency requires that all variable contributions sum to the difference between the forecast and the average forecast. Additivity requires that the sum of the Shapley values from individual models equal the Shapley value from the combination of those models. Dummy means that a variable has a Shapley value of zero when its marginal contribution to all feasible models is zero. The results of this method are fair and unique if all the above properties are satisfied [55,56].
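The Shapley formula can be evaluated exactly for a small number of variables by enumerating every subset S. The sketch below does this for a hypothetical three-variable model with one interaction term, where variables outside S are replaced by a zero baseline (an assumption made only for this example; practical SHAP implementations use approximations and data-driven baselines). The efficiency property can be checked by summing the resulting values.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, p):
    """Exact Shapley values for p variables.

    `value` maps a frozenset S of variable indices to the model forecast f(S).
    """
    phi = []
    for j in range(p):
        others = [i for i in range(p) if i != j]
        total = 0.0
        for k in range(len(others) + 1):
            # Weight k! (p - k - 1)! / p! shared by all subsets of size k.
            weight = factorial(k) * factorial(p - k - 1) / factorial(p)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi


x = [1.0, 2.0, 3.0]  # hypothetical observation


def value(s):
    """Hypothetical model; variables not in s are set to a zero baseline."""
    z = [x[i] if i in s else 0.0 for i in range(3)]
    return 2.0 * z[0] + 3.0 * z[1] + z[0] * z[2]


phi = shapley_values(value, 3)
print(phi, sum(phi))
```

The interaction term is split equally between the two variables involved, and the contributions sum to the difference between the full forecast and the baseline forecast.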

4. Results and Discussion

4.1. Multicollinearity Analysis and FS Method

The annual rate of the dependent/independent variables between 1990 and 2023 is visualized in Figure 5. Considering the dramatic impact of COVID-19 on almost every aspect of daily life, data for 2020 were excluded from this study to avoid any adverse impact of outliers on CO2 emission forecasting. The dependent variable, i.e., total CO2 by transportation TCT (t + 1), shows a pattern similar to that of many independent variables, such as Total Fossil Fuels Consumed by the Transportation (FFT), Total Energy Consumed by the Transportation (TET), Total Primary Energy Consumption (TPEC), etc. Similar patterns were also frequently observed among the independent variables (e.g., Population, Urban Population rate (UPR), GDP, Motor vehicle licensed drivers (MVLD)). For ML, redundant features do not necessarily sacrifice forecasting performance. However, multicollinearity, to a certain extent, hinders proper interpretation of feature importance, which plays a significant role in decision-making for stakeholders and policymakers.
To mitigate the multicollinearity in the dataset, hierarchical feature clustering was applied to the independent variables only, and the Silhouette score was used to determine the optimal number of clusters. The results are illustrated in Figure 6. As shown in Figure 6a, variable pairs with the minimum Ward distance were iteratively merged into hierarchical feature clusters, shown in different colors, until all variables were clustered. For instance, TPI and TPEI, which show a significant similarity in Figure 5, had the smallest Ward distance, as shown on the right side of the purple cluster. The Silhouette score in Figure 6b confirms that the optimal number of clusters is 8. The detailed feature clusters are presented in Figure 7. Specifically, time series scaling was applied so that the input variables have a comparable range. As the units of the variables within the same cluster varied, the unit description is not presented.
After clustering independent features, the next step is to determine the best candidate from each cluster. The results of the conservative feature selection mechanism based on Boruta feature selection and Spearman correlation are listed in Table 4. Based on the proposed multi-step feature selection framework (hierarchical clustering-Boruta feature selection-Spearman correlation), variables including TPEI, FFT, AVMT, APDI and UR were determined as the selected independent feature subset for the following TCT (t + 1) forecasting task.
For one-step forecasting, i.e., TCT (t + 1), the previous study by Qiao et al. [22] also included the last historical value of CO2 emissions, specifically TCT (t), as an independent variable to improve ML model performance; however, it did not statistically consider the autocorrelation of the time series data. Inspired by the autoregressive model, this study measured the autocorrelation of TCT (t + 1), and the result indicated a significant relation with the last 3 steps of historical TCTs, i.e., TCT (t + 1) ~ TCT (t) + TCT (t − 1) + TCT (t − 2), as illustrated in Figure 8.
Based on the above analysis, the final feature subset was determined with a total number of 8, including TPEI, FFT, AVMT, APDI, UR, TCT (t), TCT (t − 1), and TCT (t − 2). In order to examine the performance of our proposed multistep feature selection framework, this study additionally employed another popular embedded feature selection method named 10-fold Cross Validated Recursive Feature Elimination with RF as the base model (RFE-RF) for comparison. The selected feature subset by RFE-RF was a total of 10 features, including FFT, TET, TPEP, TPEC, Population, UPR, GDP, TPI, MVLD, TCT (t), and TCT (t − 1). The original feature set without any feature selection was also employed as a baseline.
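The three lagged targets in the final feature subset, TCT (t), TCT (t − 1), and TCT (t − 2), can be built from the emission series as sketched below. The function name and the short series are hypothetical; the sketch only illustrates how each one-step-ahead target is paired with its lags.

```python
def make_lagged_rows(tct, n_lags=3):
    """Pair each target TCT(t + 1) with [TCT(t), TCT(t - 1), ..., TCT(t - n_lags + 1)]."""
    rows = []
    # The first usable t needs n_lags - 1 earlier values; the last needs a value at t + 1.
    for t in range(n_lags - 1, len(tct) - 1):
        lags = [tct[t - k] for k in range(n_lags)]
        rows.append((lags, tct[t + 1]))
    return rows


series = [10.0, 11.0, 12.0, 13.0, 14.0]  # hypothetical annual values
rows = make_lagged_rows(series)
for lags, target in rows:
    print(lags, "->", target)
```

Each additional lag removes one usable row from the start of the series, which matters for a dataset of only a few dozen annual observations.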

4.2. Modeling and Evaluation

The four popular ML methods, including Xgboost, SVR, MLP and ElasticNet, were employed for the forecasting tasks. The dataset was split with a training-testing ratio of 80:20 without data shuffling. A grid search with a 10-fold time series split was conducted to determine the best hyperparameter setting of each ML method for each feature scenario (the proposed, RFE-RF and original feature sets). The performance of each ML model in TCT (t + 1) forecasting under its best hyperparameter setting is listed in Table 5, and Figure 9 visualizes the performance in terms of RMSE. Although Xgboost showed the best training performance, the significant increase in its test error suggests overfitting in all scenarios, which means this ML method memorized the training data but failed to learn the general pattern of the dataset. MLP showed poor prediction performance regardless of the feature setting. This may be related to the difficulty of hyperparameter tuning and the small dataset; further research is required to understand this unsatisfactory performance. SVR in general showed the same overfitting problem as Xgboost. Despite a higher training error than Xgboost and SVR, Elastic Net achieved the best testing performance with the proposed FS framework, and its testing error was only slightly higher than its training error, which indicates that Elastic Net learned the general pattern of the training dataset and generalized well. Comparing the different feature sets, the performance of the ML models based on the feature set generated by the proposed FS framework is comparable to that with the original and RFE-RF feature sets, without significant compromise. Moreover, given the significant reduction in the number of independent variables, the requirement for data acquisition and computation power/time is reduced from the original 24 features to only five features.
Additionally, redundant features were eliminated compared with RFE-RF, which mitigated the multicollinearity issue and improved interpretability.
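The chronological 80:20 split and the 10-fold time series split can be sketched as follows. This is a simplified expanding-window scheme written with the standard library only; the exact splitter used in the study is not specified, and the sample count of 33 is a hypothetical value chosen for illustration.

```python
def chronological_split(n_samples, train_fraction=0.8):
    """Split indices in time order without shuffling."""
    cut = int(n_samples * train_fraction)
    return list(range(cut)), list(range(cut, n_samples))


def expanding_window_folds(n_samples, n_splits):
    """Folds where each validation block follows all of its training indices."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    folds = []
    first_test = n_samples - n_splits * test_size
    for start in range(first_test, n_samples, test_size):
        train = list(range(start))                      # everything before the block
        valid = list(range(start, start + test_size))   # the next block in time
        folds.append((train, valid))
    return folds


train_idx, test_idx = chronological_split(33)
folds = expanding_window_folds(len(train_idx), n_splits=10)
print(len(train_idx), len(test_idx), len(folds))
```

Keeping every validation block strictly after its training indices avoids leaking future information into the hyperparameter search, which an ordinary shuffled k-fold split would not guarantee.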
The optimal hyperparameter setting of Elastic Net based on 10-fold TSGSCV tuning and the proposed feature subset was: ‘alpha’: 0.1, ‘l1_ratio’: 0.9 and ‘tolerance’: 0.01. The detailed training and testing performance of Elastic Net for the different feature settings is presented in Figure 10. All feature set scenarios showed, in general, promising training performance, with the original feature set outperforming the other two. However, the dramatic decrease in its testing performance indicates overfitting with the original feature set. This, in turn, illustrates the principle of “garbage in, garbage out”: learning from redundant or irrelevant features significantly deteriorates the capability of ML models to capture the main pattern within the data. In comparison, RFE-RF determined a sub-optimal feature subset which, despite using fewer variables, gave acceptable overall performance on both the training and testing datasets without significant deviation from the actual value of TCT (t + 1). When modelling with the feature subset generated by the proposed feature selection framework, the best testing performance was achieved without overfitting or underfitting.

4.3. SHAP Analysis

Given the superior performance of Elastic Net in TCT (t + 1) forecasting, this study employed only Elastic Net for explainable ML. The result of the SHAP analysis is illustrated in Figure 11. The vertical color bar on the right side of Figure 11a,b signifies the value of the independent variables (red indicates a higher value and blue a lower value); the x-axis reflects the SHAP value, i.e., the contribution of an independent variable to the change of TCT (t + 1) relative to its average value. The ordering of variables from top to bottom indicates the feature contribution rank. AVMT, FFT and APDI were the top three contributors to TCT (t + 1), with AVMT and FFT contributions positively associated with TCT (t + 1), while a negative association existed between the APDI contribution and TCT (t + 1). UR and TPEI contributed the least, with negative and positive associations to TCT (t + 1), respectively. It is reasonable to observe the importance of historical TCTs in predicting TCT (t + 1). However, it is interesting that TCT (t − 1) and TCT (t − 2) contributed negatively to TCT (t + 1), in contrast to the positive association between TCT (t) and TCT (t + 1). Figure 11c depicts the distribution of the independent variables and their individual SHAP values. A linear relation between the independent variables and their associated SHAP contributions to TCT (t + 1) was detected. This linear relationship is probably related to the linear nature of the Elastic Net method, which also emphasizes that the embedded model in SHAP analysis has a critical and fundamental impact on SHAP values.
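The linear pattern noted above is what one would expect for any linear model: when features are treated as independent, the SHAP value of feature j reduces to its coefficient times the deviation of the feature from its mean. The sketch below illustrates this closed form with hypothetical coefficients and feature values; it is not the fitted model from this study.

```python
def linear_shap(coef, x, feature_means):
    """SHAP values of a linear model under feature independence: beta_j * (x_j - mean_j)."""
    return [b * (xj - m) for b, xj, m in zip(coef, x, feature_means)]


# Hypothetical linear model with two features.
coef = [0.8, -0.3]
intercept = 5.0
feature_means = [10.0, 4.0]
x = [12.0, 6.0]

phi = linear_shap(coef, x, feature_means)
prediction = intercept + sum(b * xj for b, xj in zip(coef, x))
average_prediction = intercept + sum(b * m for b, m in zip(coef, feature_means))
print(phi, prediction - average_prediction)
```

Each contribution is a straight line in the feature value with slope equal to the coefficient, so a negative coefficient produces the negative association seen for some variables.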
In this study, we proposed a multistep feature selection framework that combines hierarchical clustering, Boruta feature selection, and Spearman correlation analysis. The core idea of the framework is to remove the redundant/multicollinear and irrelevant variables that hinder proper interpretation of feature importance in ML applications. Using the proposed framework, the 24 original variables were reduced to the 5 most representative variables. The Elastic Net with the selected five variables outperformed all other models in all feature set scenarios. The results demonstrate the feasibility and practicality of the proposed framework.
For small datasets, the simple linear Elastic Net outperformed the more advanced ML methods, including SVR, Xgboost and MLP. This is perhaps because advanced ML methods are more capable of memorizing information in the training dataset, which, to a certain degree, compromises their generalization capability. This also echoes the principle of Occam's Razor discussed by Webb [57]: more advanced and complicated ML or DL methods may not be the optimal choice for a small dataset such as the one in this study, with 34 samples. Meanwhile, small datasets may also amplify the importance of individual independent variables. In other words, feature selection and ML prediction may be highly sensitive to the values of individual variables, especially extreme values. The reason is that both feature selection (measuring the distance between variables) and model prediction (gradient descent for SVR, Elastic Net, and MLP, and enthalpy for Xgboost) rely at their core on variance measurement, which is very sensitive to extreme values, especially in small datasets. A proper feature selection therefore largely determines the quality of feature importance interpretation.

5. Policy Recommendations and Limitations

5.1. Policy Recommendations

The results of this research provide an extensive framework for understanding and predicting CO2 emissions in the transport sector, offering useful insights for market stakeholders, urban planners, and policymakers. By identifying important contributors, including AVMT, FFT, APDI, TPEI, and UR, this study highlights the interconnected nature of micro- and macro-level variables in shaping emission trends.
At the macro level, variables such as FFT and TPEI emphasize the crucial need to shift from fossil fuel power systems to alternative and low-carbon energy sources. Policymakers should prioritize investment in clean energy infrastructure, for example, green hydrogen production and electrified transport systems, to decrease reliance on non-renewable sources. Economic parameters, such as the UR, additionally underline the importance of aligning environmental plans with socioeconomic policy to ensure that emission reduction strategies are sustainable and equitable.
At the micro level, the contributions of variables such as AVMT and APDI to CO2 emissions signal an urgent need to address transport efficiency and demand. Policies promoting multimodal transport networks, such as improved public transportation systems and shared mobility, as well as active transport (e.g., infrastructure for walking and biking), would considerably decrease reliance on private automobiles. Encouraging the ownership of low-emission and electric vehicles via financial assistance, tax incentives, and charging infrastructure development would further reduce emissions at the community and individual levels.
Furthermore, incorporating data-driven methods, such as the proposed multi-stage forecasting framework, into policymaking would allow real-time monitoring and dynamic adjustment of strategies. The 33-year dataset used in this study offers a strong basis for analyzing the long-term influence of historical policy interventions, as well as for predicting future developments under different scenarios. Policymakers can leverage this kind of analysis to design adaptive measures that take into account economic and technical development and demographic changes in the transport sector.
By addressing both systemic and immediate options, the framework presented in this research provides policymakers with the tools required to make informed, data-driven decisions. These strategies will support sustainable development by achieving substantial reductions in CO2 emissions, contributing to global efforts to combat climate change and to transition to a low-carbon future.
Besides its practical value in predicting United States transport emissions, the proposed explainable ML framework with multi-stage FS is inherently adaptable and can be used in other application domains, regions, and datasets. Its systematic integration of SHAP-based model interpretation, Boruta-Spearman evaluation, and hierarchical clustering supports robust performance even in situations with limited data availability and high feature multicollinearity. Consequently, the framework can serve as a general tool for environmental forecasting projects beyond the United States, allowing policymakers worldwide to extract valuable information from localized data.

5.2. Limitations

While this research delivers useful information on CO2 emissions forecasting in the transport sector, some limitations need to be considered.
The study initially incorporated an extensive dataset of 24 explanatory variables and one dependent variable, covering a broad variety of transport, macroeconomic, environmental, and energy factors. However, through the FS techniques (Boruta feature selection and Spearman correlation analysis), only 5 essential variables were kept for the forecasting model. Although this approach improves model performance and decreases multicollinearity, it might unintentionally exclude other potentially important variables. For example, variables including BET, EUT, RGT, or AAT could have had an indirect but substantial impact on CO2 emissions and were not investigated completely. Although the chosen variables (TPEI, FFT, AVMT, APDI, and UR) were shown to be the most influential for prediction, this focus on essential predictors restricts the ability to discover relationships among less prominent variables. This restriction limits a comprehensive understanding of emission dynamics, especially within complicated systems with many interconnected variables.
The research depends on historical data covering a period of 33 years from 1990. Although this extensive dataset provides a strong foundation for evaluation, it may not completely account for emerging trends, for example, rapid changes in urbanization patterns or electric vehicle ownership. Such developments could affect the accuracy and applicability of the proposed forecasting framework in future scenarios.
Finally, the research focuses only on the transport sector in the United States. As a result, the findings might not be applicable to other nations or regions without adjustments that account for differences in transport systems, energy resources, and socioeconomic circumstances. Future investigations could explore cross-country comparisons or broaden the analysis to include further sectors, for example, commercial or residential energy use.
Despite these limitations, the proposed framework demonstrates the potential of combining FS and ML methods to forecast CO2 emissions, paving the way for more refined and flexible models in future research.

6. Conclusions and Future Research

In this research, we proposed a multi-stage framework for forecasting transportation CO2 emissions in the United States using FS and ML. A total of 25 variables related to transportation, socioeconomics, environment, and energy were originally selected. Hierarchical clustering, Boruta FS and Spearman correlation were employed to select representative, non-redundant variables. Eight important variables, including TPEI, FFT, AVMT, APDI, UR, TCT (t), TCT (t − 1), and TCT (t − 2), were selected as the independent feature subset for the TCT (t + 1) forecasting task. For the CO2 emissions forecasting task, 4 popular ML methods, including Xgboost, SVR, MLP and ElasticNet, were employed. The results suggested overfitting in Xgboost and SVR, and MLP proved difficult to tune. Elastic Net in general showed the best testing performance (RMSE = 45.53, MAE = 30.6, and MAPE = 0.016) with the proposed FS framework, which indicates that Elastic Net learned the general pattern of the training dataset and generalized well. Accordingly, the performance of the ML models based on feature sets generated by the proposed FS framework is comparable to that with the original and RFE-RF feature sets, without significant compromise. The results of the SHAP analysis showed that AVMT, FFT and APDI were the top three contributors to TCT (t + 1), with AVMT and FFT contributions positively associated with TCT (t + 1), while a negative association existed between the APDI contribution and TCT (t + 1). UR and TPEI contributed the least, with negative and positive associations to TCT (t + 1), respectively. It is interesting that TCT (t − 1) and TCT (t − 2) contributed negatively to TCT (t + 1), in contrast to the positive association between TCT (t) and TCT (t + 1).
Future investigations could explore additional factors, for example, alternative energy adoption rates, electric vehicles, traffic congestion factors, and the effects of policy interventions (e.g., fuel efficiency standards, carbon taxes), to improve the comprehensiveness of the forecasting model. Performing comparative research across nations with different economic conditions and transport infrastructures could offer further insight into worldwide trends in transport emissions. Additionally, future research could use more advanced AI-based techniques, for example, deep learning and hybrid methods, to capture intricate temporal dependencies and non-linear associations among variables.

Author Contributions

Conceptualization, M.A.S. and Q.Q.; methodology, M.A.S. and Q.Q.; software, M.A.S., Q.Q. and K.L.; validation, Q.Q. and K.L.; formal analysis, M.A.S., Q.Q. and K.L.; resources, M.A.S. and Q.Q.; supervision, M.A.S. and Q.Q.; data curation, M.A.S.; writing—original draft preparation, M.A.S. and Q.Q.; writing—review and editing, M.A.S. and Q.Q.; visualization, Q.Q. and K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Data Availability Statement

The data can be made available upon request to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. IEA. Energy Technology Perspectives. 2020. Available online: https://www.iea.org/reports/energy-technology-perspectives-2020 (accessed on 1 September 2020).
  2. Our World in Data. Cars, Planes, Trains: Where Do CO2 Emissions from Transport Come from? Available online: https://ourworldindata.org/co2-emissions-from-transport (accessed on 6 October 2020).
  3. EIA. U.S. Energy-Related Carbon Dioxide Emissions. 2023. Available online: https://www.eia.gov/environment/emissions/carbon/ (accessed on 29 May 2025).
  4. Yin, C.; Wu, J.; Sun, X.; Meng, Z.; Lee, C. Road transportation emission prediction and policy formulation: Machine learning model analysis. Transp. Res. Part D Transp. Environ. 2024, 135, 104390.
  5. Ağbulut, Ü. Forecasting of transportation-related energy demand and CO2 emissions in Turkey with different machine learning algorithms. Sustain. Prod. Consum. 2022, 29, 141–157.
  6. Janhuaton, T.; Ratanavaraha, V.; Jomnonkwao, S. Forecasting Thailand’s Transportation CO2 Emissions: A Comparison among Artificial Intelligent Models. Forecasting 2024, 6, 462–484.
  7. Bamrungwong, N.; Vongmanee, V.; Rattanawong, W. The development of a CO2 emission coefficient for medium-and heavy-duty vehicles with different road slope conditions using multiple linear regression, and considering the health effects. Sustainability 2020, 12, 6994.
  8. Huang, S.; Xiao, X.; Guo, H. A novel method for carbon emission forecasting based on EKC hypothesis and nonlinear multivariate grey model: Evidence from transportation sector. Environ. Sci. Pollut. Res. 2022, 29, 60687–60711.
  9. Sangeetha, A.; Amudha, T. A novel bio-inspired framework for CO2 emission forecast in India. Procedia Comput. Sci. 2018, 125, 367–375.
  10. Singh, P.K.; Pandey, A.K.; Ahuja, S.; Kiran, R. Multiple forecasting approach: A prediction of CO2 emission from the paddy crop in India. Environ. Sci. Pollut. Res. 2022, 29, 25461–25472.
  11. Xu, B.; Lin, B. Factors affecting carbon dioxide (CO2) emissions in China’s transport sector: A dynamic nonparametric additive regression model. J. Clean. Prod. 2015, 101, 311–322.
  12. Solaymani, S. CO2 emissions patterns in 7 top carbon emitter economies: The case of transport sector. Energy 2019, 168, 989–1001.
  13. Saboori, B.; Sapri, M.; bin Baba, M. Economic growth, energy consumption and CO2 emissions in OECD (Organization for Economic Co-operation and Development)’s transport sector: A fully modified bi-directional relationship approach. Energy 2014, 66, 150–161.
  14. Jabali, O.; Van Woensel, T.; De Kok, A. Analysis of travel times and CO2 emissions in time-dependent vehicle routing. Prod. Oper. Manag. 2012, 21, 1060–1074.
  15. Zagow, M.; Elbany, M.; Darwish, A.M. Identifying urban, transportation, and socioeconomic characteristics across US zip codes affecting CO2 emissions: A decision tree analysis. Energy Built Environ. 2024, 6, 484–494.
  16. Javanmard, M.E.; Tang, Y.; Wang, Z.; Tontiwachwuthikul, P. Forecast energy demand, CO2 emissions and energy resource impacts for the transportation sector. Appl. Energy 2023, 338, 120830.
  17. Sun, W.; Wang, C.; Zhang, C. Factor analysis and forecasting of CO2 emissions in Hebei, using extreme learning machine based on particle swarm optimization. J. Clean. Prod. 2017, 162, 1095–1101.
  18. Xu, G.; Schwarz, P.; Yang, H. Determining China’s CO2 emissions peak with a dynamic nonlinear artificial neural network approach and scenario analysis. Energy Policy 2019, 128, 752–762.
  19. Chukwunonso, B.P.; Al-Wesabi, I.; Shixiang, L.; AlSharabi, K.; Al-Shamma’a, A.A.; Farh, H.M.H.; Saeed, F.; Kandil, T.; Al-Shaalan, A.M. Predicting carbon dioxide emissions in the United States of America using machine learning algorithms. Environ. Sci. Pollut. Res. 2024, 31, 33685–33707.
  20. Ahmed, M.; Shuai, C.; Ahmed, M. Analysis of energy consumption and greenhouse gas emissions trend in China, India, the USA, and Russia. Int. J. Environ. Sci. Technol. 2023, 20, 2683–2698.
  21. Mishra, S.; Sinha, A.; Sharif, A.; Suki, N.M. Dynamic linkages between tourism, transportation, growth and carbon emission in the USA: Evidence from partial and multiple wavelet coherence. Curr. Issues Tour. 2020, 23, 2733–2755.
  22. Qiao, Q.; Eskandari, H.; Saadatmand, H.; Sahraei, M.A. An interpretable multi-stage forecasting framework for energy consumption and CO2 emissions for the transportation sector. Energy 2024, 286, 129499.
  23. Fu, H.; Li, H.; Fu, A.; Wang, X.; Wang, Q. Transportation emissions monitoring and policy research: Integrating machine learning and satellite imaging. Transp. Res. Part D Transp. Environ. 2024, 136, 104421.
  24. Ulussever, T.; Kılıç Depren, S.; Kartal, M.T.; Depren, Ö. Estimation performance comparison of machine learning approaches and time series econometric models: Evidence from the effect of sector-based energy consumption on CO2 emissions in the USA. Environ. Sci. Pollut. Res. 2023, 30, 52576–52592.
  25. Li, X.; Ren, A.; Li, Q. Exploring patterns of transportation-related CO2 emissions using machine learning methods. Sustainability 2022, 14, 4588.
  26. Wang, W.; Wang, J. Determinants investigation and peak prediction of CO2 emissions in China’s transport sector utilizing bio-inspired extreme learning machine. Environ. Sci. Pollut. Res. 2021, 28, 55535–55553.
  27. Alfaseeh, L.; Tu, R.; Farooq, B.; Hatzopoulou, M. Greenhouse gas emission prediction on road network using deep sequence learning. Transp. Res. Part D Transp. Environ. 2020, 88, 102593.
  28. Qin, J.; Gong, N. The estimation of the carbon dioxide emission and driving factors in China based on machine learning methods. Sustain. Prod. Consum. 2022, 33, 218–229.
  29. Wang, L.; Xue, X.; Zhao, Z.; Wang, Y.; Zeng, Z. Finding the de-carbonization potentials in the transport sector: Application of scenario analysis with a hybrid prediction model. Environ. Sci. Pollut. Res. 2020, 27, 21762–21776.
  30. Ma, J.; Ding, Y.; Cheng, J.C.; Jiang, F.; Tan, Y.; Gan, V.J.; Wan, Z. Identification of high impact factors of air quality on a national scale using big data and machine learning techniques. J. Clean. Prod. 2020, 244, 118955.
  31. Li, Y.; Sun, Y. Modeling and predicting city-level CO2 emissions using open access data and machine learning. Environ. Sci. Pollut. Res. 2021, 28, 19260–19271.
  32. Van Zyl, C.; Ye, X.; Naidoo, R. Harnessing eXplainable artificial intelligence for feature selection in time series energy forecasting: A comparative analysis of Grad-CAM and SHAP. Appl. Energy 2024, 353, 122079.
  33. Amiri, S.S.; Mostafavi, N.; Lee, E.R.; Hoque, S. Machine learning approaches for predicting household transportation energy use. City Environ. Interact. 2020, 7, 100044.
  34. Karasu, S.; Altan, A.; Bekiros, S.; Ahmad, W. A new forecasting model with wrapper-based feature selection approach using multi-objective optimization technique for chaotic crude oil time series. Energy 2020, 212, 118750. [Google Scholar] [CrossRef]
  35. Tang, Z.; Wang, S.; Li, Y. Dynamic NOX emission concentration prediction based on the combined feature selection algorithm and deep neural network. Energy 2024, 292, 130608. [Google Scholar] [CrossRef]
  36. Peng, T.; Yang, X.; Xu, Z.; Liang, Y. Constructing an environmental friendly low-carbon-emission intelligent transportation system based on big data and machine learning methods. Sustainability 2020, 12, 8118. [Google Scholar] [CrossRef]
  37. Chadha, A.S.; Shinde, Y.; Sharma, N.; De, P.K. Predicting CO2 emissions by vehicles using machine learning. In Proceedings of the International Conference on Data Management, Analytics & Innovation, Virtual Conference, 14–16 January 2022; pp. 197–207. [Google Scholar]
  38. Anonna, F.R.; Mohaimin, M.R.; Ahmed, A.; Nayeem, M.B.; Akter, R.; Alam, S.; Nasiruddin, M.; Hossain, M.S. Machine Learning-Based Prediction of US CO2 Emissions: Developing Models for Forecasting and Sustainable Policy Formulation. J. Environ. Agric. Stud. 2023, 4, 85–99. [Google Scholar]
  39. Jha, R.; Jha, R.; Islam, M. Forecasting US data center CO2 emissions using AI models: Emissions reduction strategies and policy recommendations. Front. Sustain. 2025, 5, 1507030. [Google Scholar] [CrossRef]
  40. Li, S.; Tong, Z.; Haroon, M. Estimation of transport CO2 emissions using machine learning algorithm. Transp. Res. Part D Transp. Environ. 2024, 133, 104276. [Google Scholar] [CrossRef]
  41. Tian, L.; Zhang, Z.; He, Z.; Yuan, C.; Xie, Y.; Zhang, K.; Jing, R. Predicting Energy-Based CO2 Emissions in the United States Using Machine Learning: A Path Toward Mitigating Climate Change. Sustainability 2025, 17, 2843. [Google Scholar] [CrossRef]
  42. Ajala, A.A.; Adeoye, O.L.; Salami, O.M.; Jimoh, A.Y. An examination of daily CO2 emissions prediction through a comparative analysis of machine learning, deep learning, and statistical models. Environ. Sci. Pollut. Res. 2025, 32, 2510–2535. [Google Scholar] [CrossRef]
  43. Monath, N.; Dubey, K.A.; Guruganesh, G.; Zaheer, M.; Ahmed, A.; McCallum, A.; Mergen, G.; Najork, M.; Terzihan, M.; Tjanaka, B. Scalable hierarchical agglomerative clustering. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 1245–1255.
  44. Tang, R.; Zhang, X. CART decision tree combined with Boruta feature selection for medical data classification. In Proceedings of the 2020 5th IEEE International Conference on Big Data Analytics (ICBDA), Xiamen, China, 8–11 May 2020; pp. 80–84.
  45. Hu, L.; Wang, C.; Ye, Z.; Wang, S. Estimating gaseous pollutants from bus emissions: A hybrid model based on GRU and XGBoost. Sci. Total Environ. 2021, 783, 146870.
  46. Adegboye, O.R.; Ülker, E.D.; Feda, A.K.; Agyekum, E.B.; Mbasso, W.F.; Kamel, S. Enhanced multi-layer perceptron for CO2 emission prediction with worst moth disrupted moth fly optimization (WMFO). Heliyon 2024, 10, e31850.
  47. Afzal, S.; Ziapour, B.M.; Shokri, A.; Shakibi, H.; Sobhani, B. Building energy consumption prediction using multilayer perceptron neural network-assisted models; comparison of different optimization algorithms. Energy 2023, 282, 128446.
  48. Vapnik, V.N. The support vector method. In Proceedings of the International Conference on Artificial Neural Networks, Lausanne, Switzerland, 12 June 1997; pp. 261–271.
  49. Burges, C.J. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167.
  50. Cao, C.; Liao, J.; Hou, Z.; Wang, G.; Feng, W.; Fang, Y. Parametric uncertainty analysis for CO2 sequestration based on distance correlation and support vector regression. J. Nat. Gas Sci. Eng. 2020, 77, 103237.
  51. Otten, N.V. What is Elastic Net Regression? Available online: https://spotintelligence.com/2024/11/20/elastic-net/ (accessed on 20 November 2024).
  52. Štrumbelj, E.; Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst. 2014, 41, 647–665.
  53. Datta, A.; Sen, S.; Zick, Y. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 23–25 May 2016; pp. 598–617.
  54. Wen, X.; Xie, Y.; Wu, L.; Jiang, L. Quantifying and comparing the effects of key risk factors on various types of roadway segment crashes with LightGBM and SHAP. Accid. Anal. Prev. 2021, 159, 106261.
  55. Li, Z. Extracting spatial effects from machine learning model using local interpretation method: An example of SHAP and XGBoost. Comput. Environ. Urban Syst. 2022, 96, 101845.
  56. Shapley, L.S. A value for n-person games. Contrib. Theory Games 1953, 2, 307–317.
  57. Webb, G.I. Occam’s Razor. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2010; p. 735.
Figure 1. USA energy related CO2 emission by sector (1990–2023).
Figure 2. The conceptual framework for forecasting transportation CO2 emissions using multi-step FS-based explainable ML.
Figure 3. CO2 emissions for the transportation sector in the USA from 1990 to 2023.
Figure 4. Multilayer perceptron architecture [46].
Figure 5. The annual value of independent/dependent variables between 1990 and 2023.
Figure 6. Summary of hierarchical feature clustering. (a) Hierarchical feature clustering results based on Ward’s distance and (b) the optimal number of clusters.
Figure 7. Hierarchical feature clustering based on a cluster number of 8.
Figure 8. Autocorrelation plot of TCT.
Figure 9. RMSE of ML methods on TCT prediction based on different feature sets.
Figure 10. The training and testing performance of Elastic Net based on different feature sets.
Figure 11. The SHAP value of independent variables to TCT (t + 1) with (a) bee swarm, (b) heatmap, and (c) scatter plots.
Table 2. List of variables, abbreviations, and descriptive statistics.
| Variables | Abbr. | Unit | Mean | Std | Min | Max |
| --- | --- | --- | --- | --- | --- | --- |
| Total CO2 by Transportation | TCT | MtCO2 | 1814.09 | 125.51 | 1565.00 | 2026.00 |
| Total Fossil Fuels Consumed (transportation field) | FFT | Trillion Btu | 25,439.21 | 1691.27 | 21,994.76 | 28,142.77 |
| Biomass Energy Consumed (transportation field) | BET | Trillion Btu | 722.10 | 612.08 | 60.42 | 1788.41 |
| Electricity Sales to Ultimate Customers (transportation field) | EUT | Trillion Btu | 22.11 | 4.18 | 16.06 | 27.89 |
| Total Energy Consumed (transportation field) | TET | Trillion Btu | 26,224.57 | 1982.71 | 22,114.32 | 28,810.91 |
| Total Primary Energy Production | TPEP | Quadrillion Btu | 76.31 | 10.78 | 66.20 | 102.85 |
| Total Primary Energy Consumption | TPEC | Quadrillion Btu | 92.94 | 4.47 | 82.21 | 98.97 |
| Population | --- | --- | 297,726,251.53 | 26,600,955.72 | 249,623,000.00 | 343,477,335.00 |
| Urban Population Rate | UPR | % | 79.90 | 2.24 | 75.30 | 83.30 |
| GDP Per Capita | GDP | USD | 46,036.55 | 15,435.30 | 23,888.60 | 81,632.25 |
| Total Unemployment Rate | UR | % | 5.77 | 1.63 | 3.65 | 9.63 |
| Retail Gasoline Prices | GP | USD per Gallon | 2.17 | 0.94 | 1.03 | 3.95 |
| Total Petroleum, Excluding Biofuels, Imports | TPI | Quadrillion Btu | 22.18 | 3.84 | 16.35 | 29.20 |
| Total Primary Energy Imports | TPEI | Quadrillion Btu | 26.02 | 4.81 | 18.33 | 34.68 |
| Air Passengers-Domestic and International | APDI | Million | 668,920,157.79 | 138,802,316.46 | 369,501,000.00 | 926,737,000.00 |
| Air Transport, Freight | APF | Million ton-km | 32,826.18 | 9782.80 | 14,486.20 | 47,716.00 |
| Railways Passengers | RPC | Million passenger-km | 28,411.97 | 5851.35 | 12,460.00 | 36,393.11 |
| Railways, Goods Transported | RGT | Million ton-km | 2,330,457.76 | 227,003.11 | 1,906,206.0 | 2,702,736.0 |
| Number of Motor Vehicles Registered | MVR | 1000s | 242,272.71 | 28,968.71 | 192,314.00 | 286,300.00 |
| Annual Vehicle Miles Traveled | AVMT | Miles | 2,811,389.09 | 342,132.55 | 2,117,716.00 | 3,284,596.00 |
| Road Mileage | RM | Mile | 4,032,513.13 | 111,366.49 | 3,866,926.00 | 4,200,000.00 |
| Motor Vehicle Licensed Drivers | MVLD | Number | 202,325,783.05 | 20,679,848.46 | 167,015,250.00 | 236,404,000.00 |
| Average Annual Temperature | AAT | °F | 53.97 | 1.64 | 50.30 | 56.60 |
| Crude Oil First Purchase Price | COPP | Dollars per Barrel | 46.70 | 29.21 | 10.87 | 95.99 |
| Annual Percent Change | APC | % | 0.03 | 0.02 | 0.00 | 0.08 |
Abbr.: Abbreviation.
Table 3. Summary of the hyperparameter matrix space of ML models.
| Model | Hyperparameter |
| --- | --- |
| XGBoost | n_estimators: [10, 30, 50, 100, 200] |
| | learning_rate: [0.001, 0.005, 0.01, 0.05, 0.1] |
| | max_depth: [1, 2, 3, 4, 5] |
| MLP | hidden_layer_size: [5, 10, 50, 80, 100] |
| | activation: [‘relu’, ‘tanh’, ‘logistic’] |
| | alpha: [0.0001, 0.001, 0.01, 0.1, 1, 10, 100] |
| | learning_rate: [‘adaptive’, ‘invscaling’, ‘constant’] |
| SVR | C: [1, 10, 12, 14, 16, 18, 20, 22] |
| | gamma: [0.001, 0.01, 0.1, 1, 2, 5] |
| | epsilon: [0.001, 0.01, 0.1, 1, 2, 4] |
| | kernel: [‘rbf’, ‘poly’, ‘sigmoid’] |
| Elastic Net | alpha: [0.0001, 0.001, 0.01, 0.1, 1, 10, 100] |
| | l1_ratio: [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] |
| | tolerance: [0.0001, 0.001, 0.01] |
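As a rough illustration of the size of the search space in Table 3, the sketch below expands each grid into its individual candidate configurations and counts them. It assumes an exhaustive Cartesian (grid) search, which the table itself does not state; `GRIDS` and `expand` are hypothetical names introduced here, and the parameter names are transcribed from the table (with the evident typos in the MLP rows normalized) rather than taken from any particular library API.

```python
from itertools import product

# Candidate values transcribed from Table 3 (assumed to be searched exhaustively).
GRIDS = {
    "XGBoost": {
        "n_estimators": [10, 30, 50, 100, 200],
        "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1],
        "max_depth": [1, 2, 3, 4, 5],
    },
    "MLP": {
        "hidden_layer_size": [5, 10, 50, 80, 100],
        "activation": ["relu", "tanh", "logistic"],
        "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
        "learning_rate": ["adaptive", "invscaling", "constant"],
    },
    "SVR": {
        "C": [1, 10, 12, 14, 16, 18, 20, 22],
        "gamma": [0.001, 0.01, 0.1, 1, 2, 5],
        "epsilon": [0.001, 0.01, 0.1, 1, 2, 4],
        "kernel": ["rbf", "poly", "sigmoid"],
    },
    "Elastic Net": {
        "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
        "l1_ratio": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        "tolerance": [0.0001, 0.001, 0.01],
    },
}


def expand(grid):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


# Number of candidate configurations per model under exhaustive search.
grid_sizes = {model: sum(1 for _ in expand(grid)) for model, grid in GRIDS.items()}

for model, size in grid_sizes.items():
    print(f"{model}: {size} candidate configurations")
```

Under this assumption the SVR grid is the largest (864 candidates) and the XGBoost grid the smallest (125), which gives a feel for the relative tuning cost of each model.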
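Table 4. Summary of feature selection.
| Variable | Correlation | p-Value | Boruta Ranking | Cluster | Decision |
| --- | --- | --- | --- | --- | --- |
| TPI | 0.684 | <0.001 | 2 | 1 | ○ |
| TPEI | 0.711 | <0.001 | 1 | 1 | ● |
| FFT | 0.881 | <0.001 | 1 | 2 | ● |
| TPEC | 0.789 | <0.001 | 1 | 2 | ○ |
| BET | 0.44 | <0.05 | 1 | 3 | ○ |
| TPEP | 0.075 | 0.68 | 1 | 3 | ○ |
| Population | 0.449 | <0.01 | 1 | 3 | ○ |
| UPR | 0.449 | <0.01 | 1 | 3 | ○ |
| GDP | 0.45 | <0.01 | 1 | 3 | ○ |
| APF | 0.517 | <0.01 | 1 | 3 | ○ |
| MVR | 0.486 | <0.01 | 1 | 3 | ○ |
| AVMT | 0.572 | <0.001 | 1 | 3 | ● |
| RM | 0.418 | <0.05 | 2 | 3 | ○ |
| MVLD | 0.448 | <0.05 | 1 | 3 | ○ |
| AAT | 0.245 | 0.18 | 7 | 4 | ○ |
| RPC | −0.056 | 0.76 | 9 | 5 | ○ |
| RGT | 0.194 | 0.29 | 10 | 5 | ○ |
| EUT | 0.459 | <0.05 | 3 | 6 | ○ |
| GP | 0.295 | 0.1 | 4 | 6 | ○ |
| APDI | 0.503 | <0.01 | 1 | 6 | ● |
| COPP | 0.272 | 0.13 | 5 | 6 | ○ |
| UR | −0.483 | <0.01 | 6 | 7 | ● |
| APC | −0.18 | 0.32 | 8 | 8 | ○ |
Note: ● implies inclusion and ○ signifies exclusion.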
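To make the per-cluster logic behind Table 4 concrete, the sketch below applies one simple rule: within each cluster, discard variables whose correlation with TCT is not significant at the 0.05 level, then keep the variable with the largest absolute correlation. This rule is an assumption made for illustration, not the authors' documented procedure (which also involves Boruta rankings); it happens to return the five predictors reported in the abstract. The `ROWS` list and `select_per_cluster` helper are hypothetical names, and the values are transcribed from the table.

```python
# (variable, correlation with TCT, significant at p < 0.05, cluster), from Table 4.
ROWS = [
    ("TPI", 0.684, True, 1), ("TPEI", 0.711, True, 1),
    ("FFT", 0.881, True, 2), ("TPEC", 0.789, True, 2),
    ("BET", 0.44, True, 3), ("TPEP", 0.075, False, 3),
    ("Population", 0.449, True, 3), ("UPR", 0.449, True, 3),
    ("GDP", 0.45, True, 3), ("APF", 0.517, True, 3),
    ("MVR", 0.486, True, 3), ("AVMT", 0.572, True, 3),
    ("RM", 0.418, True, 3), ("MVLD", 0.448, True, 3),
    ("AAT", 0.245, False, 4),
    ("RPC", -0.056, False, 5), ("RGT", 0.194, False, 5),
    ("EUT", 0.459, True, 6), ("GP", 0.295, False, 6),
    ("APDI", 0.503, True, 6), ("COPP", 0.272, False, 6),
    ("UR", -0.483, True, 7),
    ("APC", -0.18, False, 8),
]


def select_per_cluster(rows):
    """Keep, per cluster, the significant variable with the largest |correlation|.

    Illustrative rule only; clusters with no significant variable contribute nothing.
    """
    best = {}
    for name, corr, significant, cluster in rows:
        if not significant:
            continue
        if cluster not in best or abs(corr) > abs(best[cluster][1]):
            best[cluster] = (name, corr)
    return [best[cluster][0] for cluster in sorted(best)]


selected = select_per_cluster(ROWS)
print(selected)
```

Clusters 4, 5, and 8 contain no significant variable, so under this rule they contribute no predictor, which is consistent with eight clusters yielding five selected features.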
Table 5. The performance of ML methods on TCT prediction based on different feature sets.
| ML | Metrics | Proposed FS, Training | Proposed FS, Testing | Original, Training | Original, Testing | RFE-RF, Training | RFE-RF, Testing |
| --- | --- | --- | --- | --- | --- | --- | --- |
| XGBoost | StD | 0.045 | 58.823 | 0.003 | 39.562 | 0.002 | 39.334 |
| | RMSE | 0.046 | 59.280 | 0.003 | 47.849 | 0.002 | 58.574 |
| | MAE | 0.046 | 45.227 | 0.002 | 44.540 | 0.002 | 52.498 |
| | MAPE | 1.99 × 10^−5 | 0.0243 | 1.32 × 10^−6 | 0.023 | 1.22 × 10^−6 | 0.027 |
| MLP | StD | 135.220 | 37.638 | 169.130 | 37.408 | 143.540 | 38.363 |
| | RMSE | 1796.848 | 1861.018 | 1760.353 | 1848.11 | 1788.91 | 1860.122 |
| | MAE | 1792.046 | 1860.637 | 1752.73 | 1847.738 | 1783.030 | 1859.74 |
| | MAPE | 0.987 | 0.993 | 0.964 | 0.986 | 0.982 | 0.993 |
| SVR | StD | 8.880 | 94.998 | 16.890 | 63.669 | 20.707 | 170.830 |
| | RMSE | 8.992 | 96.994 | 16.978 | 89.629 | 26.676 | 89.045 |
| | MAE | 5.433 | 79.123 | 8.346 | 63.084 | 17.325 | 64.5 |
| | MAPE | 0.002 | 0.042 | 0.004 | 0.034 | 0.009 | 0.035 |
| Elastic Net | StD | 28.977 | 45.257 | 13.837 | 41.652 | 35.404 | 48.434 |
| | RMSE | 28.977 | 45.533 | 13.837 | 136.464 | 36.873 | 55.999 |
| | MAE | 22.462 | 30.6 | 9.821 | 129.952 | 29.393 | 54.405 |
| | MAPE | 0.012 | 0.016 | 0.005 | 0.069 | 0.015 | 0.029 |
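For readers who want to reproduce the error metrics reported in Table 5, the following minimal sketch computes RMSE, MAE, and MAPE from paired observations and predictions. It assumes the standard definitions and that MAPE is reported as a fraction rather than a percentage (so 0.016 corresponds to 1.6%), which matches the magnitudes in the table but is not stated explicitly there. The sample values are invented for illustration and are not taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (not multiplied by 100)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only (MtCO2); not data from the paper.
observed = [1800.0, 1900.0, 2000.0]
predicted = [1820.0, 1880.0, 2010.0]

print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"MAE  = {mae(observed, predicted):.3f}")
print(f"MAPE = {mape(observed, predicted):.4f}")
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE on the same data, and a wide gap between the two points to a few large errors.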
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sahraei, M.A.; Li, K.; Qiao, Q. A Multi-Stage Feature Selection and Explainable Machine Learning Framework for Forecasting Transportation CO2 Emissions. Energies 2025, 18, 4184. https://doi.org/10.3390/en18154184
