Exploring Patterns of Transportation-Related CO2 Emissions Using Machine Learning Methods

Xiaodong Li; Ai Ren; Qi Li

doi:10.3390/su14084588

¹ School of Economics and Management, Anhui Polytechnic University, Wuhu 241000, China

² School of Business, State University of New York at New Paltz, New Paltz, NY 12561, USA

* Author to whom correspondence should be addressed.

Sustainability 2022, 14(8), 4588; https://doi.org/10.3390/su14084588

This article belongs to the Special Issue Low Carbon Economy, Green Innovation, Renewable Energy and Sustainable Development

Abstract

While the transportation sector is one of the largest economic growth drivers for many countries, the adverse impacts of transportation on air quality are also well noted, especially in developing countries. Carbon dioxide (CO₂) emissions are one of the direct results of a transportation sector powered by burning fossil-based fuels. Detailed knowledge of CO₂ emissions produced by the transportation sectors in various countries is essential for these countries to revise their future energy investments and policies. In this framework, three machine learning algorithms, ordinary least squares regression (OLS), support vector machine (SVM), and gradient boosting regression (GBR), are used to forecast transportation-based CO₂ emissions. Both socioeconomic factors and transportation factors are included as features in the study. We study the top 30 CO₂ emissions-producing countries, including the Tier 1 group (the top five countries, accounting for 61% of global CO₂ emissions production) and the Tier 2 group (the next 25 countries, accounting for 35% of total CO₂ emissions production). We evaluate our model using four-fold cross-validation and report four frequently used statistical metrics (R², MAE, rRMSE, and MAPE). Of the three machine learning algorithms, the GBR model with features combining socioeconomic and transportation factors (GBR_ALL) has the best performance, with an R² value of 0.9943, rRMSE of 0.1165, and MAPE of 0.1408. We also find that both transportation features and socioeconomic features are important for transportation-based CO₂ emission prediction. Transportation features are more important when modeling all 30 countries together, while socioeconomic features (especially GDP and population) are more important when modeling Tier 1 and Tier 2 countries separately.

Keywords:

carbon dioxide emission prediction; transportation sector; socioeconomic factors

1. Introduction

Global climate change has been recognized as the biggest threat to all living beings in the sea, on land, and in the atmosphere [1]. Unprecedented challenges, such as extreme weather, loss of species, shifting rainfall patterns, melting glaciers, and rising global mean sea level have affected the survival and growth of humanity globally [2]. According to the IPCC Fifth Assessment Report (2014), the concentration of carbon dioxide (CO₂) in the atmosphere has increased by 40% since 1750, compared with the 31% increase reported in 2001 [2,3]. As of 2010, the transportation sector accounted for 14% of global greenhouse gas emissions [3].

The transportation sector plays an essential role in humanity’s activities and affects the global economy. For example, the transportation of people and goods provides people with mobility, sustainable daily lives, local and international merchandise trade, and economic development [4,5,6,7]. However, most activities in the transportation sector are fueled by fossil-based energy sources, which are not renewable [8]. This implies that, while contributing to the global economy, the transportation sector has negative impacts on global climate change.

Considering its scale and the growing speed in energy consumption, the transportation sector has become the second largest CO₂ emitter in the world [9]. The transportation of people and goods accounts for about 25% of total world energy consumption [10], and about 25% of greenhouse emissions in the European Union (EU) [11]. From 1990 to 2015, the share of CO₂ emissions from the transportation sector in EU countries increased from 32% to 45% [12]. Given the continued growth in fossil-based energy usage and transportation-based CO₂ emissions, building a sustainable transportation sector and reducing its CO₂ emissions have become critical, especially for the 196 parties that adopted the Paris Agreement on December 12, 2015 [13]. Therefore, in this study, we focus on automatic prediction methods for transportation sector-related CO₂ emissions, and their factor analysis.

The contribution of this paper is threefold: (1) we study transportation-based CO₂ emissions at the global level using data from 30 countries; (2) we study feature sets that include not only transportation-related features (air, railroad, and highway vehicles), but also socioeconomic features (population, GDP, and GDP from different sectors); (3) we find that CO₂ emission patterns differ across countries: transportation-related features are important in the models for all countries, while socioeconomic features are more important in the top five CO₂ emission countries (Tier 1) and the next 25 CO₂ emission countries (Tier 2).

Literature Review

Existing studies have examined the impact of transportation activities on economic development. The causal relationship between logistics development and economic growth in both the short and long term was studied using a dynamic structural model [14,15,16]. The causal relationship between transportation and income was investigated using a panel dataset [17] that included the data of 15 EU countries from 1970 to 2008, and the authors found an endogenous relationship between income and transportation. The impact of roadway and railway infrastructure on India’s economic growth was studied using the vector error correction model [18] and weak short-term effects were found. By examining data from India from 1970 to 2010, these authors found unidirectional causality from railway transportation to economic growth. Another study investigating data from Turkey from 1970 to 2005 showed the impact of highway infrastructure on Turkey’s economic growth [19]. Similarly, the causal relationships between transportation infrastructure investment and economic growth were also studied in China using time series data from 1978 to 2008 [20]. However, unlike previous findings from other countries, these authors found unidirectional Granger causality from economic growth to transportation sector infrastructure development at the national level. By grouping the 107 countries in the dataset into high-income, middle-income, and low-income countries, Liddle and Lung found Granger causality runs from GDP per capita to transportation energy consumption per capita by analyzing International Energy Agency data from 1971 to 2009 using panel methods [21]. These authors also found sufficient evidence that many countries exhibited significant Granger causality running from transportation sector energy consumption to GDP. 
Although these results were not entirely consistent [22,23], the existing literature suggests causal relationships between transportation sector infrastructure development and economic growth. Therefore, we included socioeconomic features (including GDP, income level, and GDP from different sectors) in our prediction model.

Another stream of existing literature has studied the connections between transportation sector activities and the related CO₂ emissions. Lakshmanan and Han suggested that the growth in people’s propensity to travel drove up U.S. transportation energy use and related CO₂ emissions from 1970 to 1991. Using a decomposition scheme analysis, the authors also revealed that freight transportation played a more important role than passenger transportation in U.S. transportation energy use and CO₂ emissions [24]. Similarly, Scholl et al. used a comparative analysis approach and studied the changes in CO₂ emissions from passenger transportation activities in nine OECD countries [25]. By analyzing the data from 1973 to 1992, the authors observed a sharp increase in travel-related energy use and CO₂ emissions from travel-related activities and discussed the impact of fuel shifts within the transportation sector on the increase of CO₂ emissions. In a study conducted by Lu et al., highway vehicle activity was identified as the major driving factor that increased transportation CO₂ emissions from 1990 to 2002 in Germany, Taiwan, South Korea, and Japan [26]. Similar studies of transportation sector activity in selected Asian countries or regions suggested that travel-related activity was one of the major potential factors increasing CO₂ emissions [27,28,29,30,31]. These studies all suggested that the transportation sector has a direct impact on CO₂ emissions and listed it as the key explanatory variable for CO₂ emissions at the national level. In our study, we included transportation related features (air, railroad, and vehicle transportation) in our prediction models.

To forecast CO₂ emissions, existing studies have adopted different approaches. Some have used time series analysis methods such as exponential smoothing models and ARIMA [32,33,34]. Similar studies used grey models to predict CO₂ emissions in China, Iran, and Turkey [35,36]. Many other studies used time series models to predict CO₂ emissions in China, the U.S., Malaysia, Iran, and Zimbabwe [37,38]. Some studies used neural network methods for CO₂ emission prediction [39]. The gradient boosting decision tree (GBDT) algorithm was also used in predicting CO₂ from envelope renovation projects in Taiwan [40]. The support vector machine model was also used in CO₂ emission prediction in the Chengdu area [41]. All these studies used a dataset from one single nation and did not employ cross-validation using another nation’s dataset to evaluate the model. To fill this gap, this study aims to predict transportation-related CO₂ emissions using socioeconomic features and transportation sector features. We deploy the support vector machine (SVM) model and the gradient boosting regression (GBR) model to compare to the baseline model, the ordinary least squares (OLS) model, in order to find the best model.

The rest of this paper is structured as follows: Section 2 provides the details of the data collection and sources used in our study and summarizes the features used in our model, the descriptive statistics of the dataset, the machine learning (ML) algorithms, and the evaluation methods; Section 3 presents and discusses the results obtained by our automatic models, including OLS, SVM, and GBR, with three types of features (transportation-related features, or TRAN; socioeconomic features, or SoEco; and a combination of both TRAN and SoEco features, or ALL) for different tiers of CO₂ emissions-producing countries; Section 4 provides further discussion; and Section 5 concludes the study.

2. Materials and Methods

This section provides detailed information about the dataset and the data sources, ML algorithms, and the model evaluation method.

2.1. Dataset

The World Development Indicators (WDI) constitute the World Bank’s primary collection of development indicators, compiled from officially recognized international sources. In this study, we used the WDI 2020 release obtained from Kaggle [42]. The WDI dataset provides access to approximately 1437 indicators for 263 countries or regions, from 1960 to 2020. The database helps users find information related to development, both current and historical. The topics covered in the WDI include poverty and inequality (poverty, prosperity, consumption, income distribution), people (population, education, labor, health, gender), the environment (agriculture, climate change, energy, biodiversity, water, sanitation), the economy (growth, economic structure, income and savings, trade, labor productivity), states and markets (business, stock markets, military, communications, transport, technology), and global links (debt, trade, aid dependency, refugees, tourism, migration). We adopted the CO₂ emissions data, most of the transportation-related factors, and socioeconomic factors from the WDI dataset.

The International Organization of Motor Vehicle Manufacturers (OICA) provides a dataset of motor vehicle production statistics obtained from national trade organizations and OICA members or correspondents, including production statistics (1999–2021), sales statistics (2019–2020), and vehicles in use (2005–2015). We adopted the vehicle-in-use data from the OICA dataset.

The United Nations’ (UN’s) Department of Economic and Social Affairs conducts a yearly economic analysis. In its 2014 World Economic Situation and Prospects (WESP) report [43], countries are classified according to their degree of development, as either high income, upper middle income, lower middle income, or low income. The report also categorized countries as either fuel-exporting or not. We adopted these features from this UN dataset.

The WDI dataset does not directly provide transportation-related CO₂ emissions. Instead, it provides the total CO₂ emissions and the percentage of CO₂ emissions from different sectors. Therefore, we calculated transportation-related CO₂ emissions by multiplying the total CO₂ emissions (in kt units) by the percentage of CO₂ emissions from transportation.
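
As a minimal sketch of this derivation, the target variable can be computed per country-year as shown below. The function name and the example figures are hypothetical and are not taken from the WDI data.

```python
def transport_co2_kt(total_co2_kt: float, transport_share_pct: float) -> float:
    """Transportation CO2 (kt) = total CO2 (kt) x transportation share (%) / 100."""
    if not 0.0 <= transport_share_pct <= 100.0:
        raise ValueError("transport_share_pct must be a percentage between 0 and 100")
    return total_co2_kt * transport_share_pct / 100.0


# Hypothetical country-year: 500,000 kt in total, 20% of it from transportation.
example_kt = transport_co2_kt(500_000.0, 20.0)
print(example_kt)
```

The range check is an added safeguard (an assumption of this sketch), since a share outside 0 to 100 would indicate a data problem rather than a valid observation.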

Based on an overview of the existing literature, we chose two types of features, socioeconomic and transportation factors, as the independent variables.

2.1.1. Socioeconomic Features

With respect to economic features, we selected GDP, which shows the strong relationship between the economy and transportation [16,17,18,19,20,21,22]. We also used the value added from different sectors in order to represent the level of socioeconomic features. Therefore, we used unemployment, value added from agriculture, value added from industry, and value added from services. We also included features indicating whether countries are developing or developed, their income level, and whether they are fuel-exporting.

2.1.2. Transportation Features

According to global transportation emissions statistics sourced from the International Energy Agency (IEA) in 2018, road travel accounts for nearly three-quarters of transportation-related emissions, including the 29.4% that comes from trucks carrying freight [44]. We chose three types of transportation features: air transportation, railway transportation, and vehicle transportation (including passenger and freight transportation).

We compared three feature combinations: transportation features only (TRAN), socioeconomic features only (SoEco), and transportation features plus socioeconomic features (ALL). Our goal was to evaluate whether inclusion of the socioeconomic features helps to improve the prediction of CO₂ emissions from transportation. Table 1 shows the details (descriptions, indicator abbreviations, and sources) of the features we used in the model.

Table 1. Summary of Features.

2.1.3. CO₂ Emissions by Sector, Year, and Country

Figure 1 describes the overall yearly increase of CO₂ emissions in our dataset, including five different sectors: electricity and heat production (ETOT), manufacturing industries and construction (MANF), transportation (TRAN), residential buildings and commercial and public services (BLDG), and other sectors (OTHX) including agriculture. As shown in Figure 1, since 1971, the CO₂ emissions from ETOT increased fivefold and have been the biggest part of all CO₂ emissions ever since. The two sectors with the second highest CO₂ emissions are MANF and TRAN; their lines on the graph cross several times, and emissions in these sectors have tripled in the past 30 years. The two remaining sectors are BLDG and OTHX, which were stable in this 30-year range. In this paper, we focused on CO₂ emissions from the TRAN sector.

Figure 1. Yearly CO₂ emissions by sector. Transportation-based CO₂ emissions have tripled in the past 30 years.

Figure 2 shows total CO₂ emissions (in kt) and transportation-based CO₂ emissions by countries, from 2005 to 2014. According to the percentage of total global CO₂ emissions, we divided the countries into two tiers: those that contribute more than 4% of the global total (blue, Tier 1) and those that contribute less than 4% (grey, Tier 2). We found that the trends of CO₂ emissions in total are very different from CO₂ emissions from transportation. For example, although India is in Tier 1 for total CO₂ emissions, it is in Tier 2 for TRAN CO₂ emissions, which means that India has a relatively lower TRAN portion of CO₂ emissions than other countries.

Figure 2. Total CO₂ emissions vs. CO₂ emissions from TRAN by country. CO₂ emissions (in kt) are summarized from 2005–2014 data. Percentages represent each country’s contribution to global total CO₂ emissions. (a) Total CO₂ emissions; (b) CO₂ emissions from TRAN.

In this paper, we defined our country tiers according to their contribution to the total CO₂ emissions because we proposed to study the biggest contributors to global CO₂ emissions. Table 2 summarizes the countries and their associated CO₂ emission tiers. The top 30 countries emitted 96% of global CO₂ in 2005–2014, so we focused our study on these 30 countries.

Table 2. Summary of transportation-based CO₂ emissions from the top 30 CO₂ emissions-producing countries.

Within these top 30 CO₂ emissions-producing countries, we further defined Tier 1 countries, which are the top five countries and contribute 61% of global CO₂ emissions, and Tier 2 countries, which are the other 25 countries and contribute 35% of global CO₂ emissions. As seen in Table 2, which compares total CO₂ emissions, some countries have higher proportions of CO₂ emissions from transportation, or even higher proportions per capita than total CO₂ emissions. For example, the U.S., Saudi Arabia, and the United Arab Emirates produce, on average, over 4 kt CO₂ emissions per capita yearly, while China, India, and Pakistan average less than 0.5 kt per capita yearly. We assumed that the number of vehicles in use is an important factor for the first category and that the country’s population is an important factor for the second category. Therefore, we introduced the population and vehicles-in-use as factors in building our model for transportation-based CO₂ emissions prediction.

The descriptive statistics of the dataset used in our study are provided in detail in Table 3. This dataset contains 300 instances (30 countries and 10 years). On average, countries emitted 173,770 kt of CO₂ yearly from 2005–2014, the average yearly GDP per country was 1.72 × 10¹², and the average country population was 1.60 × 10⁸.

Table 3. Descriptive statistics of the dataset.

However, we found many values missing from the dataset summarized in Table 3. We analyzed the missing values and detected two scenarios. First, for certain countries, a feature may be missing from the data entirely. For example, the United Arab Emirates has a very limited railway network, so the IS.RRS.TOTL.KM feature is 0 for the UAE. Second, some feature values are only missing in certain years. In that scenario, we replaced the missing year information with the following year’s value for that country.
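
The second scenario can be sketched as a backward fill over one country's yearly series. The function name is hypothetical, and two behaviors are assumptions of this sketch rather than details stated in the paper: a run of consecutive gaps takes the next available later value, and gaps with no later observation are left missing.

```python
from typing import List, Optional


def fill_from_following_year(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing entry (None) with the value from the following year.

    `values` holds one feature for one country, ordered by ascending year.
    Assumption: if the following year is also missing, the next available
    later value is used; trailing gaps stay None.
    """
    filled = list(values)
    next_known: Optional[float] = None
    # Walk backwards so the most recent later observation is always at hand.
    for idx in range(len(filled) - 1, -1, -1):
        if filled[idx] is None:
            filled[idx] = next_known
        else:
            next_known = filled[idx]
    return filled


# Hypothetical six-year series with gaps.
series = [None, 10.0, None, None, 13.0, None]
filled_series = fill_from_following_year(series)
print(filled_series)
```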

To avoid biased estimations due to the magnitude of scale differences between features, we normalized the numeric features by converting each feature as follows,

x_{i}^{\prime} = \frac{x_{i} - x_{min}}{x_{max} - x_{min}}

where x_{i} refers to the actual value of a feature and x_{min} is the minimum value among all x_{i} values of the feature in the dataset. Similarly, x_{max} represents the maximum x_{i} value for this feature in the dataset. Accordingly, x_{i}^{\prime} is the normalized x_{i} value, and its value range is between 0 and 1. Instead of using Z-score normalization, we preferred to have feature values between 0 and 1.
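
A minimal sketch of this min-max normalization for a single feature column follows. The function name and sample values are hypothetical, and mapping a constant feature to zeros is an assumption made here to avoid division by zero.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale one feature column to [0, 1] using (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]


normalized = min_max_normalize([2.0, 4.0, 10.0])
print(normalized)
```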

With these normalized feature values, we calculated the correlation between the features and our target variable, transportation-based CO₂ emissions (i.e., EN.CO2E.TRAN.KT), as shown in the heatmap of Pearson correlation in Figure 3. We found that IS.AIR.DPRT, IS.AIR.GOOD.MT.K1, NY.GDP.MKTP.CD(10^11), IS.AIR.PSGR, IS.RRS.TOTL.KM, IS.VHL.COM, and IS.VHL.PSGR have very strong relationships with EN.CO2E.TRAN.KT. NV.IND.EMPL.KD, NV.AGR.EMPL.KD, NV.SRV.EMPL.KD, and SP.POP.TOTL have weak relationships with EN.CO2E.TRAN.KT. Year and SL.UEM.TOTL.ZS have very weak relationships with EN.CO2E.TRAN.KT. The correlation coefficient can be interpreted as follows, based on existing literature [45,46].

Figure 3. Pearson correlation matrix between the features and the target variable (transportation-based CO₂ emissions).

|r| ≥ 0.8, very strong relationship;

0.6 ≤|r| < 0.8, strong relationship;

0.4 ≤|r| < 0.6, moderate relationship;

0.2 ≤|r| < 0.4, weak relationship;

|r| < 0.2, very weak relationship.

Accordingly, those inputs with a correlation coefficient over 0.2 were used to train the ML algorithms for predicting transportation-based CO₂ emissions.
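
The selection rule can be illustrated with a short sketch that computes the Pearson coefficient and keeps features whose absolute correlation with the target exceeds 0.2. The feature names and values below are hypothetical; applying the threshold to the absolute value |r| is an assumption consistent with the interpretation scale above.

```python
import math
from typing import Dict, List


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # Assumption: treat a constant column as uncorrelated.
    return cov / math.sqrt(var_x * var_y)


def select_features(features: Dict[str, List[float]], target: List[float],
                    threshold: float = 0.2) -> List[str]:
    """Keep the names of features with |r| above the threshold."""
    return [name for name, column in features.items()
            if abs(pearson_r(column, target)) > threshold]


# Hypothetical data: feature_a tracks the target, feature_b does not.
target = [1.0, 2.0, 3.0, 4.0]
features = {
    "feature_a": [2.0, 4.0, 6.0, 8.0],
    "feature_b": [1.0, -1.0, -1.0, 1.0],
}
selected = select_features(features, target)
print(selected)
```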

We also examined the correlations between categorical values (countries, country tiers, development levels, fuel-exporting characteristics, and income levels) and transportation-based CO₂ emissions, as shown in Table 4. We found that most countries have a weak correlation with transportation-based CO₂ emissions, except for the U.S. and China.

Table 4. Pearson correlation coefficient between the categorical variables and transportation-based CO₂ emissions.

2.2. Method

The data analysis showed that CO₂ emissions are continuous values, so we used regression models to predict continuous values. We used three ML methods, OLS (the baseline system), SVM, and GBR, to build automated ML models for prediction.

2.2.1. Ordinary Least Squares

OLS is a linear regression method that estimates the coefficients of a set of explanatory variables by minimizing the sum of squared differences between the observed values of the dependent variable and the predicted values.

Consider a set of data points G = \{(X_{j}, y_{j})\}_{j = 1}^{n}, where X_{j} is the input vector of data point j with m features, y_{j} is the desired value (CO₂ emissions from the transportation sector), and n is the number of data points in the dataset. In a linear regression model, y_{j} is a linear combination of the features in X_{j}, as shown in Equation (1),

y_{j} = α + \sum_{i = 1}^{m} β_{i} x_{ij} + ε_{j}

(1)

where ε_{j} is the error term and α, β are the true parameters of the regression. The goal of linear regression is to find the parameters α and β that minimize the sum of squared errors. That is,

\text{minimize}_{α, β_{i}} \sum_{j = 1}^{n} {(y_{j} - α - \sum_{i = 1}^{m} β_{i} x_{ij})}^{2}

(2)

or, equivalently,

\text{minimize}_{α, β_{i}} \sum_{j = 1}^{n} ε_{j}^{2}

(3)

This procedure is called ordinary least squares, or OLS.

OLS is the most popular approach for regression with feature coefficients [47].
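
To make Equations (1) to (3) concrete, the sketch below fits α and β by solving the normal equations with plain Gaussian elimination. This is one standard way to minimize the squared error and is not necessarily the solver the authors used; the function names and the toy data are hypothetical.

```python
from typing import List, Tuple


def solve_linear_system(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ols(X: List[List[float]], y: List[float]) -> Tuple[float, List[float]]:
    """Return (alpha, betas) minimizing the sum of squared errors in Equation (2)."""
    Z = [[1.0] + list(row) for row in X]  # leading 1 carries the intercept alpha
    k = len(Z[0])
    ZtZ = [[sum(z[a] * z[c] for z in Z) for c in range(k)] for a in range(k)]
    Zty = [sum(z[a] * yj for z, yj in zip(Z, y)) for a in range(k)]
    coef = solve_linear_system(ZtZ, Zty)
    return coef[0], coef[1:]


# Toy data generated without noise from y = 1 + 2*x1 - 3*x2.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, -2.0, 0.0, 2.0]
alpha, betas = fit_ols(X, y)
print(alpha, betas)
```

Because the toy targets contain no noise, the fitted parameters should recover the generating coefficients up to floating-point error.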

2.2.2. Support Vector Machine

The support vector machine model, or SVM, is another supervised learning method that can be applied to regression and classification problems. After it was introduced by AT&T Bell Laboratories in 1992 [48] for a binary classification problem, it drew growing interest from researchers and has been extended to various related problems, including regression [49], high-dimension classification problems [50], clustering problems [51], multiclass problems [52], and Bayesian data augmentation [53]. It was also applied to CO₂ emission prediction in the Chengdu area [41].

The goal of SVM regression is to find a function f(X) that deviates by at most ε from the actually obtained targets y_{j} for all the training data, while also remaining as flat as possible. In other words, errors are ignored as long as they are less than ε, but deviations larger than ε are not accepted. SVM regression approximates the function f(X) using the following form

f(X) = \sum_{i = 0}^{m} ω_{i} x_{i}

(4)

where ω_{i} represents the weight for the feature x_{i}, and ω_{0} is the bias with x_{0} = 1. Flatness in this case means that one seeks a small ω. One way to ensure this is to minimize the squared norm, i.e., \|ω\|^{2} = ω \cdot ω, which is the dot product of ω with itself. Therefore, we can write the problem as a convex optimization problem:

\text{minimize} \ \frac{1}{2} \|ω\|^{2} \quad \text{subject to} \quad \begin{cases} y_{j} - ω \cdot x_{j} \leq ε \\ ω \cdot x_{j} - y_{j} \leq ε \end{cases}

(5)

The assumption in Equation (5) is that a function f exists that keeps the difference between the real value y_{j} and the estimated value f(X_{j}) within ε for every training point or, in other words, that the convex optimization problem is feasible. Lagrange multipliers and optimality constraints are used to solve Equation (5). One key requirement in achieving high accuracy and high performance using the SVM method is to select the proper kernel function, C, and ε parameters. Here, we selected those parameters using the grid search technique. Accordingly, the best results for CO₂ prediction were obtained when the kernel function type was dot (i.e., the linear dot-product kernel) and C was equal to 0.01.
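
The sketch below illustrates the pieces of Equation (5): the ε-tube constraints, the ε-insensitive loss they imply, and a soft-margin objective weighted by C. The soft-margin form is the standard textbook relaxation and is an assumption here, since the paper states only the hard-constraint problem and the chosen value of C. All names and data are hypothetical, and no optimizer is included.

```python
from typing import List


def dot(w: List[float], x: List[float]) -> float:
    """Dot product used both for predictions and for the squared norm."""
    return sum(wi * xi for wi, xi in zip(w, x))


def epsilon_insensitive_loss(y_true: float, y_pred: float, eps: float) -> float:
    """Zero inside the eps-tube, linear in the excess deviation outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def is_feasible(w: List[float], X: List[List[float]], y: List[float], eps: float) -> bool:
    """Check both constraints of Equation (5): |y_j - w . x_j| <= eps for all j."""
    return all(abs(yj - dot(w, xj)) <= eps for xj, yj in zip(X, y))


def soft_margin_objective(w: List[float], X: List[List[float]], y: List[float],
                          C: float, eps: float) -> float:
    """0.5 * ||w||^2 + C * sum of eps-insensitive losses (assumed standard form)."""
    penalty = sum(epsilon_insensitive_loss(yj, dot(w, xj), eps) for xj, yj in zip(X, y))
    return 0.5 * dot(w, w) + C * penalty


# Each row starts with x_0 = 1 so that w[0] plays the role of the bias.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0.05, 1.0, 2.2]
w = [0.0, 1.0]

print(is_feasible(w, X, y, eps=0.1))   # third point deviates by about 0.2
print(is_feasible(w, X, y, eps=0.25))
print(soft_margin_objective(w, X, y, C=0.01, eps=0.1))
```

Following the text, the bias is folded into ω through x_0 = 1, so this sketch regularizes it together with the other weights; many implementations leave the bias unregularized.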

2.2.3. Gradient Boosting Regression

Unlike many ML models, which focus on high-quality prediction generated by a single model, boosting algorithms seek to improve prediction power by training a sequence of weak models, each compensating for the weaknesses of its predecessors. One such approach, the gradient boosting regression (GBR) model, uses an ensemble of weak prediction models, such as decision trees, to make predictions [54]. GBR models can be used in both regression and classification tasks. Tsay et al. used the GBR method to model CO₂ emissions [40].

The goal of GBR is to find the function \hat{y} = F(X) whose predictions minimize the mean squared error loss between the actual y and the predicted \hat{y}, as in Equation (6):

\text{minimize} \ \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{y}_{i})}^{2}

(6)

where i is the index over a training set of size n.

The gradient boosting regression tree builds the model in a stage-wise fashion and updates the model by minimizing the expected value of a chosen loss function. Generic gradient boosting at the m-th step fits a decision tree h_{m}(x) to the pseudo-residuals. Let J_{m} be the number of its leaves. The tree partitions the input space into J_{m} disjoint regions R_{1m}, R_{2m}, \dots, R_{J_{m} m} and predicts a constant value in each region. Using the indicator notation, the output of h_{m}(x) for input x can be written as the following sum:

h_{m}(x) = \sum_{j = 1}^{J_{m}} b_{jm} I_{R_{jm}}(x)

where b_{jm} is the value predicted in region R_{jm}, and

I_{R_{jm}}(x) = \begin{cases} 1, & \text{if } x \in R_{jm} \\ 0, & \text{otherwise.} \end{cases}

With a regression tree as the base learner h_{m}(x), each stage of the generic gradient boosting method updates the model by adding the new tree, scaled by a gradient descent step size.

The parameter b (distinct from the leaf values b_{jm}) is also referred to as the learning rate and controls the contribution of each base model by shrinking it by a factor of 0 \leq b \leq 1. There is a tradeoff between the number of iterations and the learning rate. With the same number of iterations, a larger learning rate tends to lead to a larger error. The more iterations occur, the better the performance becomes. Therefore, we preferred a small b. Based on our experience, we chose a learning rate of 0.1.
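
The stage-wise procedure can be sketched with the simplest possible base learner, a two-leaf regression stump on a single feature, fitted to the residuals of the squared-error loss. This is an illustration only: the trees in the study are deeper and use all features, and the function names and toy data are hypothetical.

```python
from typing import Callable, List


def fit_stump(xs: List[float], residuals: List[float]) -> Callable[[float], float]:
    """Fit a two-leaf regression tree: pick the split that minimizes squared error."""
    candidates = sorted(set(xs))
    if len(candidates) < 2:
        raise ValueError("need at least two distinct feature values to split")
    best = None
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    # Each leaf predicts a constant: the mean residual of its region.
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs: List[float], ys: List[float], n_stages: int = 50,
                   learning_rate: float = 0.1) -> Callable[[float], float]:
    """Stage-wise boosting for squared loss; pseudo-residuals are y - prediction."""
    f0 = sum(ys) / len(ys)  # initial constant model
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stage's contribution by the learning rate.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(s(x) for s in stumps)


def mse(model: Callable[[float], float], xs: List[float], ys: List[float]) -> float:
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy step-shaped data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

mse_constant = mse(lambda x: sum(ys) / len(ys), xs, ys)
mse_5 = mse(gradient_boost(xs, ys, n_stages=5), xs, ys)
mse_50 = mse(gradient_boost(xs, ys, n_stages=50), xs, ys)
print(mse_constant, mse_5, mse_50)
```

On this toy data the training error shrinks as stages are added, which mirrors the tradeoff described above: a small learning rate needs more iterations to reach the same fit.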

Another parameter, tree complexity, also influences model performance. The algorithm restricts all trees to the same size, which is the number of features divided by the number of tree models. The size of the trees thus reflects the maximum depth of variable interactions. In our experiment, we set the maximum tree depth to 3, the maximum number of features to none (no limitation), and the maximum number of leaf nodes to none (no limitation).

2.3. Evaluation

We evaluated the performance of our prediction results obtained from the ML method by employing four statistical metrics that are heavily used in the literature. These metrics are mean absolute error (MAE), relative root mean square error (rRMSE), mean absolute percentage error (MAPE), and the determination coefficient (R²). We preferred to evaluate the percentage improvement of one model by comparing it to another model. The definitions of these four metrics are set out below.

2.3.1. MAE

MAE is a metric that evaluates the absolute error between the predicted value and the actual value, and MAE is also used as the model goal in our GBR algorithm [55]. MAE is defined as follows,

MAE = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - \hat{y}_{i}|

where n is the number of samples, y_{i} is the actual value (real CO₂ emissions), and \hat{y}_{i} is the model-predicted CO₂ emissions of the i-th sample. MAE takes values from zero to +\infty, and small MAE values are desirable [56].

2.3.2. rRMSE

rRMSE, the relative RMSE, is obtained by dividing the RMSE by the mean of the actual values.

rRMSE = \frac{\sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{y}_{i})}^{2}}}{\bar{y}} \times 100

rRMSE is expressed as a percentage of the mean actual value and is bounded below by 0; values above 100% are possible when the RMSE exceeds the mean. A result becomes more desirable as it approaches zero [57,58].

2.3.3. MAPE

MAPE reflects the size of the errors as a percentage. It is a statistical benchmark for how accurate a prediction model is, since it is scale-independent and interpretable [59].

MAPE = \frac{1}{n} \sum_{i = 1}^{n} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right|

A smaller MAPE value means that prediction results are more desirable.

2.3.4. R²

R² is the most important index for verifying the accuracy of the predicted result of a regression algorithm, and it typically falls in the range [0, 1]. It indicates how well the trend of the model results tracks the trend of the actual data with a normalized value [58,60]. The definition of R² is

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - \hat{y}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

where \bar{y} is the mean of the actual values y_{i}. An R² value of 1 would mean that the regression model makes predictions without any error. Therefore, the larger the R² value, the better the model fitting result.
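
The four metrics of Sections 2.3.1 to 2.3.4 can be written directly from their definitions, as in the sketch below. The function names and the sample values are hypothetical and are not results from the study.

```python
import math
from typing import List


def mae(actual: List[float], predicted: List[float]) -> float:
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def rrmse(actual: List[float], predicted: List[float]) -> float:
    """RMSE divided by the mean actual value, times 100."""
    n = len(actual)
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)
    return rmse / (sum(actual) / n) * 100.0


def mape(actual: List[float], predicted: List[float]) -> float:
    """Mean absolute percentage error as a fraction (actual values must be nonzero)."""
    return sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)


def r_squared(actual: List[float], predicted: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted values.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(mae(actual, predicted), rrmse(actual, predicted),
      mape(actual, predicted), r_squared(actual, predicted))
```

Note that this `mape` returns a fraction, matching the formula as written; multiply by 100 to read it as a percentage.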

2.3.5. N-fold Cross-Validation

Cross-validation is primarily used in applied ML to estimate the performance of a model on unseen data. That is, it uses a limited sample in order to best estimate how the model is generally expected to perform when used to make predictions on data that was not used during the model’s training. The regression model performance in our study was evaluated by applying an n-fold cross-validation process [61]. The validation procedure has a single parameter n that refers to the number of subsets that a given data sample is to be split into. As such, the procedure is often called n-fold cross-validation. The general procedure is as follows:

1. Randomly shuffle the dataset and split the dataset into n subsets.
2. For each subset in the n subsets:
- Take the subset as a holdout or test dataset.
- Take the remaining n–1 subsets as a training dataset.
- Fit a model on the training set and evaluate it on the test set.
- Retain the evaluation score and discard the model.
3. Average the evaluation scores of the n iterations.

The commonly chosen n is 4 or 10. In this study, we performed fourfold cross-validation for our experiments.
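
The procedure above can be sketched in a few lines. The helper names, the fixed random seed, and the toy mean-predictor model are all hypothetical and stand in for the study's actual models; only the dataset size of 300 instances and the four folds come from the text.

```python
import random
from typing import Callable, List, Sequence


def k_fold_indices(n_samples: int, n_folds: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the sample indices and deal them into n_folds disjoint subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(X: Sequence, y: Sequence[float], n_folds: int,
                   fit: Callable, score: Callable, seed: int = 0) -> float:
    """Hold out each fold once, train on the rest, and average the scores."""
    scores = []
    for test_idx in k_fold_indices(len(X), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
        # The fitted model is discarded here; only its score is kept.
    return sum(scores) / len(scores)


# Toy stand-ins: the "model" is the training mean, scored by MAE.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def mae_score(model, X_test, y_test):
    return sum(abs(v - model) for v in y_test) / len(y_test)


X = [[float(i)] for i in range(300)]   # 300 instances, as in the dataset
y = [float(i % 2) for i in range(300)]
folds = k_fold_indices(len(X), 4)
average_mae = cross_validate(X, y, 4, fit_mean, mae_score)
print([len(f) for f in folds], average_mae)
```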

3. Results

We analyzed the ML method to predict transportation-based CO₂ emissions for all of the top 30 CO₂ emissions countries, as well as for Tier 1 and Tier 2 countries. The results are based on fourfold cross-validation.

3.1. Prediction of Transportation-Based CO₂ Emissions Using ML Methods

Table 5 shows the performance of three ML methods in predicting the transportation-based CO₂ emissions of 30 countries between 2005 and 2014, with three different feature sets. The statistical metrics of the performance are all based on fourfold cross-validation.

Table 5. Statistical metric evaluation of transportation-based CO₂ emissions prediction for the top 30 CO₂ emissions-producing countries using ML algorithms. Colors indicate the level of performance.

Based on the MAE results in Table 5, the best MAE value is 0.0061, from GBR_ALL. Comparing the three ML models, the best MAE values are 0.0061 (GBR_ALL), 0.0069 (SVM_ALL), and 0.0111 (OLS_ALL) for GBR, SVM, and OLS, respectively. Comparing the three feature combination strategies, the best MAE values are 0.0061 (GBR_ALL), 0.0067 (GBR_TRAN), and 0.0091 (GBR_SoEco) for the ALL, TRAN, and SoEco features, respectively. The differences among the MAE values of the other models are much smaller. Even if we consider RMAE (square root of MAE), the metric values of the different models are still very small. Therefore, it is useful to discuss the other metrics when evaluating how well the algorithms predict transportation-based CO₂ emissions.

Based on the calculated results of the metrics, the $R^{2}$ value for transportation-based CO₂ emission prediction varied between 0.9577 and 0.9943. $R^{2}$ is the metric most frequently used to discuss how well prediction results match the actual data, and it gives an idea of how closely the predicted curves follow those of the actual data. GBR_ALL had the best $R^{2}$ value, 0.9943, while OLS_SoEco had the worst, 0.9577. Of the three ML algorithms, GBR performed better than SVM, and SVM performed better than OLS. Comparing the three feature combination sets, we found that ALL is better than TRAN, and TRAN is better than SoEco.

The MAPE metric gives the percentage error of the prediction results. Previous studies suggested interpreting MAPE values by grouping them into four classes [62]. Accordingly,

When MAPE ≤ 10%, the prediction results can be classified as having high prediction accuracy.

When 10% < MAPE ≤ 20%, the prediction results can be classified as having good prediction accuracy.

When 20% < MAPE ≤ 50%, the prediction results can be classified as having reasonable prediction accuracy.

When MAPE > 50%, the prediction results can be classified as inaccurate.

Based on this commonly used classification, the prediction results of the GBR algorithm can be categorized as having good prediction accuracy: its MAPE values in predicting transportation CO₂ emissions were between 10% and 20% with all three feature combination strategies.
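The MAPE classes listed above can be expressed as a small lookup. The sketch below is illustrative only: the function names are hypothetical, MAPE is computed with the common definition (mean of absolute errors relative to the actual values, in percent), and the two sample inputs are the Table 5 MAPE values of GBR_ALL (0.1408) and OLS_SoEco (0.5405) converted to percentages.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumes no actual value is zero, since each error is divided by it.
    return 100.0 * sum(abs((y - y_hat) / y)
                       for y, y_hat in zip(actual, predicted)) / len(actual)


def classify_mape(mape_percent: float) -> str:
    """Map a MAPE value (in percent) to the four classes listed in the text."""
    if mape_percent <= 10:
        return "high"
    if mape_percent <= 20:
        return "good"
    if mape_percent <= 50:
        return "reasonable"
    return "inaccurate"


print(classify_mape(14.08))  # GBR_ALL in Table 5 (0.1408)
print(classify_mape(54.05))  # OLS_SoEco in Table 5 (0.5405)
```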

rRMSE expresses the error magnitude as a percentage. A classification commonly used in the literature helps interpret algorithm performance in terms of the rRMSE metric [63]. In this classification,

When rRMSE < 10%, the prediction results can be classified as excellent.

When 10% < rRMSE < 20%, the prediction results can be classified as good.

When 20% < rRMSE < 30%, the prediction results can be classified as fair.

When rRMSE > 30%, the prediction results can be classified as poor.

As shown in Table 5, the rRMSE values of the OLS, SVM, and GBR algorithms, when using ALL features, are 17.70%, 13.91%, and 11.65%, respectively. Based on this classification, the prediction results of all three algorithms with ALL features can be classified as good. However, the best model was GBR_TRAN, with an rRMSE score of 11.63%.
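A short sketch of the rRMSE classification follows. Two points are assumptions rather than statements from the text: rRMSE is computed here with a common definition (RMSE divided by the mean of the actual values, in percent), and boundary values such as exactly 10%, which the strict inequalities above leave unassigned, are placed in the worse class. The function names are hypothetical.

```python
import math
from typing import Sequence


def rrmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Relative RMSE in percent (assumed: RMSE / mean of actual values)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)
    return 100.0 * math.sqrt(mse) / (sum(actual) / len(actual))


def classify_rrmse(rrmse_percent: float) -> str:
    """Map an rRMSE value (in percent) to the four classes listed in the text."""
    # Assumption: a value exactly on a boundary falls into the worse class.
    if rrmse_percent < 10:
        return "excellent"
    if rrmse_percent < 20:
        return "good"
    if rrmse_percent < 30:
        return "fair"
    return "poor"


# rRMSE values quoted in the text for OLS, SVM, and GBR with ALL features.
for name, value in [("OLS_ALL", 17.70), ("SVM_ALL", 13.91), ("GBR_ALL", 11.65)]:
    print(name, classify_rrmse(value))
```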

Table 6. Statistical metric results for transportation-based CO₂ emissions prediction for Tier 1 countries (the top 5 CO₂ emissions-producing countries), using ML methods.

In sum, each ML algorithm presented very good results for predicting transportation-based CO₂ emissions, with the GBR algorithm providing the best results of the three. The ALL feature set was generally the best choice, with the TRAN feature set a close second.

Features for Prediction Analysis

Figure 4 shows the feature importance of our best model, GBR_ALL, for the top 30 CO₂ emissions-producing countries. The feature importance is the Gini impurity-based feature importance of the GBR method: the higher it is, the more important the feature is. The importance of a feature is calculated as the (normalized) total reduction of the criterion brought by the feature. That is, the importance values of all features sum to 1.
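The normalization step mentioned above can be illustrated with a few lines of Python. This sketch covers only the final normalization, not how a boosting library accumulates the criterion reductions across trees; the function name, feature names, and numbers are hypothetical and are not the study's results.

```python
from typing import Dict


def normalize_importances(total_reduction: Dict[str, float]) -> Dict[str, float]:
    """Scale per-feature total criterion reductions so that they sum to 1."""
    total = sum(total_reduction.values())
    if total <= 0:
        raise ValueError("total criterion reduction must be positive")
    return {name: value / total for name, value in total_reduction.items()}


# Hypothetical raw reductions for three made-up features.
raw = {"feature_a": 6.0, "feature_b": 3.0, "feature_c": 1.0}
importances = normalize_importances(raw)
for name, value in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.2f}")
```

Because the values are shares of a total, they indicate relative importance within one fitted model and are not directly comparable across models trained on different feature sets.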

Figure 4. Importance analysis of the GBR-ALL model for the top 30 CO₂ emissions-producing countries. Passenger air transportation (TRAN) is the most important feature, followed by railroad total transportation (TRAN) and air registered carrier transportation (TRAN), as well as other TRAN features.

According to Figure 4, the most important feature for predicting transportation-based CO₂ emissions is air transportation, followed by the railroad and vehicle transportation factors. Socioeconomic factors (GDP and population) are also important for the model. Interestingly, among the indicators for all 30 countries, CHN is another important feature in the model, which suggests that China has its own transportation CO₂ emissions pattern. Among all TRAN features, air transportation features are more important than railroad features, and railroad features are more important than vehicle-in-use features.

3.2. Predicting Transportation-Based CO₂ Emissions for Tier 1 Countries

Table 6 shows the statistical metric results for transportation-based CO₂ emissions prediction for Tier 1 countries (the top five CO₂ emissions-producing countries) using ML algorithms. The overall model performance in terms of $R^{2}$ ranged between 0.8481 and 0.9950, with the best being GBR_TRAN. According to the rRMSE metric, the best model was SVM_SoEco with a value of 0.0738, which was classified as excellent. According to the MAPE classification, GBR_SoEco was the best model, with a value of 0.0551, which was classified as having high prediction accuracy. For Tier 1 countries, the SoEco feature set performed better than the other two, according to the MAPE.

Figure 5 shows the feature importance of the GBR_ALL model for the top five CO₂ emissions-producing countries. In this category, the U.S. was the most important feature for the model, followed by the SoEco feature of GDP, with vehicle-in-use third. Almost all TRAN features were important for the model. Vehicle-in-use features were more important than air-related features, and air-related features were more important than rail-related features.

Figure 5. Feature importance analysis of the GBR-ALL model for Tier 1, the top 5 CO₂ emissions-producing countries. USA (SoEco) is the most important feature, followed by GDP (SoEco) and passenger vehicle-in-use (TRAN), as well as other TRAN features.

3.3. Predicting Transportation-Based CO₂ Emissions for Tier 2 Countries

Table 7 shows the statistical metric results for transportation-based CO₂ emissions prediction for Tier 2 countries (25 countries) using ML algorithms. The overall model performance in terms of $R^{2}$ ranged between 0.7917 and 0.9780. The best model was SVM_ALL, with an $R^{2}$ value of 0.9780 and an rRMSE value of 0.1465, which was classified as good. According to the MAPE classification, GBR_TRAN was a good model with a value of 0.1277. For Tier 2 countries, the TRAN feature set performed better than the SoEco feature set, according to MAPE, although combining both types of features using the SVM method performed best most of the time.

Table 7. Statistical metric results for transportation-based CO₂ emissions prediction for Tier 2 countries, using ML methods.

Figure 6 shows the feature importance of the GBR-ALL model for Tier 2 CO₂ emissions-producing countries. In this category, the SoEco feature of GDP was the most important feature, followed by vehicle-in-use for passengers, and then by another SoEco feature, population. In this category, not all TRAN features remained important. Air-related features were still important, but rail-related features were not. Other SoEco features—GDP added value in industry, service, and agriculture—were also important.

Figure 6. Feature importance analysis of the GBR-ALL model for Tier 2 CO₂ emissions-producing countries. GDP (SoEco) is the most important feature, followed by passenger vehicle-in-use (TRAN) and population (SoEco).

4. Discussion

Although most of the top five countries by total CO₂ emissions are also top producers of transportation-related CO₂ emissions, the ranking of countries varies. India is one of the top five total CO₂ emissions-producing countries, but not in the transportation sector. China has the highest overall CO₂ emissions, but the U.S. has the highest transportation-related CO₂ emissions.

According to the Pearson correlation coefficient, both TRAN features (air passengers carried, total railway length, vehicles in use) and SoEco features (country GDP, country total population, GDP value added by agriculture, industry, and service) are strongly correlated with transportation-related CO₂ emissions. The U.S. and China are specifically recognized as important factors, which means that these countries have their own special trends. However, the year and most of the individual countries are not recognized as important factors.
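For reference, the Pearson correlation coefficient used in this analysis can be computed as in the sketch below. The function name `pearson` and the toy inputs are illustrative assumptions; the example pairs a 0/1 country indicator with a made-up target column to mirror how categorical variables are correlated with emissions, and it does not reproduce any value from Table 4.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy data: a 0/1 indicator for one hypothetical country and a made-up target.
is_country = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
emissions = [9.0, 10.0, 1.0, 2.0, 1.5, 2.5]
print(round(pearson(is_country, emissions), 3))
```

A coefficient near 1 for a country indicator means that rows belonging to that country have much higher target values than the rest, which is consistent with the text's reading that such a country follows its own emissions pattern.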

Considering all metrics, the GBR and SVM models had better performance in transportation-based CO₂ emissions prediction, while the OLS model generally provided the worst results of the three methods. Nevertheless, the results demonstrated that each ML algorithm presents very satisfactory results in predicting CO₂ emissions.

Based on the widely used rRMSE classification found in the literature, all models except OLS_TRAN and OLS_SoEco were categorized as good in predicting CO₂ emissions for all 30 countries. The SVM_ALL and SVM_SoEco models were both categorized as excellent in predicting the CO₂ emissions of the top five countries, but only as good for Tier 2 countries. We believe that other important features that determine Tier 2 countries’ CO₂ emissions from transportation are not yet included in our model.

According to the MAPE classification in the literature, all GBR models showed good prediction accuracy for all 30 countries, high prediction accuracy for the top five CO₂ emissions-producing countries, and good prediction accuracy for Tier 2 countries (except for the GBR_ALL model). This result reinforces the finding that other important features that determine Tier 2 countries’ CO₂ emissions from transportation are not yet included in our model. Such features should be explored to further improve predictions for Tier 2 countries.

According to the prediction feature analysis of GBR models, we found that GDP is always one of the most important features for predicting CO₂ emissions in any type of country. TRAN features were the most important features for transportation CO₂ emissions prediction for all 30 countries and for the top five emissions-producing countries. However, in the Tier 2 emissions-producing countries, SoEco features, including population and GDP added value in industry, service, and agriculture, were also very important for prediction. Given that our models’ predictions for Tier 2 countries were less effective than those for Tier 1 countries, we see the need to explore other SoEco features to improve Tier 2 transportation-based CO₂ prediction.

5. Conclusions

This paper aimed to predict transportation-based CO₂ emissions using three ML methods (OLS, SVM, and GBR). We used three types of features: TRAN (transportation-related features only), SoEco (socioeconomic features only), and ALL (a combination of the TRAN and SoEco features). Four statistical metrics were used to assess the performance of the algorithms. Three groups of countries were targeted: all top 30 CO₂ emissions-producing countries (accounting for 96% of global CO₂ emissions), Tier 1 countries (the top five CO₂ emissions-producing countries, accounting for 61% of global CO₂ emissions), and Tier 2 countries (the next 25 CO₂ emissions-producing countries, accounting for 35% of global CO₂ emissions).

In sum, all three ML algorithms can predict countries’ CO₂ emissions arising from the transportation sector. Of these methods, GBR performs better than OLS and SVM. TRAN features are the most important features for transportation CO₂ emissions prediction. However, SoEco features, such as GDP, also affect the top five emissions-producing countries, while TRAN features are the most influential factors in the CO₂ emissions prediction for Tier 2 countries. Our prediction approach and the identified influential factors may aid near-future attempts by decision-makers to reduce the growth rate of transportation-related CO₂ emissions.

Author Contributions

Conceptualization, A.R. and Q.L.; methodology, A.R., X.L., and Q.L.; software, A.R., X.L., and Q.L.; validation, A.R. and X.L.; formal analysis, A.R., X.L., and Q.L.; investigation, A.R., X.L., and Q.L.; data curation, A.R., X.L., and Q.L.; writing—original draft preparation, A.R., X.L., and Q.L; writing—review and editing, A.R., X.L., and Q.L.; supervision, Q.L.; project administration, Q.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Anhui Philosophy and Social Science Planning Project, grant number AHSKY2017D24.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. United Nations. Climate Change, ‘Biggest Threat Modern Humans Have Ever Faced’, World-Renowned Naturalist Tells Security Council, Calls for Greater Global Cooperation. 2021. Available online: https://www.un.org/press/en/2021/sc14445.doc.htm (accessed on 10 January 2022).
2. IPCC 2001. Climate Change 2001 Synthesis Report: Mitigation. In Contribution of Working Group III to the Third Assessment Report of the Intergovernmental Panel on Climate Change, 2001; Watson, R.T., Ed.; Cambridge University Press: Cambridge, UK, 2001.
3. Pachauri, R.K.; Meyer, L.A. (Eds.) IPCC 2014 Climate Change 2014: Synthesis Report. In Contribution of Working Groups I, II and III to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change; IPCC: Geneva, Switzerland, 2014; p. 151.
4. Nallapaneni, M.K.; Dash, A. Internet of things: An opportunity for transportation and logistics. In Proceedings of the International Conference on Inventive Computing and Informatics, ICICI, Coimbatore, India, 23–24 November 2017; pp. 194–197.
5. Onat, N.C.; Kucukvar, M.; Tatari, O. Towards life cycle sustainability assessment of alternative passenger vehicles. Sustainability 2014, 6, 9305–9342.
6. Garcia-Lopez, M.-L.; Pasidis, I.; Viladecans-Marsal, E. Express delivery to the suburbs: The effects of transportation in Europe’s heterogeneous cities. SSRN Electron. J. 2015.
7. Danish, T.H.; Baloch, M.A.; Suad, S. Modeling the impact of transport energy consumption on CO₂ emission in Pakistan: Evidence from ARDL approach. Environ. Sci. Pollut. Res. 2018, 25, 9461–9473.
8. Tarhan, C.; Çil, M.A. A study on hydrogen, the clean energy of the future: Hydrogen storage methods. J. Energy Storage 2021, 40, 102676.
9. Giannakis, E.; Serghides, D.; Dimitriou, S.; Zittis, G. Land transport CO₂ emissions and climate change: Evidence from Cyprus. Int. J. Sustain. Energy 2020, 39, 634–647.
10. U.S. Energy Information Administration. International Energy Outlook 2016; U.S. Energy Information Administration: Washington, DC, USA, 2016. Available online: https://www.eia.gov/outlooks/ieo/pdf/transportation.pdf (accessed on 2 January 2022).
11. Sajida, M.J.; Cao, Q.; Kanga, W. Transport sector carbon linkages of EU’s top seven emitters. Transp. Policy 2019, 80, 24–38.
12. Gonzalez, R.M.; Marrero, G.; Rodriguez-Lopes, J.; Marrero, A. Analyzing CO₂ emissions from passenger cars in Europe: A dynamic panel data approach. Energy Policy 2019, 129, 1271–1281.
13. United Nations Treaty Collection. Paris Agreement; United Nations Treaty Collection: New York, NY, USA, 2015; Available online: https://unfccc.int/process-and-meetings/the-paris-agreement/the-paris-agreement (accessed on 2 January 2022).
14. Lean, H.H.; Huang, W.; Hong, J. Logistics and economic development: Experience from China. Transp. Policy 2014, 32, 96–104.
15. Yaacob, N.F.F.; Mat Yazid, M.R.; Abdul Maulud, K.N.; Ahmad Basri, N.E. A Review of the Measurement Method, Analysis and Implementation Policy of Carbon Dioxide Emission from Transportation. Sustainability 2020, 12, 5873.
16. Song, W.; Ren, A.; Li, X.; Li, Q. The Orchestrating Role of Carbon Subsidies in a Capital-Constrained Supply Chain. Math. Probl. Eng. 2021, 2021, 8920624.
17. Beyzatlar, M.A.; Karacal, M.; Yetkiner, H. Granger-causality between transportation and GDP: A panel data approach. Transp. Res. 2014, 63, 43–55.
18. Pradhan, R.P.; Bagchi, T.P. Effect of transportation infrastructure on economic growth in India: The VECM approach. Res. Transp. Econ. 2013, 38, 139–148.
19. Kustepeli, Y.; Gulcan, Y.; Akgungor, S. Transportation infrastructure investment, growth and international trade in Turkey. Appl. Econ. 2012, 44, 2619–2629.
20. Yu, N.; De Jong, M.; Storm, S.; Mi, J. Transport infrastructure, spatial clusters and regional economic growth in China. Transp. Rev. 2012, 32, 3–28.
21. Liddle, B.; Lung, S. The long-run causal relationship between transport energy consumption and GDP: Evidence from heterogeneous panel methods robust to cross-sectional dependence. Econ. Lett. 2013, 121, 524–527.
22. Lean, C.S. Empirical tests to discern linkages between construction and other economic sectors in Singapore. Constr. Manag. Econ. 2001, 19, 355–363.
23. Eruygur, A.; Kaynak, M.; Mert, M. Transportation-communication capital and economic growth: A VECM analysis for Turkey. Eur. Plan. Stud. 2012, 20, 341–363.
24. Lakshmanan, T.R.; Han, X. Factors underlying transportation CO₂ emissions in the U.S.A.: A decomposition analysis. Transp. Res. Transp. Environ. 1997, 2, 1–15.
25. Scholl, L.; Schipper, L.; Kiang, N. CO₂ emissions from passenger transport: A comparison of international trends from 1973 to 1992. Energy Policy 1996, 24, 17–30.
26. Lu, I.J.; Lin, S.J.; Lewis, C. Decomposition and decoupling effects of carbon dioxide emission from highway transportation in Taiwan, Germany, Japan and South Korea. Energy Policy 2007, 35, 3226–3235.
27. Timilsina, G.R.; Shrestha, A. Transport sector CO₂ emissions growth in Asia: Underlying factors and policy options. Energy Policy 2009, 37, 4523–4539.
28. Zhu, X.; Li, R. An Analysis of Decoupling and Influencing Factors of Carbon Emissions from the Transportation Sector in the Beijing-Tianjin-Hebei Area, China. Sustainability 2017, 9, 722.
29. Liang, Y.; Niu, D.; Wang, H.; Li, Y. Factors Affecting Transportation Sector CO₂ Emissions Growth in China: An LMDI Decomposition Analysis. Sustainability 2017, 9, 1730.
30. Kim, S. Decomposition Analysis of Greenhouse Gas Emissions in Korea’s Transportation Sector. Sustainability 2019, 11, 1986.
31. Yuan, Y.; Wang, Y.; Chi, Y.; Jin, F. Identification of Key Carbon Emission Sectors and Analysis of Emission Effects in China. Sustainability 2020, 12, 8673.
32. Hassouna, F.; Al-Sahili, K. Environmental impact assessment of the transportation sector and hybrid vehicle implications in Palestine. Sustainability 2020, 12, 7878.
33. Lotfalipour, M.; Falahi, M.; Bastam, M. Prediction of CO₂ emissions in Iran using Grey and ARIMA models. Int. J. Energy Econ. Policy 2013, 3, 229–237.
34. Chigora, F.; Thabani, N.; Mutambara, E. Forecasting CO₂ emission for Zimbabwe’s tourism destination vibrancy: A univariate approach using box-Jenkins ARIMA model. Afr. J. Hosp. Tour. Leis. 2019, 8.
35. Ayvaz, B.; Kusakci, A.O.; Temur, G.T. Energy-related CO₂ emission forecast for Turkey and Europe and Eurasia. Grey Syst. Theory Appl. 2017, 7, 436–452.
36. Ofosu-Adarkwa, J.; Xie, N.; Javed, S.A. Forecasting CO₂ emissions of China’s cement industry using a hybrid Verhulst-GM (1, N) model and emissions’ technical conversion. Renew. Sustain. Energy Rev. 2020, 130, 109945.
37. Yang, H.; O’Connell, J.F. Short-term carbon emissions forecast for aviation industry in Shanghai. J. Clean. Prod. 2020, 275, 122734.
38. Ang, C.; Morad, N.; Ismail, M.; Ismail, N. Projection of carbon dioxide emissions by energy consumption and transportation in Malaysia: A time series approach. J. Energy Technol. Policy 2013, 3, 63–75.
39. Rezaei, M.H.; Sadeghzadeh, M.; Alhuyi Nazari, M.; Ahmadi, M.H.; Astaraei, F.R. Applying GMDH artificial neural network in modeling CO₂ emissions in four nordic countries. Int. J. Low Carbon Technol. 2018, 13, 266–271.
40. Tsay, Y.-S.; Yeh, C.-Y.; Chen, Y.-H.; Lu, M.-C.; Lin, Y.-C. A Machine Learning-Based Prediction Model of LCCO₂ for Building Envelope Renovation in Taiwan. Sustainability 2021, 13, 8209.
41. Zeng, H.; Shao, B.; Bian, G.; Dai, H.; Zhou, F. Analysis of Influencing Factors and Trend Forecast of CO₂ Emission in Chengdu-Chongqing Urban Agglomeration. Sustainability 2022, 14, 1167.
42. Hui, M. 2020 World Development Indicators from World Bank Open Data; Kaggle: San Francisco, CA, USA, 2021; Available online: https://www.kaggle.com/manchunhui/world-development-indicators (accessed on 2 January 2022).
43. United Nations. Country Classification. 2014. Available online: https://www.un.org/en/development/desa/policy/wesp/wesp_current/2014wesp_country_classification.pdf (accessed on 6 January 2022).
44. Ritchie, H. Cars, Planes, Trains: Where Do CO₂ Emissions from Transport Come from? Ourworldindata. 2020. Available online: https://ourworldindata.org/co2-emissions-from-transport (accessed on 2 January 2022).
45. Bakay, M.S.; Agbulut, Ü. Electricity production-based forecasting of greenhouse gas emissions in Turkey with deep learning, support vector machine and artificial neural network algorithms. J. Clean. Prod. 2021, 285, 125324.
46. Hidecker, M.J.C.; Ho, N.T.; Dodge, N.; Hurvitz, E.A.; Slaughter, J.; Workinger, M.S.; Paneth, N. Inter-relationships of functional status in cerebral palsy: Analyzing gross motor function, manual ability, and communication function classification systems in children. Dev. Med. Child. Neurol. 2012, 54, 737–742.
47. Stock, J.H.; Watson, M.W. Introduction to Econometrics; Addison Wesley: Boston, MA, USA, 2003.
48. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
49. Harris, D.; Burges, C.J.C.; Kaufman, L.; Smola, A.J.; Vapnik, V. Support Vector Regression Machines, in Advances in Neural Information Processing Systems 9. NIPS 1997, 779–784.
50. Joachims, T. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of the 1999 International Conference on Machine Learning (ICML 1999); Universität Dortmund: Dortmund, Germany, 1999; pp. 200–209.
51. Ben-Hur, A.; Horn, D.; Siegelmann, H.; Vapnik, V. Support vector clustering. J. Mach. Learn Res. 2001, 2, 125–137.
52. Hsu, C.-W.; Lin, C.-J. A Comparison of Methods for Multiclass Support Vector Machines. IEEE Trans. Neural Netw. Learn. Syst. 2002, 13, 415–425.
53. Polson, N.G.; Scott, S.L. Data Augmentation for Support Vector Machines. Bayesian Anal. 2011, 6, 1–23.
54. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232.
55. Cai, J.; Xu, K.; Zhu, Y.; Hu, F.; Li, L. Prediction and analysis of net ecosystem carbon exchange based on gradient boosting regression and random forest. Appl. Energy 2020, 262, 114566.
56. Hu, W.; Shao, M.; Reichardt, K. Using a new criterion to identify sites for mean soil water storage evaluation. Soil Sci. Soc. Am. J. 2010, 74, 762–773.
57. Chen, J.L.; Li, G.S.; Wu, S.J. Assessing the potential of support vector machine for estimating daily solar radiation using sunshine duration. Energy Convers. Manag. 2013, 75, 311–318.
58. Ağbulut, Ü. Forecasting of transportation-related energy demand and CO₂ emissions in Turkey with different machine learning algorithms. Sustain. Prod. Consum. 2022, 29, 141–157.
59. Kim, S.; Kim, H. A new metric of absolute percentage error for intermittent demand forecasts. Int. J. Forecast. 2016, 32, 669–679.
60. Chakraborty, D.; Elzarka, H. Performance testing of energy models: Are we using the right statistical metrics? J. Build. Perform. Simul. 2018, 11, 433–448.
61. Li, Q.; Deleger, L.; Lingren, T.; Zhai, H.J.; Kaiser, M.; Stoutenborough, L.; Jegga, A.G.; Cohen, K.B.; Solti, I. Mining FDA drug labels for medical conditions. BMC Med. Inf. Decis. Mak. 2013, 13, 53.
62. Montaño Moreno, J.J.; Palmer Pol, A.; Sesé Abad, A.; Cajal Blasco, B. Using the R-MAPE index as a resistant measure of forecast accuracy. Psicothema 2013, 25, 500–506.
63. Tuncer, A.D.; Sözen, A.; Afshari, F.; Khanlari, A.; Şirin, C.; Gungor, A. Testing of a novel convex-type solar absorber drying chamber in dehumidification process of municipal sewage sludge. J. Clean. Prod. 2020, 272, 122862.

Figure 1. Yearly CO₂ emissions by sector. Transportation-based CO₂ emissions have tripled in the past 30 years.

Figure 2. Total CO₂ emissions vs. CO₂ emissions from TRAN by country. CO₂ emissions (in kt) are summarized from 2005–2014 data. Percentages represent each country’s contribution to global total CO₂ emissions. (a) Total CO₂ emissions; (b) CO₂ emissions from TRAN.

Figure 3. Pearson correlation matrix between the features and the target variable (transportation-based CO₂ emissions).


Table 1. Summary of Features.

Description	Indicators	Source
Socioeconomic Factors (SoEco)
GDP (current USD)	NY.GDP.MKTP.CD	WDI
Unemployment, total (% of total labor force) (modeled ILO estimate)	SL.UEM.TOTL.ZS	WDI
Agriculture, value added per worker (constant 2010 USD)	NV.AGR.EMPL.KD	WDI
Industry, value added per worker (constant 2010 USD)	NV.IND.EMPL.KD	WDI
Services, value added per worker (constant 2010 USD)	NV.SRV.EMPL.KD	WDI
Developed or developing country	Developing 0, Developed 1	UN
Fuel-exporting country	Fuel-exporting Countries	UN
High income, upper middle income, lower middle income, low income	High Income, Upper Middle Income, Lower Middle Income	UN
Population, total	SP.POP.TOTL	WDI
Transportation Factors (TRAN)
Air transportation, passengers carried	IS.AIR.PSGR	WDI
Air transportation, registered carrier departures worldwide	IS.AIR.DPRT	WDI
Air transportation, freight (million ton-km)	IS.AIR.GOOD.MT.K1	WDI
Container port traffic (TEU: 20 foot equivalent units)	IS.SHP.GOOD.TU	WDI
Rail lines (total routes in km)	IS.RRS.TOTL.KM	WDI
Commercial vehicles in use (per 1000 units)	IS.VHL.COM	OICA
Passenger cars in use (per 1000 units)	IS.VHL.PSGR	OICA
CO₂ related factors
CO₂ emissions (kt)	EN.ATM.CO2E.KT	WDI
CO₂ emissions from residential buildings and commercial and public services (% of total fuel combustion)	EN.CO2.BLDG.ZS	WDI
CO₂ emissions from electricity and heat production, total (% of total fuel combustion)	EN.CO2.ETOT.ZS	WDI
CO₂ emissions from manufacturing industries and construction (% of total fuel combustion)	EN.CO2.MANF.ZS	WDI
CO₂ emissions from other sectors, excluding residential buildings and commercial and public services (% of total fuel combustion)	EN.CO2.OTHX.ZS	WDI
CO₂ emissions from transportation (% of total fuel combustion)	EN.CO2.TRAN.ZS	WDI

Table 2. Summary of transportation-based CO₂ emissions from the top 30 CO₂ emissions-producing countries.

Tier	Country Name	Country Code	% of Total CO₂ Emissions	Avg. Yearly CO₂ Emissions from Transportation (kt) (2005–2014)	Avg. Yearly Per Capita CO₂ Emissions from Transportation (kt) (2005–2014)
Tier 1 (five countries with 61% of total CO₂ emissions from 2005–2014)	China	CHN	27.85%	656,467	0.49
	United States	USA	18.15%	1,744,667	5.68
	India	IND	5.71%	199,638	0.16
	Russian Federation	RUS	5.67%	263,744	1.84
	Japan	JPN	4.04%	227,222	1.78
Tier 2 (25 countries with 35% of total CO₂ emissions from 2005–2014)	Iran	IRN	1.88%	135,256	1.84
	Canada	CAN	1.81%	168,799	5
	Korea	KOR	1.79%	89,484	1.81
	United Kingdom	GBR	1.63%	123,077	1.97
	Saudi Arabia	SAU	1.62%	126,382	4.63
	Mexico	MEX	1.61%	162,303	1.43
	South Africa	ZAF	1.57%	56,821	1.11
	Brazil	BRA	1.40%	189,664	0.97
	Indonesia	IDN	1.40%	109,557	0.45
	Italy	ITA	1.37%	116,602	1.97
	Australia	AUS	1.26%	85,210	3.9
	France	FRA	1.17%	131,503	2.03
	Poland	POL	1.03%	44,409	1.17
	Turkey	TUR	0.99%	52,053	0.72
	Spain	ESP	0.99%	102,749	2.25
	Ukraine	UKR	0.98%	35,142	0.76
	Thailand	THA	0.84%	65,667	0.98
	Kazakhstan	KAZ	0.77%	13,835	0.85
	Malaysia	MYS	0.70%	52,240	1.87
	Egypt, Arab Rep.	EGY	0.68%	48,031	0.58
	Argentina	ARG	0.62%	47,167	1.16
	Venezuela, RB	VEN	0.60%	53,289	1.89
	Netherlands	NLD	0.59%	36,837	2.22
	United Arab Emirates	ARE	0.57%	32,214	4.29
	Pakistan	PAK	0.54%	43,081	0.24

Table 3. Descriptive statistics of the dataset.

Feature Type	Features	Count	Mean	Std	Min	50%	Max
SoEco	Year	300	2010	3	2005	2010	2014
SoEco	SP.POP.TOTL (10^8)	300	1.60	3.08	0.46	0.61	13.64
TRAN	IS.AIR.DPRT	296	758,007	1,743,070	17,302	315,383	10,095,200
TRAN	IS.AIR.GOOD.MT.K1	300	4023	7393	1	1386	40,618
TRAN	IS.SHP.GCNW.XQ	290	50	25	8	41	135
SoEco	NY.GDP.MKTP.CD (10^8)	300	17,193.7	29,169.1	571.24	8427.63	175,217.5
SoEco	NV.IND.EMPL.KD	298	53,633	42,761	2678	32,233	202,808
SoEco	NV.AGR.EMPL.KD	298	28,539	40,609	1029	12,938	305,042
SoEco	NV.SRV.EMPL.KD	298	39,589	32,555	4156	23,095	104,388
SoEco	SL.UEM.TOTL.ZS	300	7	5	0	6	29
TRAN	IS.AIR.PSGR	296	67,599,190	133,450,700	1,160,286	33,191,170	762,710,000
TRAN	IS.RRS.TOTL.KM	224	27,536	40,124	58	15,026	194,431
TRAN	IS.SHP.GOOD.TU	288	12,972,780	25,000,180	516,698	6,586,637	186,679,100
TRAN	IS.VHL.COM	300	8074	21,318	69	3388	137,043
TRAN	IS.VHL.PSGR	300	20,227	26,123	796	11,067	135,882
Target Variable	EN.CO2E.TRAN.KT	300	173,770	316,713	10,794	89,566	1,838,933
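
The summary columns of Table 3 (Count, Mean, Std, Min, 50%, Max) can be computed for any single feature with a few lines of code. The sketch below is an illustration only. It assumes that missing observations are skipped, which would explain counts below 300, and that Std is the sample standard deviation; neither choice is confirmed by the text. The function name `describe` and the input values are made up for the example.

```python
from statistics import mean, median, stdev


def describe(values):
    """Summarise one feature column, ignoring missing entries (None)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": mean(present),
        "std": stdev(present),    # sample standard deviation (n - 1), an assumption
        "min": min(present),
        "50%": median(present),   # the median
        "max": max(present),
    }


# Made-up toy column with one missing observation; NOT data from the study.
toy_feature = [1.0, None, 3.0, 5.0]
summary = describe(toy_feature)
```

Applied to each feature in turn, a helper of this kind yields one table row per feature, with the count showing how many country-year observations are available.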

Table 4. Pearson correlation coefficient between the categorical variables and transportation-based CO₂ emissions.

Country	Correlation Coefficient	Country	Correlation Coefficient	Country	Correlation Coefficient	Features	Correlation Coefficient
USA	0.92	GBR	−0.03	SAU	−0.03	Tier 1	0.63
CHN	0.28	ITA	−0.03	TUR	−0.07	Tier 2	−0.15
RUS	0.05	IDN	−0.04	EGY	−0.07
JPN	0.03	ESP	−0.04	ARG	−0.07	Developing 0, Developed 1	0.23
IND	0.02	KOR	−0.05	POL	−0.08	Fuel-exporting Countries 1	0.14
BRA	0.01	AUS	−0.05	PAK	−0.08	High Income	0.18
CAN	0.00	THA	−0.06	NLD	−0.08	Upper Middle Income	−0.09
MEX	−0.01	ZAF	−0.07	UKR	−0.08	Lower Middle Income	−0.12
IRN	−0.02	VEN	−0.07	ARE	−0.08
FRA	−0.02	MYS	−0.07	KAZ	−0.09
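
One common way to obtain a Pearson coefficient for a categorical variable, as in Table 4, is to encode each category as a 0/1 indicator and correlate that indicator with the target. The sketch below assumes this encoding; the helper names `pearson_r` and `one_hot` and the toy rows are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def one_hot(labels, category):
    """Return 1.0 where the label equals `category`, else 0.0."""
    return [1.0 if label == category else 0.0 for label in labels]


# Made-up toy rows (country label, emissions); NOT values from the study.
countries = ["A", "A", "B", "B", "C", "C"]
emissions = [10.0, 12.0, 3.0, 4.0, 1.0, 2.0]

# One coefficient per country indicator, mirroring the layout of Table 4.
r_by_country = {
    c: pearson_r(one_hot(countries, c), emissions)
    for c in sorted(set(countries))
}
```

Under this encoding, a coefficient close to 1 means that the rows of one category carry much larger emissions than the rest, while small negative values are typical of categories whose emissions sit below the overall mean.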

Table 5. Statistical metric evaluation of transportation-based CO₂ emissions prediction for the top 30 CO₂ emissions-producing countries using ML algorithms. Colors indicate the level of performance.

	MAE	RMAE	rRMSE	MAPE	$R^{2}$
OLS_ALL	0.0111	0.0161	0.1770	0.3132	0.9866
SVM_ALL	0.0069	0.0127	0.1391	0.2144	0.9927
GBR_ALL	0.0061	0.0112	0.1165	0.1408	0.9943
OLS_TRAN	0.0150	0.0205	0.2312	0.4715	0.9800
SVM_TRAN	0.0092	0.0150	0.1687	0.2791	0.9896
GBR_TRAN	0.0067	0.0122	0.1163	0.1355	0.9930
OLS_SoEco	0.0188	0.0280	0.3207	0.5405	0.9577
SVM_SoEco	0.0099	0.0159	0.1997	0.3838	0.9880
GBR_SoEco	0.0091	0.0179	0.1728	0.1491	0.9789
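
The scores in these tables are averages over four-fold cross-validation. As a minimal sketch of that procedure, the code below computes MAE, RMSE, rRMSE, MAPE, and R² on each held-out fold and averages them. Several points are assumptions rather than the study's setup: the metrics use their standard textbook definitions, rRMSE is taken to be RMSE divided by the mean observed value, folds are contiguous and unshuffled, and a one-feature least-squares line stands in for the OLS, SVM, and GBR models. All function names and the toy data are hypothetical.

```python
from math import sqrt


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rrmse(y_true, y_pred):
    # Assumption: RMSE relative to the mean observed value.
    return rmse(y_true, y_pred) / (sum(y_true) / len(y_true))


def mape(y_true, y_pred):
    # Returned as a fraction, not a percentage; targets must be non-zero.
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


METRICS = {"MAE": mae, "RMSE": rmse, "rRMSE": rrmse, "MAPE": mape, "R2": r2}


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(x, y):
    """One-feature least-squares fit; returns (slope, intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


def cross_validate(x, y, k=4):
    """Average each metric over k held-out folds."""
    totals = {name: 0.0 for name in METRICS}
    for test_idx in k_fold_indices(len(x), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train_idx],
                                    [y[i] for i in train_idx])
        y_true = [y[i] for i in test_idx]
        y_pred = [slope * x[i] + intercept for i in test_idx]
        for name, metric in METRICS.items():
            totals[name] += metric(y_true, y_pred)
    return {name: total / k for name, total in totals.items()}


# Made-up toy data: a noisy straight line; NOT data from the study.
x = [float(i) for i in range(1, 13)]
y = [2.0 * xi + 1.0 + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
scores = cross_validate(x, y, k=4)
```

Averaging per-fold scores, as done here, is one of two common conventions; the alternative is to pool all held-out predictions and compute each metric once. The two can give different values, particularly for ratio metrics such as rRMSE and MAPE.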

Table 6. Statistical metric results for transportation-based CO₂ emissions prediction for Tier 1 countries (the top 5 CO₂ emissions-producing countries), using ML methods.

	MAE	RMAE	rRMSE	MAPE	$R^{2}$
OLS_ALL	0.0037	0.0050	0.1007	0.1282	0.9783
SVM_ALL	0.0027	0.0046	0.0878	0.0787	0.9805
GBR_ALL	0.0019	0.0025	0.1818	0.0628	0.9948
OLS_TRAN	0.0027	0.0046	0.0885	0.0814	0.9813
SVM_TRAN	0.0037	0.0049	0.1015	0.1127	0.9782
GBR_TRAN	0.0018	0.0024	0.1980	0.0538	0.9950
OLS_SoEco	0.0098	0.0130	0.2615	0.1086	0.8481
SVM_SoEco	0.0026	0.0039	0.0738	0.0876	0.9877
GBR_SoEco	0.0037	0.0064	0.2613	0.0551	0.9244

Table 7. Statistical metric results for transportation-based CO₂ emissions prediction for Tier 2 countries, using ML methods.

	MAE	RMAE	rRMSE	MAPE	$R^{2}$
OLS_ALL	0.0104	0.0145	0.2217	0.3178	0.9460
SVM_ALL	0.0050	0.0097	0.1465	0.1289	0.9780
GBR_ALL	0.0058	0.0103	0.2959	0.4957	0.9700
OLS_TRAN	0.0137	0.0182	0.2778	0.4482	0.9124
SVM_TRAN	0.0070	0.0132	0.2021	0.2098	0.9566
GBR_TRAN	0.0060	0.0100	0.3015	0.1277	0.9738
OLS_SoEco	0.0192	0.0271	0.4164	0.4947	0.7917
SVM_SoEco	0.0086	0.0125	0.1974	0.2444	0.9527
GBR_SoEco	0.0064	0.0110	0.3872	0.1558	0.9614

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
