Article

Exploring Patterns of Transportation-Related CO2 Emissions Using Machine Learning Methods

1
School of Economics and Management, Anhui Polytechnic University, Wuhu 241000, China
2
School of Business, State University of New York at New Paltz, New Paltz, NY 12561, USA
*
Author to whom correspondence should be addressed.
Sustainability 2022, 14(8), 4588; https://doi.org/10.3390/su14084588
Submission received: 20 March 2022 / Revised: 7 April 2022 / Accepted: 10 April 2022 / Published: 12 April 2022

Abstract

While the transportation sector is one of the largest drivers of economic growth in many countries, its adverse impacts on air quality are also well documented, especially in developing countries. Carbon dioxide (CO2) emissions are a direct result of a transportation sector powered by fossil fuels. Detailed knowledge of the CO2 emissions produced by the transportation sectors of various countries is essential for these countries to revise their future energy investments and policies. In this framework, three machine learning algorithms, ordinary least squares regression (OLS), support vector machine (SVM), and gradient boosting regression (GBR), are used to forecast transportation-based CO2 emissions. Both socioeconomic factors and transportation factors are included as features in the study. We study the top 30 CO2 emissions-producing countries, including the Tier 1 group (the top five countries, accounting for 61% of global CO2 emissions) and the Tier 2 group (the next 25 countries, accounting for 35% of global CO2 emissions). We evaluate our models using four-fold cross-validation and report four frequently used statistical metrics ($R^2$, MAE, rRMSE, and MAPE). Of the three machine learning algorithms, the GBR model with features combining socioeconomic and transportation factors (GBR_ALL) has the best performance, with an $R^2$ value of 0.9943, rRMSE of 0.1165, and MAPE of 0.1408. We also find that both transportation features and socioeconomic features are important for transportation-based CO2 emission prediction. Transportation features are more important when modeling all 30 countries together, while socioeconomic features (especially GDP and population) are more important when modeling Tier 1 and Tier 2 countries.

1. Introduction

Global climate change has been recognized as the biggest threat to all living beings in the sea, on land, and in the atmosphere [1]. Unprecedented challenges, such as extreme weather, loss of species, shifting rainfall patterns, melting glaciers, and rising global mean sea level, have affected the survival and growth of humanity globally [2]. According to the IPCC Fifth Assessment Report (2014), the concentration of carbon dioxide (CO2) in the atmosphere has increased by 40% since 1750, whereas the corresponding figure reported in 2001 was 31% [2,3]. As of 2010, the transportation sector accounted for 14% of global greenhouse gas emissions [3].
The transportation sector plays an essential role in humanity’s activities and affects the global economy. For example, the transportation of people and goods provides people with mobility, sustainable daily lives, local and international merchandise trade, and economic development [4,5,6,7]. However, most activities in the transportation sector are fueled by fossil-based energy sources, which are not renewable [8]. This implies that, while contributing to the global economy, the transportation sector has negative impacts on global climate change.
Considering its scale and the rapid growth of its energy consumption, the transportation sector has become the second largest CO2 emitter in the world [9]. The transportation of people and goods accounts for about 25% of total world energy consumption [10] and about 25% of greenhouse gas emissions in the European Union (EU) [11]. From 1990 to 2015, the share of CO2 emissions from the transportation sector in EU countries increased from 32% to 45% [12]. Given the continued growth in fossil-based energy usage and transportation-based CO2 emissions, building a sustainable transportation sector and reducing its CO2 emissions have become critical, especially for the 196 parties that adopted the Paris Agreement on December 12, 2015 [13]. Therefore, in this study, we focus on automatic prediction methods for transportation sector-related CO2 emissions and on the analysis of the factors behind them.
The contribution of this paper is threefold: (1) we study transportation-based CO2 emissions at the global level using data from 30 countries; (2) we study feature sets that include not only transportation-related features (air, railroad, and highway vehicles) but also socioeconomic features (population, GDP, and GDP from different sectors); (3) we find that CO2 emission patterns differ across countries: transportation-related features are important in the models for all countries, while socioeconomic features are more important in the top five CO2 emission countries (Tier 1) and the next 25 CO2 emission countries (Tier 2).

Literature Review

Existing studies have examined the impact of transportation activities on economic development. The causal relationship between logistics development and economic growth in both the short and long term was studied using a dynamic structural model [14,15,16]. The causal relationship between transportation and income was investigated using a panel dataset [17] that included the data of 15 EU countries from 1970 to 2008, and the authors found an endogenous relationship between income and transportation. The impact of roadway and railway infrastructure on India’s economic growth was studied using the vector error correction model [18] and weak short-term effects were found. By examining data from India from 1970 to 2010, these authors found unidirectional causality from railway transportation to economic growth. Another study investigating data from Turkey from 1970 to 2005 showed the impact of highway infrastructure on Turkey’s economic growth [19]. Similarly, the causal relationships between transportation infrastructure investment and economic growth were also studied in China using time series data from 1978 to 2008 [20]. However, unlike previous findings from other countries, these authors found unidirectional Granger causality from economic growth to transportation sector infrastructure development at the national level. By grouping the 107 countries in the dataset into high-income, middle-income, and low-income countries, Liddle and Lung found Granger causality runs from GDP per capita to transportation energy consumption per capita by analyzing International Energy Agency data from 1971 to 2009 using panel methods [21]. These authors also found sufficient evidence that many countries exhibited significant Granger causality running from transportation sector energy consumption to GDP. 
Although these results are not entirely consistent [22,23], the existing literature suggests causal relationships between transportation sector infrastructure development and economic growth. Therefore, we included socioeconomic features (including GDP, income level, and GDP from different sectors) in our prediction models.
Another stream of existing literature has studied the connections between transportation sector activities and the related CO2 emissions. Lakshmanan and Han suggested that the growth in people’s propensity to travel drove up U.S. transportation energy use and related CO2 emissions from 1970 to 1991. Using a decomposition scheme analysis, the authors also revealed that freight transportation played a more important role than passenger transportation in U.S. transportation energy use and CO2 emissions [24]. Similarly, Scholl et al. used a comparative analysis approach and studied the changes in CO2 emissions from passenger transportation activities in nine OECD countries [25]. By analyzing the data from 1973 to 1992, the authors observed a sharp increase in travel-related energy use and CO2 emissions from travel-related activities and discussed the impact of fuel shifts within the transportation sector on the increase of CO2 emissions. In a study conducted by Lu et al., highway vehicle activity was identified as the major driving factor that increased transportation CO2 emissions from 1990 to 2002 in Germany, Taiwan, South Korea, and Japan [26]. Similar studies of transportation sector activity in selected Asian countries or regions suggested that travel-related activity was one of the major potential factors increasing CO2 emissions [27,28,29,30,31]. These studies all suggested that the transportation sector has a direct impact on CO2 emissions and listed it as the key explanatory variable for CO2 emissions at the national level. In our study, we included transportation related features (air, railroad, and vehicle transportation) in our prediction models.
To forecast CO2 emissions, existing studies have adopted different approaches. Some have used time series analysis methods such as exponential smoothing models and ARIMA [32,33,34]. Similar studies used grey models to predict CO2 emissions in China, Iran, and Turkey [35,36]. Many other studies used time series models to predict CO2 emissions in China, the U.S., Malaysia, Iran, and Zimbabwe [37,38]. Some studies used neural network methods for CO2 emission prediction [39]. The gradient boosting decision tree (GBDT) algorithm was also used in predicting CO2 from envelope renovation projects in Taiwan [40]. The support vector machine model was also used in CO2 emission prediction in the Chengdu area [41]. All these studies used a dataset from one single nation and did not employ cross-validation using another nation’s dataset to evaluate the model. To fill this gap, this study aims to predict transportation-related CO2 emissions using socioeconomic features and transportation sector features. We deploy the support vector machine (SVM) model and the gradient boosting regression (GBR) model to compare to the baseline model, the ordinary least squares (OLS) model, in order to find the best model.
The rest of this paper is structured as follows: Section 2 provides the details of the data collection and sources used in our study and summarizes the features used in our model, the descriptive statistics of the dataset, the machine learning (ML) algorithms, and the evaluation methods; Section 3 presents and discusses the results obtained by our automatic models, including OLS, SVM, and GBR, with three types of features (transportation-related features, or TRAN; socioeconomic features, or SoEco; and a combination of both TRAN and SoEco features, or ALL) for the different tiers of CO2 emissions-producing countries; Section 4 provides further discussion; and Section 5 concludes the study.

2. Materials and Methods

This section provides detailed information about the dataset and the data sources, ML algorithms, and the model evaluation method.

2.1. Dataset

The World Development Indicators (WDI) constitute the World Bank’s primary collection of development indicators, compiled from officially recognized international sources. In this study, we used the WDI 2020 release obtained from Kaggle [42]. The WDI dataset provides access to approximately 1437 indicators for 263 countries or regions, from 1960 to 2020. The database helps users find information related to development, both current and historical. The topics covered in the WDI include poverty and inequality (poverty, prosperity, consumption, income distribution), people (population, education, labor, health, gender), the environment (agriculture, climate change, energy, biodiversity, water, sanitation), the economy (growth, economic structure, income and savings, trade, labor productivity), states and markets (business, stock markets, military, communications, transport, technology), and global links (debt, trade, aid dependency, refugees, tourism, migration). We adopted the CO2 emissions data, most of the transportation-related factors, and socioeconomic factors from the WDI dataset.
The International Organization of Motor Vehicle Manufacturers (OICA) provides a dataset of motor vehicle production statistics obtained from national trade organizations and OICA members or correspondents, including production statistics (1999–2021), sales statistics (2019–2020), and vehicles in use (2005–2015). We adopted the vehicle-in-use data from the OICA dataset.
The United Nations’ (UN’s) Department of Economic and Social Affairs conducts a yearly economic analysis. In its 2014 World Economic Situation and Prospects (WESP) report [43], countries are classified according to their degree of development, as either high income, upper middle income, lower middle income, or low income. The report also categorized countries as either fuel-exporting or not. We adopted these features from this UN dataset.
The WDI dataset does not directly provide transportation-related CO2 emissions. Instead, it provides the total CO2 emissions and the percentage of CO2 emissions from different sectors. Therefore, we calculated transportation-related CO2 emissions by multiplying the total CO2 emissions (in kt units) by the percentage of CO2 emissions from transportation.
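The calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name and the example numbers are hypothetical and are not taken from the WDI data.

```python
def transport_co2_kt(total_co2_kt, transport_share_pct):
    """Transportation CO2 (kt) = total CO2 (kt) x transportation share (% of total)."""
    if not 0.0 <= transport_share_pct <= 100.0:
        raise ValueError("share must be a percentage between 0 and 100")
    return total_co2_kt * transport_share_pct / 100.0

# Hypothetical country-year: 500,000 kt in total, 20% of it from transportation.
example_kt = transport_co2_kt(500_000.0, 20.0)
print(example_kt)
```

The share is divided by 100 because it is assumed to be stored as a percentage rather than as a fraction.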
Based on an overview of the existing literature, we chose two types of features, socioeconomic and transportation factors, as the independent variables.

2.1.1. Socioeconomic Features

With respect to economic features, we selected GDP, for which the literature shows a strong relationship between the economy and transportation [16,17,18,19,20,21,22]. We also used unemployment and the value added from different sectors (agriculture, industry, and services) to represent socioeconomic conditions. Finally, we included features indicating whether countries are developing or developed, their income level, and whether they are fuel-exporting.

2.1.2. Transportation Features

According to global transportation emissions statistics sourced from the International Energy Agency (IEA) in 2018, road travel accounts for nearly three-quarters of transportation-related emissions, including the 29.4% that comes from trucks carrying freight [44]. We chose three types of transportation features: air transportation, railway transportation, and vehicle transportation (including passenger and freight transportation).
We compared three feature combinations: transportation features only (TRAN), socioeconomic features only (SoEco), and transportation features plus socioeconomic features (ALL). Our goal was to evaluate whether inclusion of the socioeconomic features helps to improve the prediction of CO2 emissions from transportation. Table 1 shows the details (descriptions, indicator abbreviations, and sources) of the features we used in the model.

2.1.3. CO2 Emissions by Sector, Year, and Country

Figure 1 describes the overall yearly increase of CO2 emissions in our dataset, including five different sectors: electricity and heat production (ETOT), manufacturing industries and construction (MANF), transportation (TRAN), residential buildings and commercial and public services (BLDG), and other sectors (OTHX) including agriculture. As shown in Figure 1, since 1971, the CO2 emissions from ETOT have increased fivefold and have been the largest component of all CO2 emissions throughout. The two sectors with the next highest CO2 emissions are MANF and TRAN; their lines on the graph cross several times, and emissions in these sectors have tripled in the past 30 years. The two remaining sectors, BLDG and OTHX, were stable over this 30-year range. In this paper, we focused on CO2 emissions from the TRAN sector.
Figure 2 shows total CO2 emissions (in kt) and transportation-based CO2 emissions by countries, from 2005 to 2014. According to the percentage of total global CO2 emissions, we divided the countries into two tiers: those that contribute more than 4% of the global total (blue, Tier 1) and those that contribute less than 4% (grey, Tier 2). We found that the trends of CO2 emissions in total are very different from CO2 emissions from transportation. For example, although India is in Tier 1 for total CO2 emissions, it is in Tier 2 for TRAN CO2 emissions, which means that India has a relatively lower TRAN portion of CO2 emissions than other countries.
In this paper, we defined our country tiers according to their contribution to the total CO2 emissions because we proposed to study the biggest contributors to global CO2 emissions. Table 2 summarizes the countries and their associated CO2 emission tiers. The top 30 countries emitted 96% of global CO2 in 2004–2015, so we focused our study on these 30 countries.
Within these top 30 CO2 emissions-producing countries, we further defined Tier 1 countries, which are the top five countries and contribute 61% of global CO2 emissions, and Tier 2 countries, which are the other 25 countries and contribute 35% of global CO2 emissions. As seen in Table 2, which compares total CO2 emissions, some countries have higher proportions of CO2 emissions from transportation, or even higher proportions per capita than total CO2 emissions. For example, the U.S., Saudi Arabia, and the United Arab Emirates produce, on average, over 4 kt CO2 emissions per capita yearly, while China, India, and Pakistan average less than 0.5 kt per capita yearly. We assumed that the number of vehicles in use is an important factor for the first category and that the country’s population is an important factor for the second category. Therefore, we introduced the population and vehicles-in-use as factors in building our model for transportation-based CO2 emissions prediction.
The descriptive statistics of the dataset used in our study are provided in detail in Table 3. This dataset contains 300 instances (30 countries and 10 years). On average, countries emitted 173,770 kt of CO2 yearly from 2005–2014, the average yearly GDP per country was $1.72 \times 10^{12}$, and the average country population was $1.60 \times 10^{8}$.
However, we found that many values were missing from the dataset summarized in Table 3. We analyzed the missing values and detected two scenarios. First, for certain countries, the values of a feature may be missing from the data altogether. For example, the United Arab Emirates has a very limited railway network, so the IS.RRS.TOTL.KM feature is 0 for the UAE. Second, some feature values are only missing in certain years. In that scenario, we replaced the missing value with the following year’s value for that country.
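The second scenario amounts to a backward fill within each country's yearly series. The sketch below is a minimal illustration under stated assumptions: the function name and the sample values are hypothetical, and a run of several consecutive missing years is assumed to take the next available later value.

```python
from typing import List, Optional

def fill_from_following_year(series: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing value (None) with the next available later value.

    `series` holds one feature for one country, ordered by year.
    Trailing gaps have no following year and are left as None.
    """
    filled = list(series)
    next_known = None
    for idx in range(len(filled) - 1, -1, -1):  # walk from the last year to the first
        if filled[idx] is None:
            filled[idx] = next_known
        else:
            next_known = filled[idx]
    return filled

# Hypothetical yearly values with gaps.
rail_km = [None, 120.0, None, None, 150.0, None]
filled_rail_km = fill_from_following_year(rail_km)
print(filled_rail_km)  # [120.0, 120.0, 150.0, 150.0, 150.0, None]
```

A missing value in the final year has no following year to borrow from, so this sketch leaves it unfilled rather than guessing.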
To avoid biased estimations due to the magnitude of scale differences between features, we normalized the numeric features by converting each feature as follows:
$$x_i' = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$
where $x_i$ refers to the actual value of a feature and $x_{min}$ is the minimum value among all $x_i$ values of the feature in the dataset. Similarly, $x_{max}$ represents the maximum $x_i$ value for this feature in the dataset. Accordingly, $x_i'$ is the normalized $x_i$ value, and its value range is between 0 and 1. Instead of using Z-score normalization, we preferred to have feature values between 0 and 1.
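A minimal Python sketch of this min-max normalization follows. The function name and the sample values are hypothetical, and mapping a constant feature to 0 is an assumption made here to avoid division by zero.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature values.
raw = [2.0, 4.0, 6.0, 10.0]
scaled = min_max_normalize(raw)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```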
With these normalized feature values, we calculated the correlation between the features and our target variable, transportation-based CO2 emissions (i.e., EN.CO2E.TRAN.KT), as shown in the heatmap of Pearson correlations in Figure 3. We found that IS.AIR.DPRT, IS.AIR.GOOD.MT.K1, NY.GDP.MKTP.CD(10^11), IS.AIR.PSGR, IS.RRS.TOTL.KM, IS.VHL.COM, and IS.VHL.PSGR have very strong relationships with EN.CO2E.TRAN.KT. NV.IND.EMPL.KD, NV.AGR.EMPL.KD, NV.SRV.EMPL.KD, and SP.POP.TOTL have weak relationships with EN.CO2E.TRAN.KT. Year and SL.UEM.TOTL.ZS have very weak relationships with EN.CO2E.TRAN.KT. These labels follow the interpretation of the correlation coefficient used in the existing literature [45,46]:
|r| ≥ 0.8, very strong relationship;
0.6 ≤|r| < 0.8, strong relationship;
0.4 ≤|r| < 0.6, moderate relationship;
0.2 ≤|r| < 0.4, weak relationship;
|r| < 0.2, very weak relationship.
Accordingly, those inputs with a correlation coefficient over 0.2 were used to train the ML algorithms for predicting transportation-based CO2 emissions.
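The correlation bands and the 0.2 cut-off can be expressed as a short Python sketch. The feature names and values below are hypothetical, and the helper functions are illustrative rather than the code used in the study. Each feature is assumed to be non-constant, since the Pearson coefficient is undefined otherwise.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def strength(r):
    """Map |r| to the bands listed above."""
    a = abs(r)
    if a >= 0.8:
        return "very strong"
    if a >= 0.6:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.2:
        return "weak"
    return "very weak"

def select_features(features, target, threshold=0.2):
    """Keep the features whose |r| with the target exceeds the threshold."""
    return [name for name, values in features.items()
            if abs(pearson_r(values, target)) > threshold]

# Hypothetical data: feature_a tracks the target, feature_b does not.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "feature_a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "feature_b": [1.0, -1.0, 0.0, -1.0, 1.0],
}
selected = select_features(features, target)
print(selected)  # ['feature_a']
```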
We also examined the correlations between categorical values (countries, country tiers, development levels, fuel-exporting characteristics, and income levels) and transportation-based CO2 emissions, as shown in Table 4. We found that most countries have a weak correlation with transportation-based CO2 emissions, except for the U.S. and China.

2.2. Method

The data analysis showed that CO2 emissions are continuous values, so we used regression models to predict continuous values. We used three ML methods, OLS (the baseline system), SVM, and GBR, to build automated ML models for prediction.

2.2.1. Ordinary Least Squares

OLS is a type of linear regression that estimates the coefficients of a set of explanatory variables by minimizing the sum of squared differences between the observed values of the dependent variable and the predicted values.
Consider a set of data points $G = \{(X_j, y_j)\}_{j=1}^{n}$, where $X_j$ is the input vector of data point $j$ with $m$ features, $y_j$ is the desired value (CO2 emissions from the transportation sector), and $n$ is the size of the dataset. In a linear regression model, $y_j$ is a linear combination of the features in $X_j$, as shown in Equation (1),
$$y_j = \alpha + \sum_{i=1}^{m} \beta_i x_{ij} + \varepsilon_j \tag{1}$$
where $\varepsilon_j$ is the error term and $\alpha$, $\beta_i$ are the true parameters of the regression. The goal of linear regression is to find the parameters $\alpha$ and $\beta_i$ for which the sum of squared error terms is minimized. That is,
$$\underset{\alpha,\,\beta_i}{\text{minimize}} \; \sum_{j=1}^{n} \Big( y_j - \alpha - \sum_{i=1}^{m} \beta_i x_{ij} \Big)^2 \tag{2}$$
or, equivalently,
$$\underset{\alpha,\,\beta_i}{\text{minimize}} \; \sum_{j=1}^{n} \varepsilon_j^2 \tag{3}$$
This procedure is called ordinary least squares, or OLS.
OLS is the most popular approach for estimating the feature coefficients of a regression [47].
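For a single feature (m = 1), the least squares problem has a well-known closed-form solution, which the following Python sketch illustrates. The data points and function names are hypothetical. The study itself uses many features, for which a linear algebra library would normally be used instead.

```python
def fit_ols_one_feature(xs, ys):
    """Closed-form least squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def sum_squared_errors(xs, ys, alpha, beta):
    """Objective being minimized: the sum of squared error terms."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying close to the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.1, 4.9, 7.0]
alpha, beta = fit_ols_one_feature(xs, ys)
best_sse = sum_squared_errors(xs, ys, alpha, beta)
print(round(alpha, 4), round(beta, 4))  # 1.03 1.98
```

Any other choice of alpha and beta gives a sum of squared errors at least as large as `best_sse`, which is what makes these estimates the least squares solution.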

2.2.2. Support Vector Machine

The support vector machine model, or SVM, is another supervised learning method that can be applied to regression and classification problems. After it was introduced by AT&T Bell Laboratories in 1992 [48] for a binary classification problem, it drew growing interest from researchers and has been extended to various related problems, including regression [49], high-dimension classification problems [50], clustering problems [51], multiclass problems [52], and Bayesian data augmentation [53]. It was also applied to CO2 emission prediction in the Chengdu area [41].
The goal of SVM regression is to find a function $f(X)$ that deviates by at most $\varepsilon$ from the actually obtained targets $y_j$ for all the training data, while also remaining as flat as possible. In other words, we do not care about errors as long as they are less than $\varepsilon$, but we will not accept any deviation larger than $\varepsilon$. SVM regression approximates the function $f(X)$ using the following form:
$$f(X) = \sum_{i=0}^{m} \omega_i x_i \tag{4}$$
where $\omega_i$ represents the weight for the feature $x_i$, and $\omega_0$ is the bias with $x_0 = 1$. Flatness in this case means that one seeks a small $\omega$. One way to ensure this is to minimize the squared norm, i.e., $\|\omega\|^2 = \omega \cdot \omega$, where $\cdot$ denotes the dot product. Therefore, we can write the problem as a convex optimization problem:
$$\text{minimize} \; \frac{1}{2}\|\omega\|^2 \quad \text{subject to} \quad \begin{cases} y_j - \omega \cdot x_j \le \varepsilon \\ \omega \cdot x_j - y_j \le \varepsilon \end{cases} \tag{5}$$
The assumption in Equation (5) is that a function $f$ exists that keeps the difference between the real value $y_j$ and the estimated value $f(x_j)$ within $\varepsilon$ for every training point or, in other words, that the convex optimization problem is feasible. Lagrange multipliers and optimality constraints are used to solve Equation (5). One key requirement in achieving high accuracy and high performance using the SVM method is to select the proper kernel function, the regularization parameter C, and $\varepsilon$. Here, we selected these using the grid search technique. Accordingly, the best results for CO2 prediction were obtained when the kernel function type was dot (the inner-product, i.e., linear, kernel) and C was equal to 0.01.
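The role of $\varepsilon$ can be illustrated with a short Python sketch. It checks whether a given weight vector satisfies the two constraints for every training point, and it computes the ε-insensitive loss commonly used in the soft-margin form of SVM regression. The soft-margin form is not stated in the text and is an assumption here. The weights and data are hypothetical.

```python
def predict(weights, x):
    """f(X) = sum_i w_i * x_i, with weights[0] as the bias (x_0 = 1)."""
    return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))

def epsilon_insensitive_loss(y, y_hat, epsilon):
    """Zero inside the epsilon tube, linear outside it (soft-margin assumption)."""
    return max(0.0, abs(y - y_hat) - epsilon)

def is_feasible(weights, data, epsilon):
    """True if every point satisfies |y_j - f(x_j)| <= epsilon."""
    return all(abs(y - predict(weights, x)) <= epsilon for x, y in data)

# Hypothetical one-feature model f(x) = 0.5 + 2x and three training points.
weights = [0.5, 2.0]
data = [([1.0], 2.6), ([2.0], 4.4), ([3.0], 6.5)]

wide_tube_ok = is_feasible(weights, data, epsilon=0.2)     # deviations are about 0.1
narrow_tube_ok = is_feasible(weights, data, epsilon=0.05)  # tube too narrow
print(wide_tube_ok, narrow_tube_ok)  # True False
```

When no weight vector is feasible for the chosen $\varepsilon$, practical implementations penalize the excess deviation instead of forbidding it, with C controlling the strength of that penalty.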

2.2.3. Gradient Boosting Regression

Unlike many ML models, which focus on high quality prediction generated by a single model, boosting algorithms seek to improve prediction power by training a sequence of weak models, each compensating for the weaknesses of its predecessors. One such approach, gradient boosting regression (GBR), uses an ensemble of weak prediction models, such as decision trees, to make predictions [54]. GBR models can be used in both regression and classification tasks. Tsay et al. used the GBR method to model CO2 emissions [40].
The goal of GBR is to find the best predicted values $\hat{y} = F(X)$ by minimizing a loss function, the mean squared error between the actual $y$ and $\hat{y}$, as in Equation (6):
$$\text{minimize} \; \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{6}$$
where $i$ is the index over a training set of size $n$.
The gradient boosting regression tree builds the model in a stage-wise fashion and updates the model by minimizing the expected value of a given loss function. Generic gradient boosting at the $m$th step fits a decision tree $h_m(x)$ to the pseudo-residuals. Let $J_m$ be the number of its leaves. The tree partitions the input space into $J_m$ disjoint regions $R_{1m}, R_{2m}, \ldots, R_{J_m m}$ and predicts a constant value in each region. Using the indicator notation, the output of $h_m(x)$ for input $x$ can be written as the following sum:
$$h_m(x) = \sum_{j=1}^{J_m} b_{jm} I_{R_{jm}}(x)$$
where $b_{jm}$ is the value predicted in region $R_{jm}$, and $I_{R_{jm}}(x) = 1$ if $x \in R_{jm}$ and $0$ otherwise.
When a regression tree is used as $h_m(x)$ in the generic gradient boosting method, the model update at each stage combines the tree's output with a gradient descent step size.
The shrinkage parameter $b$ (distinct from the leaf values $b_{jm}$) is also referred to as the learning rate and controls the contribution of each base model by shrinking it by a factor of $0 < b \le 1$. There is a tradeoff between the number of iterations and the learning rate. With the same number of iterations, a larger learning rate tends to lead to a larger error. The more iterations occur, the better the performance becomes. Therefore, we preferred a small $b$. Based on our experience, we chose a learning rate of 0.1.
Another parameter, tree complexity, also influences model performance. The algorithm restricts all trees to the same size, which is the number of features divided by the number of tree models. The size of the trees thus reflects the maximum depth of variable interactions. In our experiment, we set the maximum tree depth to 3 and placed no limit on the maximum number of features or the maximum number of leaf nodes.
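The stage-wise procedure can be sketched in plain Python. This is a simplified illustration, not the implementation used in the study: it assumes a single feature, squared-error loss (so the pseudo-residuals are simply the current residuals), and two-leaf trees (stumps) in place of depth-3 trees. All names and data are hypothetical. The learning rate of 0.1 matches the value chosen above.

```python
def fit_stump(xs, residuals):
    """Fit a two-leaf regression tree (one split) by least squares."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        split = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_value = sum(left) / len(left)      # constant predicted in the left region
        right_value = sum(right) / len(right)   # constant predicted in the right region
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_value, right_value)
    return best[1:]

def stump_predict(stump, x):
    split, left_value, right_value = stump
    return left_value if x <= split else right_value

def fit_gbr(xs, ys, n_stages, learning_rate=0.1):
    """Stage-wise boosting: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)                    # initial constant model
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]   # pseudo-residuals
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def gbr_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - gbr_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical one-feature data with a jump between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
baseline_mse = mse(fit_gbr(xs, ys, n_stages=0), xs, ys)
mse_10 = mse(fit_gbr(xs, ys, n_stages=10), xs, ys)
mse_100 = mse(fit_gbr(xs, ys, n_stages=100), xs, ys)
print(baseline_mse > mse_10 > mse_100)  # True
```

With a small learning rate, each stage removes only part of the remaining error, so the training error keeps falling as stages are added. This is the tradeoff between the learning rate and the number of iterations described above.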

2.3. Evaluation

We evaluated the performance of our prediction results obtained from the ML method by employing four statistical metrics that are heavily used in the literature. These metrics are mean absolute error (MAE), relative root mean square error (rRMSE), mean absolute percentage error (MAPE), and the determination coefficient ($R^2$). We preferred to evaluate the percentage improvement of one model by comparing it to another model. The definitions of these four metrics are set out below.

2.3.1. MAE

MAE is a metric that evaluates the absolute error between the predicted value and the actual value, and MAE is also used as the model goal in our GBR algorithm [55]. MAE is defined as follows:
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
where $n$ is the number of samples, $y_i$ is the actual value (real CO2 emissions), and $\hat{y}_i$ is the model-predicted value, i.e., the predicted CO2 emissions of the $i$th sample. MAE takes values from zero to $+\infty$, and small MAE values are desirable [56].

2.3.2. rRMSE

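rRMSE, the relative RMSE, is obtained by dividing the RMSE value by the mean actual value:
$$rRMSE = \frac{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}}{\bar{y}} \times 100$$
rRMSE is non-negative and is expressed as a percentage of the mean actual value, so values typically fall between 0 and 100%. A result becomes more desirable as it approaches zero [57,58].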

2.3.3. MAPE

MAPE reflects the size of the errors as a percentage. It is a statistical benchmark for how accurate a prediction model is, since it is scale-independent and interpretable [59].
$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
A smaller MAPE value means that prediction results are more desirable.

2.3.4. R2

$R^2$ is the most important index for verifying the accuracy of the predicted result of a regression algorithm, and it has a range of [0, 1]. It indicates how well the trend of the model results tracks the trend of the actual data, as a normalized value [58,60]. The definition of $R^2$ is
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $\bar{y}$ is the mean of the actual values $y_i$. An $R^2$ value of 1 would mean that the regression model makes predictions without any error. Therefore, the larger the $R^2$ value, the better the model fitting result.
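The four metrics defined in this section can be written as small Python functions. This is a minimal sketch with hypothetical values. MAPE is returned as a fraction and rRMSE as a percentage, following the formulas above, and the actual values are assumed to be non-zero and not all equal.

```python
import math

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rrmse(actual, predicted):
    """RMSE divided by the mean actual value, times 100."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    return rmse / (sum(actual) / n) * 100.0

def mape(actual, predicted):
    """Mean absolute percentage error as a fraction; requires non-zero actual values."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted values; every prediction is off by 10.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
print(mae(actual, predicted))              # 10.0
print(round(rrmse(actual, predicted), 4))  # 4.0
print(round(mape(actual, predicted), 4))   # 0.0521
print(round(r2(actual, predicted), 4))     # 0.992
```

The example shows why the metrics complement one another: the absolute error is the same for every sample, but MAPE weights the error on the smallest actual value most heavily.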

2.3.5. N-fold Cross-Validation

Cross-validation is primarily used in applied ML to estimate the performance of a model on unseen data. That is, it uses a limited sample to estimate how the model is generally expected to perform when used to make predictions on data that was not used during the model’s training. The regression model performance in our study was evaluated by applying an n-fold cross-validation process [61]. The validation procedure has a single parameter, n, that refers to the number of subsets into which a given data sample is split. As such, the procedure is often called n-fold cross-validation. The general procedure is as follows:
  • Randomly shuffle the dataset and split the dataset into n subsets.
  • For each subset in the n subsets:
    • Take the subset as a holdout or test dataset.
    • Take the remaining n–1 subsets as a training dataset.
    • Fit a model on the training set and evaluate it on the test set.
    • Retain the evaluation score and discard the model.
  • Average the evaluation score of n iterations.
The commonly chosen n is 4 or 10. In this study, we performed fourfold cross-validation for our experiments.
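The procedure listed above can be sketched in plain Python as follows. The helper names, the fixed random seed, and the toy mean-predictor model are assumptions made for illustration. Any fitting and scoring functions with the same call shape could be passed in instead.

```python
import random

def n_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into n_folds nearly equal subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]

def cross_validate(xs, ys, n_folds, fit, score, seed=0):
    """Average the held-out score over n_folds train/test splits."""
    scores = []
    for test_idx in n_fold_indices(len(xs), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
        # The fitted model is discarded here; only its score is kept.
    return sum(scores) / len(scores)

# Toy model (an assumption for illustration): always predict the training mean.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def mae_score(model, test_x, test_y):
    return sum(abs(y - model) for y in test_y) / len(test_y)

folds = n_fold_indices(300, 4)   # 300 instances and four folds, as in this study
xs = list(range(300))
ys = [5.0] * 300                 # constant target, so the toy model is exact
cv_mae = cross_validate(xs, ys, 4, fit_mean, mae_score)
print([len(f) for f in folds], cv_mae)  # [75, 75, 75, 75] 0.0
```

With 300 instances and four folds, each model is trained on 225 instances and evaluated on the remaining 75.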

3. Results

We analyzed the ML method to predict transportation-based CO2 emissions for all of the top 30 CO2 emissions countries, as well as for Tier 1 and Tier 2 countries. The results are based on fourfold cross-validation.

3.1. Prediction of Transportation-Based CO2 Emissions Using ML Methods

Table 5 shows the performance of three ML methods in predicting the transportation-based CO2 emissions of 30 countries between 2005 and 2014, with three different feature sets. The statistical metrics of the performance are all based on fourfold cross-validation.
Based on the MAE results in Table 5, the best MAE value is 0.0061, from GBR_ALL. Comparing the three ML models, the best values of the MAE metric are 0.0061 (GBR_ALL), 0.0069 (SVM_ALL), and 0.0111 (OLS_ALL) for GBR, SVM, and OLS, respectively. Comparing the three feature combination strategies, the best values of the MAE metric are 0.0061 (GBR_ALL), 0.0067 (GBR_TRAN), and 0.0091 (GBR_SoEco) for the ALL, TRAN, and SoEco features, respectively. The differences among the MAE values of the other algorithms are much smaller. Even if we consider RMAE (the square root of MAE), the metric values of the different models are still very small. Therefore, it is useful to discuss the results of the other metrics in evaluating the success of the algorithms’ predictions of transportation-based CO2 emissions.
Based on the calculated results of the metrics, the $R^2$ value for transportation-based CO2 emission prediction varied between 0.9577 and 0.9943. $R^2$ was the most frequently used metric in discussing the success of prediction results with respect to the actual data, and it provided an idea of how the predicted curves follow those of the actual data. The $R^2$ value of GBR_ALL had the best value of 0.9943, while OLS_SoEco had the worst $R^2$ result of 0.9577. Of the three ML algorithms, GBR performed better than SVM, and SVM performed better than OLS. Comparing the three feature combination sets, we found that ALL is better than TRAN, and TRAN is better than SoEco.
The MAPE metric gives the percentage error of the prediction results. A previous study suggested interpreting MAPE values using four classes [62]. Accordingly,
When MAPE ≤ 10%, the prediction results can be classified as having high prediction accuracy.
When 10% < MAPE ≤ 20%, the prediction results can be classified as having good prediction accuracy.
When 20% < MAPE ≤ 50%, the prediction results can be classified as having reasonable prediction accuracy.
When MAPE > 50%, the prediction results can be classified as inaccurate.
Based on this commonly used classification, the predictions of the GBR algorithm can be categorized as having good prediction accuracy: its MAPE values for transportation-based CO2 emissions were between 10% and 20% with all three feature combination strategies (14.08%, 13.55%, and 14.91% for ALL, TRAN, and SoEco, respectively). The SVM and OLS models had higher MAPE values, above 20% for every feature set (Table 5).
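The four MAPE classes can be written as a small lookup. The sketch below is illustrative: the function name `classify_mape` is hypothetical, and the values are the MAPE fractions reported in Table 5.

```python
def classify_mape(mape):
    """Map a MAPE value (as a fraction, e.g. 0.1408) to the four-way scale."""
    if mape <= 0.10:
        return "high"
    if mape <= 0.20:
        return "good"
    if mape <= 0.50:
        return "reasonable"
    return "inaccurate"

# MAPE values for the 30-country models, copied from Table 5.
table5_mape = {
    "OLS_ALL": 0.3132, "SVM_ALL": 0.2144, "GBR_ALL": 0.1408,
    "OLS_TRAN": 0.4715, "SVM_TRAN": 0.2791, "GBR_TRAN": 0.1355,
    "OLS_SoEco": 0.5405, "SVM_SoEco": 0.3838, "GBR_SoEco": 0.1491,
}
mape_classes = {model: classify_mape(value) for model, value in table5_mape.items()}
```

With these inputs, all three GBR models fall in the "good" class.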
rRMSE expresses the prediction error as a percentage. A classification commonly used in the literature helps interpret the performance of the algorithms in terms of the rRMSE metric [63]. In this classification,
When rRMSE < 10%, the prediction results can be classified as excellent.
When 10% < rRMSE < 20%, the prediction results can be classified as good.
When 20% < rRMSE < 30%, the prediction results can be classified as fair.
When rRMSE > 30%, the prediction results can be classified as poor.
As shown in Table 5, the rRMSE values of the OLS, SVM, and GBR algorithms, when using ALL features, are 17.70%, 13.91%, and 11.65%, respectively. Based on this classification, the prediction results of all three algorithms with ALL features can be classified as good. However, the lowest rRMSE overall was obtained by GBR_TRAN, with a score of 11.63%.
In sum, the ML algorithms presented good results for predicting transportation-based CO2 emissions, with GBR providing the best results of the three. The ALL feature set was the best choice on most metrics, with the TRAN features a close second.
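The rRMSE classes can be applied in the same way. In this sketch the function name `classify_rrmse` is hypothetical, the values are the 30-country rRMSE fractions from Table 5, and, because the source uses strict inequalities, the handling of values exactly on a boundary (10%, 20%, 30%) is an assumption: they are assigned to the upper band.

```python
def classify_rrmse(rrmse):
    """Map an rRMSE value (as a fraction, e.g. 0.1165) to the four-way scale."""
    # Assumption: boundary values fall into the upper (worse) band.
    if rrmse < 0.10:
        return "excellent"
    if rrmse < 0.20:
        return "good"
    if rrmse < 0.30:
        return "fair"
    return "poor"

# rRMSE values for the 30-country models, copied from Table 5.
table5_rrmse = {
    "OLS_ALL": 0.1770, "SVM_ALL": 0.1391, "GBR_ALL": 0.1165,
    "OLS_TRAN": 0.2312, "SVM_TRAN": 0.1687, "GBR_TRAN": 0.1163,
    "OLS_SoEco": 0.3207, "SVM_SoEco": 0.1997, "GBR_SoEco": 0.1728,
}
rrmse_classes = {model: classify_rrmse(value) for model, value in table5_rrmse.items()}
lowest_rrmse_model = min(table5_rrmse, key=table5_rrmse.get)
```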

Features for Prediction Analysis

Figure 4 shows the feature importance of our best model, GBR_ALL, for the top 30 CO2 emissions-producing countries. The feature importance is the impurity-based (Gini) feature importance of the GBR method: the higher the value, the more important the feature. The importance of a feature is calculated as the (normalized) total reduction of the split criterion brought by that feature, so the importance values of all features sum to 1.
According to Figure 4, the most important feature for predicting transportation-based CO2 emissions is air transportation, followed by the railroad and vehicle transportation factors. The socioeconomic factors GDP and population are also important for the model. Interestingly, among the country indicators for all 30 countries, CHN is also an important feature, which suggests that China has its own transportation CO2 emissions pattern. Among all TRAN features, air transportation features are more important than railroad features, and railroad features are more important than vehicle-in-use features.
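The normalization step can be sketched as follows. The raw reduction values and feature names are hypothetical and are not read from Figure 4; the function name `normalize_importances` is likewise an illustrative assumption.

```python
def normalize_importances(raw_reductions):
    """Scale per-feature total criterion reductions so that they sum to 1."""
    total = sum(raw_reductions.values())
    if total == 0:
        # No split reduced the criterion: every feature gets zero importance.
        return {name: 0.0 for name in raw_reductions}
    return {name: value / total for name, value in raw_reductions.items()}

# Hypothetical raw reductions, NOT values read from Figure 4.
raw = {"air_passengers": 6.0, "rail_km": 3.0, "gdp": 1.0}
importances = normalize_importances(raw)
ranked = sorted(importances, key=importances.get, reverse=True)
```

Since the values are relative shares, an importance is only meaningful in comparison with the other features of the same model.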

3.2. Predicting Transportation-Based CO2 Emissions for Tier 1 Countries

Table 6 shows the statistical metric results for transportation-based CO2 emissions prediction for Tier 1 countries (the top five CO2 emissions-producing countries) using ML algorithms. The overall model performance in terms of R² value ranged between 0.8481 and 0.9948, with the best being GBR_TRAN. According to the rRMSE metric, the best model was SVM_SoEco with a value of 0.0738, which is classified as excellent. According to the MAPE classification, GBR_SoEco was the best model, with a value of 0.0551, which is classified as having high prediction accuracy. For Tier 1 countries, the SoEco feature set performed better than the other two, according to the MAPE.
Figure 5 shows the feature importance of the GBR_ALL model for the top five CO2 emissions-producing countries. In this category, the U.S. was the most important feature for the model, followed by the SoEco feature of GDP, with vehicle-in-use third. Almost all TRAN features were important for the model. Vehicle-in-use features were more important than air-related features, and air-related features were more important than rail-related features.

3.3. Predicting Transportation-Based CO2 Emissions for Tier 2 Countries

Table 7 shows the statistical metric results for transportation-based CO2 emissions prediction for Tier 2 countries (25 countries) using ML algorithms. The overall model performance in terms of R² value ranged between 0.7917 and 0.9780. The best model was SVM_ALL, with an R² value of 0.9780 and an rRMSE value of 0.1465, which is classified as good. According to the MAPE classification, GBR_TRAN was a good model, with a value of 0.1277. For Tier 2 countries, the TRAN feature set performed better than the SoEco feature set, according to MAPE, although combining both types of features using the SVM method performed best most of the time.
Figure 6 shows the feature importance of the GBR_ALL model for Tier 2 CO2 emissions-producing countries. In this category, the SoEco feature of GDP was the most important feature, followed by vehicle-in-use for passengers, and then by another SoEco feature, population. In this category, not all TRAN features remained important. Air-related features were still important, but rail-related features were not. Other SoEco features (GDP added value in industry, service, and agriculture) were also important.

4. Discussion

Although most of the top five countries by total CO2 emissions are also the top transportation-related CO2 emissions-producing countries, the rankings differ. India is one of the top five total CO2 emissions-producing countries, but not in the transportation sector. China has the highest overall CO2 emissions, but the U.S. has the highest transportation-related CO2 emissions.
According to the Pearson correlation coefficient, there are strong correlations between both TRAN features (air passengers carried, total railway length, vehicles in use) and SoEco features (country GDP, country total population, GDP value added by agriculture, industry, and service) and transportation-related CO2 emissions. The U.S. and China are specifically recognized as important factors, which suggests that these countries have their own special trends. However, the year and most of the other countries are not recognized as important factors.
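The correlation between a country indicator and the target can be computed as in the sketch below. The one-hot flag and the emissions values are hypothetical, in arbitrary units, and the function name `pearson_r` is an illustrative assumption; the sketch only shows how a single high-emitting country can produce a large coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: a one-hot country flag against emissions in arbitrary units.
is_country = [1, 1, 0, 0, 0, 0]
emissions = [9.0, 8.0, 1.0, 2.0, 1.5, 2.5]
r_flag = pearson_r(is_country, emissions)
```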
Considering all metrics, the GBR and SVM models had better performance in transportation-based CO2 emissions prediction, while the OLS model generally provided the worst results of the three methods. Nevertheless, the results demonstrated that each ML algorithm presents very satisfactory results in predicting CO2 emissions.
Based on the widely used rRMSE classification found in the literature, all models were categorized as good in predicting CO2 emissions for all 30 countries, with the exception of OLS_TRAN and OLS_SoEco. The SVM_ALL and SVM_SoEco models were both categorized as excellent in predicting CO2 emissions of the top five countries. However, the same two models were only categorized as good in predicting CO2 emissions for Tier 2 countries. We believe there are other important features that determine Tier 2 countries’ CO2 emissions from transportation and that are not yet included in our model.
According to the MAPE classification in the literature, all GBR models showed good prediction accuracy for all 30 countries, high prediction accuracy for the top five CO2 emissions-producing countries, and good prediction accuracy for Tier 2 countries (except for the GBR_ALL model). This result further supports the view that there are other important features that determine Tier 2 countries’ CO2 emissions from transportation and that are not yet included in our model. Such features should be explored to further improve the predictions for Tier 2 countries.
According to the feature importance analysis of the GBR models, we found that GDP is always one of the most important features for predicting CO2 emissions in any type of country. TRAN features were the most important features for transportation CO2 emissions prediction for all 30 countries and for the top five emissions-producing countries. However, in the Tier 2 emissions-producing countries, SoEco features, including population and GDP added value in industry, service, and agriculture, were also very important for prediction. Given that our models’ predictions for Tier 2 countries were less accurate than those for Tier 1 countries, we see the necessity of further exploring other SoEco features to improve transportation-based CO2 prediction for Tier 2 countries.

5. Conclusions

This paper aimed to predict transportation-based CO2 emissions using three ML methods (OLS, SVM, and GBR). We used three types of features: TRAN (transportation-related features only), SoEco (socioeconomic features only), and ALL (a combination of the TRAN and SoEco features). Four statistical metrics were used to assess the performance of the algorithms. Three groups of countries were targeted: all top 30 CO2 emissions-producing countries (accounting for 96% of global CO2 emissions), Tier 1 countries (the top five CO2 emissions-producing countries, accounting for 61% of global CO2 emissions), and Tier 2 countries (the next 25 CO2 emissions-producing countries, accounting for 35% of global CO2 emissions).
In sum, all three ML algorithms can predict countries’ CO2 emissions arising from the transportation sector. Of these methods, GBR performs better than OLS and SVM. TRAN features are the most important features for transportation CO2 emissions prediction when all 30 countries are modeled together, while SoEco features, especially GDP and population, become more important when Tier 1 and Tier 2 countries are modeled separately. Our prediction approach and the identified influential factors may aid in near-future attempts by decision-makers to reduce the growth rate of transportation-related CO2 emissions.

Author Contributions

Conceptualization, A.R. and Q.L.; methodology, A.R., X.L., and Q.L.; software, A.R., X.L., and Q.L.; validation, A.R. and X.L.; formal analysis, A.R., X.L., and Q.L.; investigation, A.R., X.L., and Q.L.; data curation, A.R., X.L., and Q.L.; writing—original draft preparation, A.R., X.L., and Q.L.; writing—review and editing, A.R., X.L., and Q.L.; supervision, Q.L.; project administration, Q.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Anhui Philosophy and Social Science Planning Project, grant number AHSKY2017D24.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. United Nation. Climate Change, ‘Biggest Threat Modern Humans Have Ever Faced’, World-Renowned Naturalist Tells Security Council, Calls for Greater Global Cooperation. 2021. Available online: https://www.un.org/press/en/2021/sc14445.doc.htm (accessed on 10 January 2022).
  2. IPCC 2001. Climate Change 2001 Synthesis Report: Mitigation. In Contribution of Working Group III to the Third Assessment Report of the Intergovernmental Panel on Climate Change, 2001; Watson, R.T., Ed.; Cambridge University Press: Cambridge, UK, 2001.
  3. Pachauri, R.K.; Meyer, L.A. (Eds.) IPCC 2014 Climate Change 2014: Synthesis Report. In Contribution of Working Groups I, II and III to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change; IPCC: Geneva, Switzerland, 2014; p. 151.
  4. Nallapaneni, M.K.; Dash, A. Internet of things: An opportunity for transportation and logistics. In Proceedings of the International Conference on Inventive Computing and Informatics, ICICI, Coimbatore, India, 23–24 November 2017; pp. 194–197.
  5. Onat, N.C.; Kucukvar, M.; Tatari, O. Towards life cycle sustainability assessment of alternative passenger vehicles. Sustainability 2014, 6, 9305–9342.
  6. Garcia-Lopez, M.-L.; Pasidis, I.; Viladecans-Marsal, E. Express delivery to the suburbs: The effects of transportation in Europe’s heterogeneous cities. SSRN Electron. J. 2015.
  7. Danish, T.H.; Baloch, M.A.; Suad, S. Modeling the impact of transport energy consumption on CO2 emission in Pakistan: Evidence from ARDL approach. Environ. Sci. Pollut. Res. 2018, 25, 9461–9473.
  8. Tarhan, C.; Çil, M.A. A study on hydrogen, the clean energy of the future: Hydrogen storage methods. J. Energy Storage 2021, 40, 102676.
  9. Giannakis, E.; Serghides, D.; Dimitriou, S.; Zittis, G. Land transport CO2 emissions and climate change: Evidence from Cyprus. Int. J. Sustain. Energy 2020, 39, 634–647.
  10. U.S. Energy Information Administration. International Energy Outlook 2016; U.S. Energy Information Administration: Washington, DC, USA, 2016. Available online: https://www.eia.gov/outlooks/ieo/pdf/transportation.pdf (accessed on 2 January 2022).
  11. Sajida, M.J.; Cao, Q.; Kanga, W. Transport sector carbon linkages of EU’s top seven emitters. Transp. Policy 2019, 80, 24–38.
  12. Gonzalez, R.M.; Marrero, G.; Rodriguez-Lopes, J.; Marrero, A. Analyzing CO2 emissions from passenger cars in Europe: A dynamic panel data approach. Energy Policy 2019, 129, 1271–1281.
  13. United Nations Treaty Collection. Paris Agreement; United Nations Treaty Collection: New York, NY, USA, 2015; Available online: https://unfccc.int/process-and-meetings/the-paris-agreement/the-paris-agreement (accessed on 2 January 2022).
  14. Lean, H.H.; Huang, W.; Hong, J. Logistics and economic development: Experience from China. Transp. Policy 2014, 32, 96–104.
  15. Yaacob, N.F.F.; Mat Yazid, M.R.; Abdul Maulud, K.N.; Ahmad Basri, N.E. A Review of the Measurement Method, Analysis and Implementation Policy of Carbon Dioxide Emission from Transportation. Sustainability 2020, 12, 5873.
  16. Song, W.; Ren, A.; Li, X.; Li, Q. The Orchestrating Role of Carbon Subsidies in a Capital-Constrained Supply Chain. Math. Probl. Eng. 2021, 2021, 8920624.
  17. Beyzatlar, M.A.; Karacal, M.; Yetkiner, H. Granger-causality between transportation and GDP: A panel data approach. Transp. Res. 2014, 63, 43–55.
  18. Pradhan, R.P.; Bagchi, T.P. Effect of transportation infrastructure on economic growth in India: The VECM approach. Res. Transp. Econ. 2013, 38, 139–148.
  19. Kustepeli, Y.; Gulcan, Y.; Akgungor, S. Transportation infrastructure investment, growth and international trade in Turkey. Appl. Econ. 2012, 44, 2619–2629.
  20. Yu, N.; De Jong, M.; Storm, S.; Mi, J. Transport infrastructure, spatial clusters and regional economic growth in China. Transp. Rev. 2012, 32, 3–28.
  21. Liddle, B.; Lung, S. The long-run causal relationship between transport energy consumption and GDP: Evidence from heterogeneous panel methods robust to cross-sectional dependence. Econ. Lett. 2013, 121, 524–527.
  22. Lean, C.S. Empirical tests to discern linkages between construction and other economic sectors in Singapore. Constr. Manag. Econ. 2001, 19, 355–363.
  23. Eruygur, A.; Kaynak, M.; Mert, M. Transportation-communication capital and economic growth: A VECM analysis for Turkey. Eur. Plan. Stud. 2012, 20, 341–363.
  24. Lakshmanan, T.R.; Han, X. Factors underlying transportation CO2 emissions in the U.S.A.: A decomposition analysis. Transp. Res. Transp. Environ. 1997, 2, 1–15.
  25. Scholl, L.; Schipper, L.; Kiang, N. CO2 emissions from passenger transport: A comparison of international trends from 1973 to 1992. Energy Policy 1996, 24, 17–30.
  26. Lu, I.J.; Lin, S.J.; Lewis, C. Decomposition and decoupling effects of carbon dioxide emission from highway transportation in Taiwan, Germany, Japan and South Korea. Energy Policy 2007, 35, 3226–3235.
  27. Timilsina, G.R.; Shrestha, A. Transport sector CO2 emissions growth in Asia: Underlying factors and policy options. Energy Policy 2009, 37, 4523–4539.
  28. Zhu, X.; Li, R. An Analysis of Decoupling and Influencing Factors of Carbon Emissions from the Transportation Sector in the Beijing-Tianjin-Hebei Area, China. Sustainability 2017, 9, 722.
  29. Liang, Y.; Niu, D.; Wang, H.; Li, Y. Factors Affecting Transportation Sector CO2 Emissions Growth in China: An LMDI Decomposition Analysis. Sustainability 2017, 9, 1730.
  30. Kim, S. Decomposition Analysis of Greenhouse Gas Emissions in Korea’s Transportation Sector. Sustainability 2019, 11, 1986.
  31. Yuan, Y.; Wang, Y.; Chi, Y.; Jin, F. Identification of Key Carbon Emission Sectors and Analysis of Emission Effects in China. Sustainability 2020, 12, 8673.
  32. Hassouna, F.; Al-Sahili, K. Environmental impact assessment of the transportation sector and hybrid vehicle implications in Palestine. Sustainability 2020, 12, 7878.
  33. Lotfalipour, M.; Falahi, M.; Bastam, M. Prediction of CO2 emissions in Iran using Grey and ARIMA models. Int. J. Energy Econ. Policy 2013, 3, 229–237.
  34. Chigora, F.; Thabani, N.; Mutambara, E. Forecasting CO2 emission for Zimbabwe’s tourism destination vibrancy: A univariate approach using box-Jenkins ARIMA model. Afr. J. Hosp. Tour. Leis. 2019, 8.
  35. Ayvaz, B.; Kusakci, A.O.; Temur, G.T. Energy-related CO2 emission forecast for Turkey and Europe and Eurasia. Grey Syst. Theory Appl. 2017, 7, 436–452.
  36. Ofosu-Adarkwa, J.; Xie, N.; Javed, S.A. Forecasting CO2 emissions of China’s cement industry using a hybrid Verhulst-GM (1, N) model and emissions’ technical conversion. Renew. Sustain. Energy Rev. 2020, 130, 109945.
  37. Yang, H.; O’Connell, J.F. Short-term carbon emissions forecast for aviation industry in Shanghai. J. Clean. Prod. 2020, 275, 122734.
  38. Ang, C.; Morad, N.; Ismail, M.; Ismail, N. Projection of carbon dioxide emissions by energy consumption and transportation in Malaysia: A time series approach. J. Energy Technol. Policy 2013, 3, 63–75.
  39. Rezaei, M.H.; Sadeghzadeh, M.; Alhuyi Nazari, M.; Ahmadi, M.H.; Astaraei, F.R. Applying GMDH artificial neural network in modeling CO2 emissions in four nordic countries. Int. J. Low Carbon Technol. 2018, 13, 266–271.
  40. Tsay, Y.-S.; Yeh, C.-Y.; Chen, Y.-H.; Lu, M.-C.; Lin, Y.-C. A Machine Learning-Based Prediction Model of LCCO2 for Building Envelope Renovation in Taiwan. Sustainability 2021, 13, 8209.
  41. Zeng, H.; Shao, B.; Bian, G.; Dai, H.; Zhou, F. Analysis of Influencing Factors and Trend Forecast of CO2 Emission in Chengdu-Chongqing Urban Agglomeration. Sustainability 2022, 14, 1167.
  42. Hui, M. 2020 World Development Indicators from World Bank Open Data; Kaggle: San Francisco, CA, USA, 2021; Available online: https://www.kaggle.com/manchunhui/world-development-indicators (accessed on 2 January 2022).
  43. United Nations. Country Classification. 2014. Available online: https://www.un.org/en/development/desa/policy/wesp/wesp_current/2014wesp_country_classification.pdf (accessed on 6 January 2022).
  44. Ritchie, H. Cars, Planes, Trains: Where Do CO2 Emissions from Transport Come from? Ourworldindata. 2020. Available online: https://ourworldindata.org/co2-emissions-from-transport (accessed on 2 January 2022).
  45. Bakay, M.S.; Agbulut, Ü. Electricity production-based forecasting of greenhouse gas emissions in Turkey with deep learning, support vector machine and artificial neural network algorithms. J. Clean. Prod. 2021, 285, 125324.
  46. Hidecker, M.J.C.; Ho, N.T.; Dodge, N.; Hurvitz, E.A.; Slaughter, J.; Workinger, M.S.; Paneth, N. Inter-relationships of functional status in cerebral palsy: Analyzing gross motor function, manual ability, and communication function classification systems in children. Dev. Med. Child. Neurol. 2012, 54, 737–742.
  47. Stock, J.H.; Watson, M.W. Introduction to Econometrics; Addison Wesley: Boston, MA, USA, 2003.
  48. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
  49. Harris, D.; Burges, C.J.C.; Kaufman, L.; Smola, A.J.; Vapnik, V. Support Vector Regression Machines, in Advances in Neural Information Processing Systems 9. NIPS 1997, 779–784.
  50. Joachims, T. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of the 1999 International Conference on Machine Learning (ICML 1999); Universität Dortmund: Dortmund, Germany, 1999; pp. 200–209.
  51. Ben-Hur, A.; Horn, D.; Siegelmann, H.; Vapnik, V. Support vector clustering. J. Mach. Learn Res. 2001, 2, 125–137.
  52. Hsu, C.-W.; Lin, C.-J. A Comparison of Methods for Multiclass Support Vector Machines. IEEE Trans. Neural Netw. Learn. Syst. 2002, 13, 415–425.
  53. Polson, N.G.; Scott, S.L. Data Augmentation for Support Vector Machines. Bayesian Anal. 2011, 6, 1–23.
  54. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232.
  55. Cai, J.; Xu, K.; Zhu, Y.; Hu, F.; Li, L. Prediction and analysis of net ecosystem carbon exchange based on gradient boosting regression and random forest. Appl. Energy 2020, 262, 114566.
  56. Hu, W.; Shao, M.; Reichardt, K. Using a new criterion to identify sites for mean soil water storage evaluation. Soil Sci. Soc. Am. J. 2010, 74, 762–773.
  57. Chen, J.L.; Li, G.S.; Wu, S.J. Assessing the potential of support vector machine for estimating daily solar radiation using sunshine duration. Energy Convers. Manag. 2013, 75, 311–318.
  58. Ağbulut, Ü. Forecasting of transportation-related energy demand and CO2 emissions in Turkey with different machine learning algorithms. Sustain. Prod. Consum. 2022, 29, 141–157.
  59. Kim, S.; Kim, H. A new metric of absolute percentage error for intermittent demand forecasts. Int. J. Forecast. 2016, 32, 669–679.
  60. Chakraborty, D.; Elzarka, H. Performance testing of energy models: Are we using the right statistical metrics? J. Build. Perform. Simul. 2018, 11, 433–448.
  61. Li, Q.; Deleger, L.; Lingren, T.; Zhai, H.J.; Kaiser, M.; Stoutenborough, L.; Jegga, A.G.; Cohen, K.B.; Solti, I. Mining FDA drug labels for medical conditions. BMC Med. Inf. Decis. Mak. 2013, 13, 53.
  62. Montaño Moreno, J.J.; Palmer Pol, A.; Sesé Abad, A.; Cajal Blasco, B. Using the R-MAPE index as a resistant measure of forecast accuracy. Psicothema 2013, 25, 500–506.
  63. Tuncer, A.D.; Sözen, A.; Afshari, F.; Khanlari, A.; Şirin, C.; Gungor, A. Testing of a novel convex-type solar absorber drying chamber in dehumidification process of municipal sewage sludge. J. Clean. Prod. 2020, 272, 122862.
Figure 1. Yearly CO2 emissions by sector. Transportation-based CO2 emissions have tripled in the past 30 years.
Figure 2. Total CO2 emissions vs. CO2 emissions from TRAN by country. CO2 emissions (in kt) are summarized from 2005–2014 data. Percentages represent each country’s contribution to global total CO2 emissions. (a) Total CO2 emissions; (b) CO2 emissions from TRAN.
Figure 3. Pearson correlation matrix between the features and the target variable (transportation-based CO2 emissions).
Figure 4. Importance analysis of the GBR-ALL model for the top 30 CO2 emissions-producing countries. Passenger air transportation (TRAN) is the most important feature, followed by railroad total transportation (TRAN) and air registered carrier transportation (TRAN), as well as other TRAN features.
Figure 5. Feature importance analysis of the GBR-ALL model for Tier 1, the top 5 CO2 emissions-producing countries. USA (SoEco) is the most important feature, followed by GDP (SoEco) and passenger vehicle-in-use (TRAN), as well as other TRAN features.
Figure 6. Feature importance analysis of the GBR-ALL model for Tier 2 CO2 emissions-producing countries. GDP (SoEco) is the most important feature, followed by passenger vehicle-in-use (TRAN) and population (SoEco).
Table 1. Summary of Features.
| Description | Indicator | Source |
|---|---|---|
| Socioeconomic factors (SoEco) | | |
| GDP (current USD) | NY.GDP.MKTP.CD | WDI |
| Unemployment, total (% of total labor force) (modeled ILO estimate) | SL.UEM.TOTL.ZS | WDI |
| Agriculture, value added per worker (constant 2010 USD) | NV.AGR.EMPL.KD | WDI |
| Industry, value added per worker (constant 2010 USD) | NV.IND.EMPL.KD | WDI |
| Services, value added per worker (constant 2010 USD) | NV.SRV.EMPL.KD | WDI |
| Developed or developing country | Developing 0, Developed 1 | UN |
| Fuel-exporting country | Fuel-exporting Countries | UN |
| High income, upper middle income, lower middle income, low income | High Income; Upper Middle Income; Lower Middle Income | UN |
| Population, total | SP.POP.TOTL | WDI |
| Transportation factors (TRAN) | | WDI |
| Air transportation, passengers carried | IS.AIR.PSGR | WDI |
| Air transportation, registered carrier departures worldwide | IS.AIR.DPRT | WDI |
| Air transportation, freight (million tons in km) | IS.AIR.GOOD.MT.K1 | WDI |
| Container port traffic (TEU: 20 foot equivalent units) | IS.SHP.GOOD.TU | WDI |
| Rail lines (total routes in km) | IS.RRS.TOTL.KM | WDI |
| Commercial vehicles in use (per 1000 units) | IS.VHL.COM | OICA |
| Passenger cars in use (per 1000 units) | IS.VHL.PSGR | OICA |
| CO2 related factors | | |
| CO2 emissions (kt) | EN.ATM.CO2E.KT | WDI |
| CO2 emissions from residential buildings and commercial and public services (% of total fuel combustion) | EN.CO2.BLDG.ZS | WDI |
| CO2 emissions from electricity and heat production, total (% of total fuel combustion) | EN.CO2.ETOT.ZS | WDI |
| CO2 emissions from manufacturing industries and construction (% of total fuel combustion) | EN.CO2.MANF.ZS | WDI |
| CO2 emissions from other sectors, excluding residential buildings and commercial and public services (% of total fuel combustion) | EN.CO2.OTHX.ZS | WDI |
| CO2 emissions from transportation (% of total fuel combustion) | EN.CO2.TRAN.ZS | WDI |
Table 2. Summary of transportation-based CO2 emissions from the top 30 CO2 emissions-producing countries.
| Tier | Country Name | Country Code | % of Total CO2 Emissions | Avg. Yearly CO2 Emissions from Transportation (kt) (2005–2014) | Avg. Yearly Per Capita CO2 Emissions from Transportation (t) (2005–2014) |
|---|---|---|---|---|---|
| Tier 1 (five countries with 61% of total CO2 emissions from 2005–2014) | China | CHN | 27.85% | 656,467 | 0.49 |
| | United States | USA | 18.15% | 1,744,667 | 5.68 |
| | India | IND | 5.71% | 199,638 | 0.16 |
| | Russian Federation | RUS | 5.67% | 263,744 | 1.84 |
| | Japan | JPN | 4.04% | 227,222 | 1.78 |
| Tier 2 (25 countries with 35% of total CO2 emissions from 2005–2014) | Iran | IRN | 1.88% | 135,256 | 1.84 |
| | Canada | CAN | 1.81% | 168,799 | 5 |
| | Korea | KOR | 1.79% | 89,484 | 1.81 |
| | United Kingdom | GBR | 1.63% | 123,077 | 1.97 |
| | Saudi Arabia | SAU | 1.62% | 126,382 | 4.63 |
| | Mexico | MEX | 1.61% | 162,303 | 1.43 |
| | South Africa | ZAF | 1.57% | 56,821 | 1.11 |
| | Brazil | BRA | 1.40% | 189,664 | 0.97 |
| | Indonesia | IDN | 1.40% | 109,557 | 0.45 |
| | Italy | ITA | 1.37% | 116,602 | 1.97 |
| | Australia | AUS | 1.26% | 85,210 | 3.9 |
| | France | FRA | 1.17% | 131,503 | 2.03 |
| | Poland | POL | 1.03% | 44,409 | 1.17 |
| | Turkey | TUR | 0.99% | 52,053 | 0.72 |
| | Spain | ESP | 0.99% | 102,749 | 2.25 |
| | Ukraine | UKR | 0.98% | 35,142 | 0.76 |
| | Thailand | THA | 0.84% | 65,667 | 0.98 |
| | Kazakhstan | KAZ | 0.77% | 13,835 | 0.85 |
| | Malaysia | MYS | 0.70% | 52,240 | 1.87 |
| | Egypt, Arab Rep. | EGY | 0.68% | 48,031 | 0.58 |
| | Argentina | ARG | 0.62% | 47,167 | 1.16 |
| | Venezuela, RB | VEN | 0.60% | 53,289 | 1.89 |
| | Netherlands | NLD | 0.59% | 36,837 | 2.22 |
| | United Arab Emirates | ARE | 0.57% | 32,214 | 4.29 |
| | Pakistan | PAK | 0.54% | 43,081 | 0.24 |
Table 3. Descriptive statistics of the dataset.
| Feature Type | Feature | Count | Mean | Std | Min | 50% | Max |
|---|---|---|---|---|---|---|---|
| SoEco | Year | 300 | 2010 | 3 | 2005 | 2010 | 2014 |
| SoEco | SP.POP.TOTL (10^8) | 300 | 1.60 | 3.08 | 0.46 | 0.61 | 13.64 |
| TRAN | IS.AIR.DPRT | 296 | 758,007 | 1,743,070 | 17,302 | 315,383 | 10,095,200 |
| TRAN | IS.AIR.GOOD.MT.K1 | 300 | 4023 | 7393 | 1 | 1386 | 40,618 |
| TRAN | IS.SHP.GCNW.XQ | 290 | 50 | 25 | 8 | 41 | 135 |
| SoEco | NY.GDP.MKTP.CD (10^8) | 300 | 17,193.7 | 29,169.1 | 571.24 | 8427.63 | 175,217.5 |
| SoEco | NV.IND.EMPL.KD | 298 | 53,633 | 42,761 | 2678 | 32,233 | 202,808 |
| SoEco | NV.AGR.EMPL.KD | 298 | 28,539 | 40,609 | 1029 | 12,938 | 305,042 |
| SoEco | NV.SRV.EMPL.KD | 298 | 39,589 | 32,555 | 4156 | 23,095 | 104,388 |
| SoEco | SL.UEM.TOTL.ZS | 300 | 7 | 5 | 0 | 6 | 29 |
| TRAN | IS.AIR.PSGR | 296 | 67,599,190 | 133,450,700 | 1,160,286 | 33,191,170 | 762,710,000 |
| TRAN | IS.RRS.TOTL.KM | 224 | 27,536 | 40,124 | 58 | 15,026 | 194,431 |
| TRAN | IS.SHP.GOOD.TU | 288 | 12,972,780 | 25,000,180 | 516,698 | 6,586,637 | 186,679,100 |
| TRAN | IS.VHL.COM | 300 | 8074 | 21,318 | 69 | 3388 | 137,043 |
| TRAN | IS.VHL.PSGR | 300 | 20,227 | 26,123 | 796 | 11,067 | 135,882 |
| Target Variable | EN.CO2E.TRAN.KT | 300 | 173,770 | 316,713 | 10,794 | 89,566 | 1,838,933 |
Table 4. Pearson correlation coefficient between the categorical variables and transportation-based CO2 emissions.
| Country | Correlation Coefficient | Country | Correlation Coefficient | Country | Correlation Coefficient | Features | Correlation Coefficient |
|---|---|---|---|---|---|---|---|
| USA | 0.92 | GBR | −0.03 | SAU | −0.03 | Tier 1 | 0.63 |
| CHN | 0.28 | ITA | −0.03 | TUR | −0.07 | Tier 2 | −0.15 |
| RUS | 0.05 | IDN | −0.04 | EGY | −0.07 | | |
| JPN | 0.03 | ESP | −0.04 | ARG | −0.07 | Developing 0, Developed 1 | 0.23 |
| IND | 0.02 | KOR | −0.05 | POL | −0.08 | Fuel-exporting Countries 1 | 0.14 |
| BRA | 0.01 | AUS | −0.05 | PAK | −0.08 | High Income | 0.18 |
| CAN | 0.00 | THA | −0.06 | NLD | −0.08 | Upper Middle Income | −0.09 |
| MEX | −0.01 | ZAF | −0.07 | UKR | −0.08 | Lower Middle Income | −0.12 |
| IRN | −0.02 | VEN | −0.07 | ARE | −0.08 | | |
| FRA | −0.02 | MYS | −0.07 | KAZ | −0.09 | | |
Table 5. Statistical metric evaluation of transportation-based CO2 emissions prediction for the top 30 CO2 emissions-producing countries using ML algorithms. Colors indicate the level of performance.

| Model | MAE | RMAE | rRMSE | MAPE | R² |
|---|---|---|---|---|---|
| OLS_ALL | 0.0111 | 0.0161 | 0.1770 | 0.3132 | 0.9866 |
| SVM_ALL | 0.0069 | 0.0127 | 0.1391 | 0.2144 | 0.9927 |
| GBR_ALL | 0.0061 | 0.0112 | 0.1165 | 0.1408 | 0.9943 |
| OLS_TRAN | 0.0150 | 0.0205 | 0.2312 | 0.4715 | 0.9800 |
| SVM_TRAN | 0.0092 | 0.0150 | 0.1687 | 0.2791 | 0.9896 |
| GBR_TRAN | 0.0067 | 0.0122 | 0.1163 | 0.1355 | 0.9930 |
| OLS_SoEco | 0.0188 | 0.0280 | 0.3207 | 0.5405 | 0.9577 |
| SVM_SoEco | 0.0099 | 0.0159 | 0.1997 | 0.3838 | 0.9880 |
| GBR_SoEco | 0.0091 | 0.0179 | 0.1728 | 0.1491 | 0.9789 |
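The metrics named in the abstract (MAE, rRMSE, MAPE, and R²) can be sketched as below. The exact formulas used by the authors are not shown in these tables, so the standard textbook definitions are assumed; in particular, rRMSE is assumed to be the RMSE divided by the mean of the observed values, and MAPE is reported as a fraction rather than a percentage. The sample values are made up.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rrmse(y_true, y_pred):
    """Relative RMSE: RMSE divided by the mean observed value (assumed definition)."""
    return rmse(y_true, y_pred) / (sum(y_true) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; observed values must be nonzero."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values for one validation fold.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
scores = {
    "MAE": mae(observed, predicted),
    "rRMSE": rrmse(observed, predicted),
    "MAPE": mape(observed, predicted),
    "R2": r2(observed, predicted),
}
```

With cross-validation, one plausible reporting scheme is to compute these scores per fold and average them; whether the tables report fold averages or pooled predictions is not stated here.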
Table 6. Statistical metric results for transportation-based CO2 emissions prediction for Tier 1 countries (the top 5 CO2 emissions-producing countries), using ML methods.

| Model | MAE | RMAE | rRMSE | MAPE | R² |
|---|---|---|---|---|---|
| OLS_ALL | 0.0037 | 0.0050 | 0.1007 | 0.1282 | 0.9783 |
| SVM_ALL | 0.0027 | 0.0046 | 0.0878 | 0.0787 | 0.9805 |
| GBR_ALL | 0.0019 | 0.0025 | 0.1818 | 0.0628 | 0.9948 |
| OLS_TRAN | 0.0027 | 0.0046 | 0.0885 | 0.0814 | 0.9813 |
| SVM_TRAN | 0.0037 | 0.0049 | 0.1015 | 0.1127 | 0.9782 |
| GBR_TRAN | 0.0018 | 0.0024 | 0.1980 | 0.0538 | 0.9950 |
| OLS_SoEco | 0.0098 | 0.0130 | 0.2615 | 0.1086 | 0.8481 |
| SVM_SoEco | 0.0026 | 0.0039 | 0.0738 | 0.0876 | 0.9877 |
| GBR_SoEco | 0.0037 | 0.0064 | 0.2613 | 0.0551 | 0.9244 |
Table 7. Statistical metric results for transportation-based CO2 emissions prediction for Tier 2 countries, using ML methods.

| Model | MAE | RMAE | rRMSE | MAPE | R² |
|---|---|---|---|---|---|
| OLS_ALL | 0.0104 | 0.0145 | 0.2217 | 0.3178 | 0.9460 |
| SVM_ALL | 0.0050 | 0.0097 | 0.1465 | 0.1289 | 0.9780 |
| GBR_ALL | 0.0058 | 0.0103 | 0.2959 | 0.4957 | 0.9700 |
| OLS_TRAN | 0.0137 | 0.0182 | 0.2778 | 0.4482 | 0.9124 |
| SVM_TRAN | 0.0070 | 0.0132 | 0.2021 | 0.2098 | 0.9566 |
| GBR_TRAN | 0.0060 | 0.0100 | 0.3015 | 0.1277 | 0.9738 |
| OLS_SoEco | 0.0192 | 0.0271 | 0.4164 | 0.4947 | 0.7917 |
| SVM_SoEco | 0.0086 | 0.0125 | 0.1974 | 0.2444 | 0.9527 |
| GBR_SoEco | 0.0064 | 0.0110 | 0.3872 | 0.1558 | 0.9614 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Li, X.; Ren, A.; Li, Q. Exploring Patterns of Transportation-Related CO2 Emissions Using Machine Learning Methods. Sustainability 2022, 14, 4588. https://doi.org/10.3390/su14084588