Analysis and Prediction Model of Fuel Consumption and Carbon Dioxide Emissions of Light-Duty Vehicles

Abstract: Due to the alarming rate of climate change, fuel consumption and emission estimates are critical in determining the effects of materials and stringent emission control strategies. In this research, an analytical and predictive study has been conducted using the Government of Canada dataset, containing 4973 light-duty vehicles observed from 2017 to 2021, delivering a comparative view of different brands and vehicle models by their fuel consumption and carbon dioxide emissions. Based on the findings of the statistical data analysis, this study makes evidence-based recommendations to both vehicle users and producers to reduce their environmental impacts. Additionally, Convolutional Neural Networks (CNN) and various regression models have been built to estimate fuel consumption and carbon dioxide emissions for future vehicle designs. This study reveals that the Univariate Polynomial Regression model is the best model for predictions from one vehicle feature input, with up to 98.6% accuracy. Multiple Linear Regression and Multivariate Polynomial Regression are good models for predictions from multiple vehicle feature inputs, with approximately 75% accuracy. Convolutional Neural Network is also a promising method for prediction because of its stable and high accuracy of around 70%. The results contribute to the quantifying process of energy cost and air pollution caused by transportation, followed by proposing relevant recommendations for both vehicle users and producers. Future research should aim towards developing higher performance models and larger datasets for building APIs and applications.


Introduction
With the accelerated growth of urbanization, environmental issues caused by transportation have been challenging due to the significant negative impact on climate change [1]. Although the COVID-19 pandemic (commencing in 2020) has temporarily lessened the amount of greenhouse gas emitted into the atmosphere, the temperature of the planet is increasing due to ever-increasing air pollutants [2]. Moreover, 20 to 30% of global greenhouse gases (GHG) are emitted from passenger and freight transportation [3], and 75% of total carbon dioxide emissions originate from passenger cars [4]. Despite stringent fuel and greenhouse gas emission standards regulations, the number of used vehicles has significantly increased, corresponding with the rise in vehicle miles traveled (VMT), leading to their large percentage in air pollutant emissions and natural resource consumption [5].
Estimating and visualizing fuel consumption and exhaust emissions are critical for quantifying the energy cost and air pollution caused by transportation [6], as well as for detailing emission control strategies [7]. With climate change a pressing concern over the past decade, estimation models of CO2 emissions and fuel consumption from vehicles are of increasing significance. This has prompted global interest among researchers and engineers in applied research on sustainability, particularly in data analytics and machine learning [8,9]. Although many studies have introduced machine learning models and techniques for estimating carbon dioxide emissions and fuel consumption, the trend focuses more on optimizing models than on using vehicle metrics to analyze different vehicle types and brands [8,10,11]. A comparative study of different types of vehicles and their effect on the environment is therefore significant for the vehicle market, because it provides deep insight into the market's environmental impacts. This research addresses the identified gap by providing insight into vehicle fuel consumption and carbon dioxide emissions through rigorous data analytics and machine learning. It is worth noting that the data analysis and machine learning techniques applied in this research are transferable to similar datasets.
The following research objectives (RO) support the aim of this research. To implement and address the listed research objectives, an analytical and predictive study has been conducted on the Government of Canada dataset, containing 4973 light-duty vehicles observed from 2017 to 2021. Using four levels of data analytics methodology (i.e., Descriptive Statistical Analysis, Inferential Statistical Analysis, Machine Learning, and Deep Learning), the study reveals current trends and provides a comparative analysis of fuel consumption and carbon dioxide emissions across different brands, vehicle models, vehicle classes, cylinders, engine sizes, transmissions, fuel types, smog ratings, and fuel consumption in the city and on the highway. The research also predicts these features for the upcoming year and builds a predictive model for fuel consumption and carbon dioxide emissions based on relevant car specifications. The results contribute to quantifying the energy cost and air pollution caused by transportation, followed by proposing relevant recommendations for both vehicle users and producers. The prediction results from this study do not account for abrupt factors, such as legislative requirements, unpredictable economic crises, or similar unforeseen interruptions.

Literature Review
With the current alarming rate of climate change, due attention ought to be given to the environmental impact of fuel consumption and emissions from light-duty vehicles, particularly passenger cars. Vehicle emissions can be classified into two principal categories: exhaust emissions that are dangerous for air quality and human health, and emissions that contribute to climate change. The emission with the most significant effect on climate change is carbon dioxide (CO2), which represents the largest proportion of greenhouse gas (GHG) emissions. Notably, road transportation emits about one-fifth of the total carbon dioxide emissions in the European Union, 75% of which arises from passenger cars [4]. Moreover, the relation between fuel consumption and CO2 is direct and strong [12]. In the European Union (EU), average fleet emission limits are stated in terms of CO2 emissions, in grams per kilometer. In North America (i.e., the United States (US) and Canada), similar measures have been used, but with limits imposed in terms of fuel economy. Electric vehicles are a critical step in the transportation sector's decarbonization. However, the International Energy Agency estimates that at least 20% of all road transport vehicles (approximately 300 million vehicles) need to be powered by electricity by 2030 in order to keep global warming below 2°C [13]. Consequently, light-duty vehicles with low carbon intensity will continue to play a significant role during the transition. Moreover, legislative requirements have been discussed globally; for example, the European Union (EU) has adopted a climate change agenda to reduce GHG emissions by over 55% by 2030 compared to 1990 [14] and to become a net-zero GHG emission economy by 2050 [15]. In addition, the Government of Canada has set the target of reducing its emissions by 40-45% by 2030 and has committed to achieving net-zero emissions by 2050 to avert the worst effects of climate change [16].
Therefore, to satisfy those limits in CO 2 and achieve such high targets from legislative requirements, many worldwide researchers have proposed different vehicle emissions and consumption models. The systematic process for this literature review is to specify current approaches that have been used by various researchers, identify which models and methodologies have been used in each approach, before identifying the research gap.

Vehicle Emissions Estimation Models
A number of vehicle emissions estimation models have been introduced by different researchers in recent decades. Using look-up tables, a micro-scale model called CORSIM was built to estimate emissions based on dynamometer data. To ascertain the total emissions of each link, the CORSIM model applies default emission rates per second to each vehicle that travels on the given link, based on acceleration and speed [17]. EMIT is a model for estimating HC, CO2, CO, and NOx, which is built from dynamometer data of 344 light-duty vehicles and employs a regression equation with acceleration and speed [18]. At the project or regional level, a United States agency proposed a model called MOVES in 2010 for estimating greenhouse gas emissions as well as CO, VOCs, PM, and NOx generated from light-duty vehicles [19]. Features such as vehicle mass, total resistance force, velocity, acceleration, and driveline performance have been employed by Rakha and colleagues to build a model for estimating CO2 emissions using instantaneous vehicle power [20]. A function of acceleration and velocity observed from a dynamometer experiment has been applied in the INTEGRATION model to estimate emissions from measured fuel consumption. The model was further developed for the simulation and optimization of trip-based microscopic traffic [21]. Using 55 parameters, a model named CMEM was proposed by a group of researchers to estimate emissions for a wide range of light-duty vehicles. For dynamometer testing, this model uses emissions per second data of CO, CO2, NO, and HC, along with physical vehicle features (engine size, vehicle mass, and aerodynamic drag coefficient) and operating features (acceleration and speed) [22]. Another example of a data-intensive model is MEASURE, which was developed by the Georgia Institute of Technology.
It calculates the emissions of NOx, CO, and VOCs from vehicle operating modes, including acceleration, deceleration, cruise, and idling. However, CO 2 estimation is not included in this model, while it has over 30 features as its inputs [23]. Another well-known framework has been developed by the European Environment Agency (EEA) called COPERT, which became one of the standard methodologies for road transport emission inventories in EEA member countries [24]. It estimates primary air pollutants (CO, NOx, PM, VOC, SO 2 , NH 3 , heavy metals) and greenhouse gas emissions (CO 2 , N 2 O, CH 4 ) using functions of the mean traveling speed throughout a complete driving cycle [25]. However, the framework neglected other characteristics while estimating the emissions of a specific vehicle, such as engine size, cylinders, and engine model. Furthermore, some recent research authors have applied Machine Learning and Deep Learning methodologies for vehicle emission models. Toth-Nagy and colleagues, for instance, have proposed a model using the Artificial Neural Network to predict emissions of NOx and CO from heavy-duty vehicles. Though the outcome is positive, CO 2 has also not been included, and the model is appropriate for gasoline vehicles [26]. When testing on the real-world driving conditions of 70 diesel vehicles, a group of researchers implemented a machine learning model to make projections of emissions alongside the performance of vehicles. A look-up table, non-linear regression, and Neural Network Multilayer Perceptron models are consequently applied for instantaneous NOx predictions. Despite the model taking inputs of vehicle acceleration and speed, its outputs focus only on NOx estimation, and CO 2 remains excluded [27]. Qing et al. have built a model for estimating vehicle emission rates, including CO, CO 2 , HC, and NOx from vehicle idling by using Portable Emission Measurement System. 
The dataset is collected from actual driving tests; Boosted and Bagged Decision Trees are introduced as a reliable prediction model for vehicle emissions estimation [28]. It can be seen that the application of Machine Learning and Deep Learning techniques to predicting carbon dioxide emissions remains limited and needs further development, which is therefore the principal goal of this study.

Vehicle Consumption Estimation Models
On the other hand, some researchers have focused on the fuel consumption of vehicles rather than CO 2 emissions, as fuel consumption (and economic costs) seem to be more relevant to consumers in general. The vehicle fuel consumption models are classified into 2 categories: theoretical fuel consumption models and statistical fuel consumption models [29]. The theoretical fuel consumption model concentrates on the operation features of the vehicle, such as output power and engine parameters, while the statistical fuel consumption model converges the statistical attributes from vehicle activity and fuel consumption data, including acceleration and speed [30]. One of the fuel consumption models is based on a novel macroscopic model that considers trip time and intersection distance for prediction [31]. Using the distribution of Vehicle Specific Power, a fuel consumption prediction model is proposed by Qi et al., which comprises a fuel consumption model and traffic condition predictor to provide a real-time prediction. From this, an API is developed for fuel consumption estimation, using on-board diagnostic (OBD) data for verification, with a 20% forecasting error. By collecting driving behavior data from consumers' smartphones, a prediction model of fuel consumption is developed based on a backpropagation (BP) neural network, random forests, and support vector regression with a relative error of less than 10%. It is also found that the average acceleration and deceleration, acceleration time percentage, deceleration time percentage, and cruising time percentage are major indicators for fuel consumption estimation [10]. Furthermore, Tamer et al. has proposed an approach to estimate fuel consumption by onboard vehicle information system Onboard Diagnoses-II (OBD-II) using Support Vector Machine and Lagrange interpolation. The model successfully provided precise fuel consumption with a square root mean difference of 2.43 [32]. 
Applying a Machine Learning model, a neural-network-based fuel prediction model is presented by utilizing seven predictors obtained from road grade and vehicle speed. It could optimize fuel usage over the entire fleet, with a peak-to-peak error rate of less than 4% in both city and highway [11].
Furthermore, vehicle emission and consumption can be predicted based on one single model. For example, by using GPS Big Data, an N-Dimensional framework is proposed by a group of researchers for estimating and visualizing fuel consumption and emissions. They stated that analyzing GPS big data generated from vehicles can deliver practical insight on the quantity and distribution of energy use and emissions in real-world driving conditions (acceleration, idle, cruise, and deceleration). This model has claimed effectiveness by a prediction accuracy of 88.6% [8]. Additionally, several statistical models of vehicle emissions and fuel consumption, which are published by Alessandra et al., could be integrated to predict the spatial and temporal distribution of traffic emissions and fuel consumption [18].
Overall, it can be seen from the mentioned studies that numerous researchers have proposed different models for estimating carbon dioxide emissions and fuel consumption using micro-scale methodologies, or Machine Learning and Deep Learning. The common vehicle characteristics for building these models are engine size, vehicle mass, and aerodynamic drag coefficient; and standard operating features used are acceleration and speed. The research trend generally emphasizes improving estimation models, rather than analyzing different vehicle types and brands using vehicle measurements, making it a limited market analysis for users and manufacturers. As a result, for a better knowledge of the vehicle market and its environmental effects, a comparative view of different types of vehicles and their influence on the environment is significant. Based on these metric analyses, recommended prediction models should be built using selective vehicle features. This identified gap provides the basis for the aim and objectives of this research.

Macro Methodology
In this study, to conduct an analytical and predictive study for fuel consumption and carbon dioxide emissions of vehicles, the dataset used is collected by the Government of Canada. A data analytics life cycle has been adopted for this research. This life cycle is a standard for Data Science and Big Data Analytics purposes, adopted from EMC Education Services [33], and contains 6 phases, as indicated in Figure 1.
The first stage of this process is discovery, where the problem, context, hypothesis, and objectives that the data are used for are determined. The main goals of this study are to deliver a comparative view of fuel consumption and carbon dioxide emissions from different brands and vehicle models, to make evidence-based recommendations, and to construct a model to predict changes in future consumption and emission rates. The dataset used in this study is derived from the 'Fuel consumption rating' datasets from the Government of Canada, which contain fuel consumption ratings and measured CO2 emissions for 4974 samples of light-duty vehicles in Canada [34]. The data were originally gathered from vehicle manufacturers, who compile the fuel consumption and CO2 rating data using standardized, monitored laboratory testing and analytical procedures. A 5-cycle testing process is used by manufacturers to simulate common driving conditions and styles. The approach includes testing for city and highway driving, as well as driving in cold weather, using air conditioners, and driving at faster speeds with higher acceleration and braking [35]. Note that the CO2 and smog ratings given in the dataset were generated from the original ratings by manufacturers, not from vehicle testing. Consequently, the collected fuel consumption and CO2 emissions data from newly produced vehicles are used in this study for data analytics purposes.
In Phase 2 (Data Preparation), the dataset has been processed and compressed into one single spreadsheet. To scope down the research analysis, the data of 4974 light-duty vehicles collected annually from 2017 to 2021 are merged and aggregated, with several categories renamed, covering fuel consumption and carbon dioxide emissions by brand, vehicle model, vehicle class, cylinders, engine size, transmission, fuel type, smog rating, and fuel consumption in the city and on the highway.
Next, the dataset has been checked, and no issues or missing values were found. Subsequently, the dataset is cleaned to filter out any data that are not necessary for the analysis. For instance, one record is removed from the dataset because it is the only record with the brand name 'super' (it can be considered an erroneous record, since no brand carries that name), leading to a final dataset of 4973 records.
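The preparation steps described above (merging yearly tables, checking for missing values, and removing a suspicious one-off brand) can be sketched as follows. This is a minimal illustration only: the records, the field names `make` and `co2`, and the rule "flag brands that occur exactly once" are assumptions made for the example and are not taken from the study's code.

```python
from collections import Counter

# Hypothetical yearly extracts; field names and values are assumptions.
by_year = {
    2020: [{"make": "Honda", "co2": 180}, {"make": "Ford", "co2": 260}],
    2021: [{"make": "Honda", "co2": 175}, {"make": "super", "co2": 300},
           {"make": "Ford", "co2": 255}],
}

# Merge the yearly tables into one list, tagging each record with its model year.
merged = [dict(row, model_year=year) for year, rows in by_year.items() for row in rows]

# Check for missing values, then flag brands that occur only once for manual review.
missing = [row for row in merged if any(v is None for v in row.values())]
counts = Counter(row["make"] for row in merged)
singletons = {make for make, n in counts.items() if n == 1}

# Drop the flagged records (in practice each flagged brand should be inspected first).
cleaned = [row for row in merged if row["make"] not in singletons]
print(len(merged), len(cleaned), singletons)
```

In a real workflow the flagged brands would be reviewed by hand before removal, since a brand with a single legitimate vehicle would otherwise be lost.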
In Phase 3 and 4-Model Planning and Building, the dataset is analyzed and visualized by using four levels of data analytics methodology, including Descriptive Statistical Analysis, Inferential Statistical Analysis, Machine Learning, and Deep Learning methodology. Specific categories of all algorithms are discussed in the next Section 3.2. Finally, in Phases 5 and 6, relevant results on machine learning analytics and predictions are communicated and presented in detail in Sections 4 and 5 on Results and Discussion. Final reports, briefings, code snippets are also presented in the rest of this paper.

Micro Methodology
In this paper, the "micro methodology" term refers to the micro-level data analysis methodology. This includes data analysis methods that are critically discussed (supported by embedded citations) by the measurements/approaches/algorithms that will be employed. In particular, four levels of data analytics are applied, as listed below.

Level 1: Descriptive Statistics
This level comprises basic calculations of central tendency (mean, median, mode) and dispersion statistics (standard deviation, variance, range). A list of comparative statistics of fuel consumption and CO 2 emission has been presented for each brand, model, engine size, vehicle class, transmission and cylinder type, and fuel type, giving a comprehensive outlook of emissions and consumption of various vehicle types and brands. The changes of the patterns through the years are also indicated before progressing to time-series changes of the greenest and the least environmental-friendly vehicle brand.
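As a small illustration of the central tendency and dispersion measures listed above, the following sketch computes them with Python's standard `statistics` module. The values are hypothetical and are not taken from the dataset.

```python
import statistics

# Hypothetical total fuel consumption values (L/100 km); not taken from the dataset.
consumption = [8.0, 9.5, 9.5, 11.2, 12.4, 14.1]

summary = {
    "mean": statistics.mean(consumption),
    "median": statistics.median(consumption),
    "mode": statistics.mode(consumption),
    "stdev": statistics.stdev(consumption),        # sample standard deviation
    "variance": statistics.variance(consumption),  # sample variance
    "range": max(consumption) - min(consumption),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The same summary can be repeated per brand, model, or vehicle class by grouping the records first.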

Level 2: Inferential Statistics
The dataset is verified by different types of analytic testing for various purposes.
• t-test: conducted to compare the mean fuel consumption in the city and on the highway for the same vehicle;
• ANOVA: compares the means of total fuel consumption and carbon dioxide emissions for each vehicle class and fuel type over time to determine whether each fuel type (or vehicle class) is significantly different from the rest;
• Correlation: a heat map of correlation coefficients is shown to illustrate the direction and strength of the linear relationship between pairs of vehicle features. Moreover, the importance of features for predicting CO2 emissions and total fuel consumption has been compared, which is an important test before advancing to Levels 3 and 4;
• Chi-Square: two Chi-Square Goodness of Fit tests have been carried out to investigate whether there is a significant difference between the observed values (data in 2021) and the expected values (data from 2017 to 2020). Additionally, a series of Chi-Square Tests of Independence has been implemented to identify relationships between each pair of features, with the results presented in a heat map.
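To make the Goodness of Fit idea concrete, the sketch below computes a Chi-Square statistic by comparing one year's category counts against the proportions of pooled earlier years. The counts are hypothetical; only the fuel type codes (X, Z, D, E) follow the dataset's naming. Converting the statistic to a p-value would additionally require the Chi-Square distribution with the stated degrees of freedom, which is not done here.

```python
# Hypothetical vehicle counts per fuel type; not taken from the dataset.
observed = {"X": 480, "Z": 430, "D": 25, "E": 65}        # e.g. the most recent year
reference = {"X": 2000, "Z": 1700, "D": 120, "E": 180}   # e.g. pooled earlier years

n_obs = sum(observed.values())
n_ref = sum(reference.values())

# Scale the reference proportions to the observed sample size.
expected = {k: reference[k] / n_ref * n_obs for k in observed}

# Chi-Square statistic: sum over categories of (O - E)^2 / E.
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
dof = len(observed) - 1
print(f"chi-square = {chi2:.3f}, degrees of freedom = {dof}")
```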

Level 3: Machine Learning
In order to answer RQ3.1, input features from the dataset have been used to predict values in upcoming years:
• Time Series Regression: used because it can forecast a future response from the historical responses and the transfer of dynamics from related predictors. Different models are applied in this study, including persistence models (using walk-forward validation), autoregression models (using the autoregression function in statsmodels), and an optimized autoregression model (using walk-forward validation over time steps). These models are evaluated by the Root Mean Square Error (RMSE), which measures the differences between the predicted and observed values.
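The persistence baseline with walk-forward validation can be sketched in a few lines. The series values and the split point below are assumptions chosen for illustration; the autoregression variants mentioned above are not reproduced here.

```python
import math

# Hypothetical yearly averages (e.g. mean CO2 emissions in g/km); illustrative only.
series = [248.0, 250.5, 251.0, 253.2, 252.1, 254.4]

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

split = 3                        # first `split` points are history, the rest are tested
history = list(series[:split])
predictions = []
for observed in series[split:]:
    predictions.append(history[-1])   # persistence: forecast equals the last observed value
    history.append(observed)          # walk forward: reveal the true value before the next step

score = rmse(series[split:], predictions)
print(f"RMSE = {score:.3f}")
```

A more elaborate model is only worth keeping if its walk-forward RMSE is lower than that of this baseline.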
To define whether Machine Learning models can use vehicle specifications data to predict their fuel consumption and CO 2 emission (RQ3.2), different models are conducted in this study and classified into two groups: Machine Learning models to predict a variable from a variable; and models to predict a variable from multiple variables.
For building Machine Learning models to estimate a variable from a single variable, data on engine size, number of cylinders, and fuel consumption in the city and on the highway have been used to predict total fuel consumption and CO2 emissions. Moreover, total fuel consumption and CO2 emission data were used to predict each other. This research uses relevant methodologies to model relationships between those variables, which include:
• Linear Regression: using the sklearn model, with the dataset split into training and testing sets at an 80:20 ratio;
• Univariate Polynomial Regression: using the sklearn model with five different degrees (from Degree 1 to Degree 5).
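The single-feature workflow above (80:20 split, polynomial fits of degree 1 to 5, evaluation on held-out data) can be sketched as follows. NumPy's `polyfit` stands in here for the sklearn estimators used in the study, and the data are synthetic; the generating coefficients and noise level are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: engine size (L) -> CO2 emissions (g/km). Not the real dataset.
engine_size = rng.uniform(1.0, 8.0, size=200)
co2 = 120 + 45 * engine_size - 2.0 * engine_size**2 + rng.normal(0, 10, size=200)

# 80:20 train/test split on shuffled indices.
idx = rng.permutation(len(engine_size))
cut = int(0.8 * len(idx))
train, test = idx[:cut], idx[cut:]

def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Fit one polynomial per degree on the training set and score it on the test set.
scores = {}
for degree in range(1, 6):
    coeffs = np.polyfit(engine_size[train], co2[train], deg=degree)
    pred = np.polyval(coeffs, engine_size[test])
    scores[degree] = r_squared(co2[test], pred)

for degree, score in scores.items():
    print(f"degree {degree}: R^2 = {score:.3f}")
```

Degree 1 corresponds to ordinary linear regression, so the loop covers both bullet points.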
Regarding Machine Learning models used for estimating a variable from multiple variables, groups of data, including group A (model year, engine size, and cylinders) and group B (engine size and cylinders), have been used to predict total fuel consumption and CO2 emissions. Furthermore, data on fuel consumption in the city and on the highway were also used to estimate the total fuel consumption of vehicles. The applied models are listed as follows:
• Multiple Linear Regression;
• Multivariate Polynomial Regression.
These models are chosen because many variables can be used at the same time to examine the statistical significance of each variable and transform them into independent variables. These forms of regression models also support the prediction of the dependent (or target) variables for later analysis [36]. In this paper, the coefficient of determination (R squared) has been used to evaluate the above-mentioned models. The R squared value measures the proportion of the variance in the dependent variable that is explained by the independent variables. It typically ranges from 0 to 1, and the higher the R squared value, the better the model can be used for prediction.
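A multi-feature regression evaluated by R squared can be sketched with an ordinary least-squares fit. The example below uses NumPy's `lstsq` rather than the study's own tooling, and the two predictors (engine size and cylinders, as in group B) are generated synthetically; all coefficients and noise levels are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-ins for engine size (L) and cylinder count; not the real dataset.
engine_size = rng.uniform(1.0, 6.0, size=n)
cylinders = np.clip(np.round(engine_size * 1.5 + rng.normal(0, 1, size=n)), 3, 12)
fuel = 4.0 + 1.2 * engine_size + 0.3 * cylinders + rng.normal(0, 0.8, size=n)

# Design matrix with an intercept column: y ~ b0 + b1 * engine_size + b2 * cylinders.
X = np.column_stack([np.ones(n), engine_size, cylinders])
cut = int(0.8 * n)                      # rows are already in random order
beta, *_ = np.linalg.lstsq(X[:cut], fuel[:cut], rcond=None)

# R squared = 1 - SS_res / SS_tot, computed on the held-out 20%.
pred = X[cut:] @ beta
ss_res = np.sum((fuel[cut:] - pred) ** 2)
ss_tot = np.sum((fuel[cut:] - fuel[cut:].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"coefficients = {beta.round(3)}, R^2 = {r2:.3f}")
```

A multivariate polynomial variant would add squared and interaction columns to `X` before fitting.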

Level 4: Deep Learning
In addition, Convolutional Neural Network (CNN) is used in this study to predict a variable from multiple variables. Since CNN is normally used for image classification, to use CNN for regression problems, this research uses a one-dimensional convolutional network by reshaping input data. This enables the model to simulate numerical input data using learnable weights and biases [37].
The dataset has two dimensions, namely the number of rows and columns (i.e., 4973 rows and 3 columns). To reshape the data, a third dimension of size one has been added so that each row becomes a single input sequence (i.e., the shape becomes [4973, 3, 1]). Subsequently, the data are split into training and testing sets with an 80:20 ratio. The Keras Conv1D class is used to add a one-dimensional convolutional layer to the model. Flatten and Dense layers are then added, and the model is compiled with an optimizer. Finally, the trained model is used to predict the test data, and it is evaluated by the mean squared error (MSE) of the predicted results.
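The reshaping step and the effect of a one-dimensional convolution on the reshaped input can be illustrated without a deep learning framework. The sketch below is not the Keras model from the study: it uses random data with the stated shape and a hand-written forward pass, and the kernel size (2) and filter count (8) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic feature table with the shape quoted above: 4973 rows, 3 columns.
X = rng.normal(size=(4973, 3))
y = rng.normal(size=4973)

# Add a trailing channel axis so each row becomes a length-3 sequence with 1 channel.
X3d = X.reshape(X.shape[0], X.shape[1], 1)

# 80:20 split (no shuffling here, for brevity).
cut = int(0.8 * len(X3d))
X_train, X_test = X3d[:cut], X3d[cut:]
y_train, y_test = y[:cut], y[cut:]

def conv1d_valid(batch, kernels, bias):
    """'Valid' 1D convolution: batch (N, steps, C_in), kernels (K, C_in, C_out)."""
    k = kernels.shape[0]
    steps_out = batch.shape[1] - k + 1
    out = np.stack(
        [np.tensordot(batch[:, t:t + k, :], kernels, axes=([1, 2], [0, 1]))
         for t in range(steps_out)],
        axis=1,
    )
    return np.maximum(out + bias, 0.0)   # ReLU activation

kernels = rng.normal(size=(2, 1, 8))     # kernel size 2, 1 input channel, 8 filters (assumed)
features = conv1d_valid(X_train, kernels, np.zeros(8))
flat = features.reshape(len(features), -1)   # what a Flatten layer would pass on to Dense
print(X3d.shape, features.shape, flat.shape)
```

In a framework implementation the kernel weights would be learned during training rather than drawn at random.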

Results and Discussion
This section is structured based on the Micro Methodology mentioned in Section 3.2, and divided by four levels of data analytics.

Level 1: Descriptive Statistics
The general purpose of this Level 1 is to observe 4973 light-duty vehicles from 2017 to 2021 by their fuel consumption and carbon dioxide emissions from different brands, vehicle models, vehicle class, cylinders, engine size, transmission, fuel type, smog rating, and fuel consumption in a city and on a highway. Recall that the CO 2 and smog ratings in the dataset were calculated using manufacturer ratings rather than vehicle testing, and were ranked from worst (1) to best (10) with no unit.
Firstly, in order to address RQ1.1 (How do light-duty vehicles compare in terms of fuel consumption and carbon dioxide emission?), descriptive statistics for all numerical columns in the dataset have been computed to evaluate the data distribution. The purpose of descriptive statistics is to provide a statistical understanding of the dataset quality [36]. It can be seen from Table 1 that the average total fuel consumption is 10.86 L/100 km, of which 57.77% (12.36 L/100 km) comes from the city and 42.22% (9.04 L/100 km) from the highway. Additionally, the statistics show that the average CO2 emissions of all vehicles are 251.44 g/km, with a standard deviation of 58.85 g/km. On a scale from worst (1) to best (10), the average CO2 rating is 4.60, and the average smog rating is 4.63. Moreover, the dispersion statistics (standard deviation and variance) indicate that the spread of the values is reliable enough for prediction.
Regarding the fuel consumption and carbon dioxide emissions of different brands, their averages are given in Table 2. In this dataset, Ford accounts for the most vehicles (436), and Bugatti for the fewest (6). After the descriptive statistical analysis, a bar chart, presented in Figure 2, shows the average fuel consumption of different brands. It reveals that Honda has the lowest fuel consumption (8.03 L/100 km), while Bugatti has the highest (22.98 L/100 km). Moreover, from Figures 3 and 4, Honda seems to be the greenest brand, as it emits the least CO2 (187.58 g/km) and attains the highest CO2 rating (6.65), whereas Bugatti again performs poorly, with the highest CO2 emissions (538.83 g/km) and the worst CO2 rating (1.00).
Considering smog, Figure 5 shows that Volkswagen has the best average smog rating (6.45), while Bugatti has the worst (1.00), in addition to the highest fuel consumption and CO2 emissions. Regarding the fuel consumption and CO2 emissions of different models, Table 3 shows that the IONIQ BLUE model consumes and emits the least, and in contrast, the CHIRON PUR SPORT model consumes and emits the most.
Similarly, when considering fuel consumption and CO2 emissions, Tables 4-8 show that the Station wagon (Small) class, Engine Size 1.2 L, 3 Cylinders, Transmission Type AV1, and Fuel Type D (Diesel) consume the least fuel and emit the least CO2. Conversely, the Van (Passenger) class, Engine Size 8.0 L, 16 Cylinders, Transmission Type A6, and Fuel Type E (Ethanol E85) are the largest consumers and emitters. However, since the Volkswagen emissions scandal emerged, the negative image of diesel has intensified. The actual NO and PM emissions of diesel vehicles, according to recent researchers, are significantly greater than those reported. Because of carcinogenic compounds, diesel particle emissions are also a possible health danger [38]. Therefore, the conclusion that Ethanol E85 emits the most among the fuel types holds only within the scope of the data in this research.
Table 3. CO2 emissions (g/km) and total fuel consumption (L/100 km) of each model.
Table 4. CO2 emissions (g/km) and total fuel consumption (L/100 km) of each vehicle class.
Secondly, to answer RQ1.2 (How have patterns of consumption and emission of each vehicle type changed throughout the selected period?), descriptive statistics have been computed for total CO2 emissions and fuel consumption over the period from 2017 to 2021, as summarized in Table 9.

It can be seen from Table 9 that the total fuel consumption gradually increases from 2017 to 2020, before a significant drop in 2021. However, the peak in 2020 does not appear in the CO2 emissions, whose value rises steadily over the entire period. From Table 10, a similar pattern of gradual increase from 2017 to 2020 followed by a significant drop can be seen in engine size, cylinders, fuel consumption in the city, and total fuel consumption. Highway fuel consumption, total fuel consumption in mpg, and CO2 emissions rise continuously over the years, which could explain the gradual decrease in CO2 rating during the period. Finally, the smog rating drops dramatically in 2018, before growing continuously until 2021.
In this research, Honda emerges as the greenest brand, so it is essential to analyze its pattern of consumption and emission through the years. From Figure 6, in 2018, Honda seems to have optimized the fuel consumption and carbon dioxide emissions of its products. Although the data for 2019 and 2020 show a slight increase, the values drop dramatically again in 2021. Applying the same analysis to Bugatti, the brand with the poorest environmental performance in this dataset, shows no evidence of optimization of its products' consumption and emissions: total fuel consumption and CO2 emissions grow significantly, as shown in Figure 7.
Considering the fuel consumption of each fuel type over the years, it can be seen from Figure 8 that Fuel Type E (Ethanol E85) and Z (Premium gasoline) always consume more than Fuel Type X (Regular gasoline) and D (Diesel). Over the period, Fuel Type D (Diesel), E (Ethanol E85), and Z (Premium gasoline) have all increased their consumption, whereas Fuel Type X (Regular gasoline) shows a slight decrease and thus has the lowest fuel usage in 2021.

t-Test
To address RQ2.1 (Is there any particular distribution for fuel consumption in the city and the highway of vehicles in Canada?), a two-tailed T-test has been conducted to compare the means of fuel consumption in the city and on the highway for the same vehicle, with the following configurations. It is clear that: Therefore, the null hypothesis can be rejected. This means the mean of fuel consumption in a city and on a highway for the same individual has a significant difference.
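Because city and highway consumption are measured on the same vehicle, the comparison is a paired one. The sketch below computes the paired t statistic by hand on hypothetical values (not taken from the dataset); obtaining the two-tailed p-value would additionally require the t distribution with n - 1 degrees of freedom, for example through a statistics library.

```python
import math
import statistics

# Hypothetical city / highway consumption (L/100 km) for the same six vehicles.
city    = [12.1, 10.4, 14.8, 9.6, 11.9, 13.2]
highway = [ 8.9,  7.8, 10.5, 7.4,  8.8,  9.9]

# Work on the per-vehicle differences, since the two samples are paired.
diffs = [c - h for c, h in zip(city, highway)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)            # sample standard deviation of the differences

# Paired t statistic with n - 1 degrees of freedom.
t_stat = mean_diff / (sd_diff / math.sqrt(n))
dof = n - 1
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

For 5 degrees of freedom the two-tailed critical value at the 5% level is about 2.571, so a statistic well above that would lead to rejecting the null hypothesis of equal means.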

ANOVA
To answer RQ2.2 (Is there a notable difference in the performance of one specific fuel type (or vehicle type) in comparison to the rest of the vehicle types in Canada?), a one-way ANOVA one-tailed test was implemented to compare the means of each vehicle class in terms of total fuel consumption, using the following assumptions.

• The samples are independent;
• Each sample comes from a population that is normally distributed;
• The group population standard deviations are all equal (homoscedasticity).
Firstly, the mean total fuel consumption of each class over the years is calculated using descriptive statistics, as shown in Figure 9.
Based on the test result, the null hypothesis can be rejected, meaning that the mean total fuel consumption of at least one vehicle class is significantly different from the rest.
Similarly, using the same assumptions, hypotheses, and confidence level, one-way ANOVA one-tailed tests were conducted on the CO2 emissions and fuel consumption of each vehicle class and each fuel type (Figures 10 and 11, respectively). The results are presented below.
CO2 emissions of each vehicle class over time: $p\text{-value} = 6.81894 \times 10^{-27} < \alpha = 0.01$ (3). Consequently, the null hypothesis can be rejected, meaning that the mean CO2 emissions of at least one vehicle class are significantly different from the rest.
Total fuel consumption of each fuel type over time: the null hypothesis can be rejected, meaning that the mean total fuel consumption of at least one fuel type is significantly different from the rest.
CO2 emissions of each fuel type over time: $p\text{-value} = 5.5127 \times 10^{-5} < \alpha = 0.01$ (5). From that comparison, the null hypothesis can be rejected, meaning that the mean CO2 emissions of at least one fuel type are significantly different from the rest.
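As an illustration of the one-way ANOVA tests above, the following minimal sketch computes the F statistic as the ratio of between-group to within-group variance. The function name, the group labels, and all values are hypothetical; the p-value itself would come from the F distribution with the returned degrees of freedom.

```python
import statistics


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical total fuel consumption (L/100 km) for three vehicle classes.
compact = [7.5, 8.0, 8.5]
suv = [10.5, 11.0, 11.5]
pickup = [13.0, 13.5, 14.0]
f_stat, df_b, df_w = one_way_anova_f([compact, suv, pickup])
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```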

Correlation
To quantify the strength of the relationship between two features in the dataset and address RQ2.3 (How do the brand, model, vehicle class, cylinders, engine size, transmission type, and fuel type correlate with the emissions and consumption of various vehicles?), correlation coefficients were computed. The most commonly used measure of this type in statistics is the Pearson correlation, which estimates the direction and strength of a linear relationship between two variables [39]. In this study, the objective is to determine which parameter has the strongest correlation with total fuel consumption and CO2 emissions. To achieve this, Pearson's correlation coefficients were computed between all features across all vehicles and are presented in the correlation heat map shown in Figure 12.
The heat map in Figure 12 shows the correlation coefficient between each parameter on the left and each parameter at the bottom. The higher the correlation coefficient, the warmer the color.
Moreover, Figures 13 and 14 use bar charts to reveal the importance of all features in estimating total fuel consumption and CO2 emissions. Figures 13 and 14 show that, besides highway and city fuel consumption (the two most important features), engine size gives the highest correlation for estimating total fuel consumption, whereas cylinders, year, and smog rating are nearly half as important as engine size. For estimating carbon dioxide emissions, engine size, year, and smog rating are important features. This finding informs the construction of the Machine Learning and Deep Learning models presented in Levels 3 and 4.
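The Pearson coefficient used for the heat map can be sketched as follows. This is a minimal illustration with hypothetical engine-size and consumption values, not the study's implementation; the coefficient is the covariance of the two variables divided by the product of their standard deviations.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical engine sizes (L) and total fuel consumption (L/100 km).
engine_size = [1.5, 2.0, 2.5, 3.5, 5.0]
consumption = [7.0, 8.1, 9.0, 11.2, 14.5]
r = pearson_r(engine_size, consumption)
print(f"r = {r:.3f}")
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.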

Chi-Square
Chi-Square is a non-parametric test that comes in two types: Chi-Square Goodness of Fit and Chi-Square of Independence. The purpose of Chi-Square Goodness of Fit is to compare the observed and expected values of one categorical variable. Meanwhile, Chi-Square of Independence, also known as the Chi-Square Test of Association, determines whether categorical variables are related or independent [40].
To implement the Chi-Square Goodness of Fit test, the data from 2017 to 2020 are used as the expected values and the data from 2021 as the observed values, in order to test whether there is a significant difference between them. First, the Chi-Square Goodness of Fit Test is applied to compare the Total Fuel Consumption by Vehicle Class between expected (from 2017 to 2020) and observed (2021) values using a confidence level of 98% (α = 0.02).
Based on the result, the null hypothesis cannot be rejected, meaning that there is no significant difference between the observed and expected values.
A similar Chi-Square Goodness of Fit Test is conducted to compare Total Fuel Consumption by Fuel Type between expected (from 2017 to 2020) and observed (2021) values.
In this case, the null hypothesis can be rejected, meaning that there is a significant difference between the observed and expected values.
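The goodness-of-fit comparison above can be illustrated with a short sketch of the test statistic. The function name and the per-class values are hypothetical; the statistic would be compared against the Chi-Square distribution with the returned degrees of freedom at α = 0.02.

```python
def chi_square_gof(observed, expected):
    """Return (statistic, degrees_of_freedom) for a Chi-Square goodness-of-fit test."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


# Hypothetical values per vehicle class: expected (2017 to 2020) vs. observed (2021).
expected = [10.0, 11.0, 13.0, 9.5]
observed = [10.2, 11.5, 12.8, 9.1]
stat, dof = chi_square_gof(observed, expected)
print(f"chi-square = {stat:.4f}, degrees of freedom = {dof}")
```

A small statistic means the observed values stay close to the expected ones, in which case the null hypothesis is not rejected.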
Next, to address RQ2.4 (What are the relationships between all features to each other of the entire dataset?), the Chi-Square of Independence Test was conducted to ascertain whether there is a relationship between fuel type and CO2 rating.
With the chosen confidence level of 98%, the null hypothesis is rejected, indicating that there is a relationship between fuel type and CO2 rating.
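The independence test can be sketched on a contingency table of counts, where the expected count of each cell is derived from the row and column totals. The function name and the 2 × 2 table are hypothetical and do not reproduce the fuel type and CO2 rating counts of the dataset.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows are two fuel types, columns are low/high CO2 rating.
counts = [[10, 20], [20, 10]]
stat, dof = chi_square_independence(counts)
print(f"chi-square = {stat:.3f}, degrees of freedom = {dof}")
```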
A chain of similar Chi-Square of Independence tests was also carried out to identify relationships among all features, and the results are presented in the heat map shown in Figure 15. In the heat map, each cell is marked 1 if there is a relationship between the corresponding parameter on the left and the corresponding parameter at the bottom, and 0 if there is none. It reveals that there is some form of relationship among almost all features, except that year has no relationship with model, cylinders, or total fuel consumption (mpg). From this test, it is concluded that all the features from the chosen dataset can be used for the prediction models proposed in Levels 3 and 4, and that year can be used as a time index for the estimation.
Firstly, all the input features from the dataset are used to calculate their mean values over time, as shown in Figure 16. Secondly, using the correlation results from Section 4.2.3, this study builds the following models to predict the fuel consumption (in the city, on the highway, and in total) and CO2 emissions of an average vehicle in Canada over the four upcoming years.
The prediction results of these models are presented in Figure 17 and Table 11. Table 11 shows that the autoregression model always has the highest RMSE, the optimized autoregression model has lower values, and the persistence model has the lowest values. The persistence model predicts that total fuel consumption and CO2 emissions will increase over the next four years. However, fuel consumption in the city is projected to decline, while highway fuel consumption is expected to grow steadily.
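To make the RMSE comparison concrete, the sketch below evaluates a one-step-ahead persistence baseline, which predicts each value as the previous observation. This is an assumption about how such a baseline is commonly built, not a description of the study's exact setup, and the yearly means are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def persistence_forecast(series):
    """One-step-ahead persistence: the forecast for step t is the value at step t - 1."""
    return series[:-1]


# Hypothetical yearly mean total fuel consumption (L/100 km), 2017 to 2021.
yearly_mean = [10.9, 11.0, 11.1, 11.2, 10.7]
predicted = persistence_forecast(yearly_mean)
actual = yearly_mean[1:]
error = rmse(actual, predicted)
print(f"persistence RMSE = {error:.4f}")
```

The same `rmse` function can score any other forecaster on the same held-out years, which is how the models in a table of this kind are typically ranked.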
The remaining Machine Learning models below were constructed to answer RQ3.2 (Is it possible to build Machine Learning models that use vehicle specifications data to predict their fuel consumption and carbon dioxide emission?).

Linear Regression and Univariate Polynomial Regression
These methods have been applied to build models that predict total CO2 emissions and fuel consumption of vehicles from a single input (engine size, the number of cylinders, etc.), and the results are presented in Table 12 and Figure 18.
The coefficient of determination (R squared) typically ranges from 0 to 1, from worst to perfect prediction. Table 12 shows that the Univariate Polynomial Regression model of degree 5 achieves the highest R squared in 7 out of 10 scenarios. Linear Regression attains almost the same R squared values, with only insignificant differences, and obtains the highest value in the remaining 3 out of 10 scenarios.
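A minimal sketch of the single-input case is given below: an ordinary least-squares line is fitted and scored with the coefficient of determination. Only the linear case is shown; a polynomial model would add powers of the input as extra terms. The function names and the engine-size and CO2 values are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical engine sizes (L) and CO2 emissions (g/km).
engine_size = [1.5, 2.0, 2.5, 3.5, 5.0]
co2 = [160, 185, 210, 255, 330]
slope, intercept = fit_line(engine_size, co2)
predicted = [slope * s + intercept for s in engine_size]
score = r_squared(co2, predicted)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R squared = {score:.4f}")
```

A score of 1 means the predictions match the data exactly, while predictions that fit worse than simply using the mean of the data yield a negative score.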

Multiple Linear Regression, Logarithmic Regression, Multivariate Polynomial Regression, Transformation of Data, and Exponential Regression
These models are selected to estimate total CO2 emissions and fuel consumption of vehicles from multiple inputs, and the results are presented in Table 13. Table 13 shows that in 3 out of 5 cases, the Multiple Linear Regression model has the largest coefficient of determination (R squared). The Multivariate Polynomial Regression comes close to the same R squared values, differing only insignificantly, and achieves the best score in the other 2 out of 5 scenarios (at degrees 2 and 5). On the other hand, the Logarithmic Regression with Log Transformation model receives lower determination scores in all scenarios. Notably, the Logarithmic Regression with Exponential Transformation model generates negative R squared values in all cases, implying that the model fits the data worse than a horizontal line at the mean of the observed values.
In this subsection, different Machine Learning models are applied to estimate fuel consumption and carbon dioxide emissions from vehicle specifications data. Linear Regression, Multiple Linear Regression, Univariate Polynomial Regression, and Multivariate Polynomial Regression all show strong potential in this field, which answers research question RQ3.2.
A Convolutional Neural Network (CNN) [41,42] has been employed in this study to estimate the total CO2 emissions and fuel consumption of vehicles from multiple inputs. CNN is a form of deep neural network that is often used to analyze visual imagery [37,43]. The deep learning model has been built using Google Colab, and the results are presented in Figure 19 and Table 14.
Compared with Table 13, although the CNN model has yet to reach the highest R squared score in any case, its predictions are stable. Moreover, Figure 19 demonstrates that the CNN model can predict with high accuracy.

Recommendations
Through a series of rigorous data analyses, the study has presented the current trends and a comparative analysis of fuel consumption and carbon dioxide emissions across different brands and vehicle features.
A list of recommendations for customers who currently wish to buy new vehicles is as follows:
• From the findings of our in-depth statistics and the analysis of different Machine Learning and Deep Learning models, there are several evidence-based recommendations. First, it is possible to use engine size and the number of cylinders to estimate the CO2 emissions and fuel consumption of future vehicle designs, with a relatively high determination coefficient of around 70%. Moreover, fuel consumption and CO2 emission data can be used to predict each other with very high accuracy in most cases, up to 91.22%. Second, different Machine Learning models, including Linear Regression, Multiple Linear Regression, Univariate Polynomial Regression, and Multivariate Polynomial Regression, have the potential to predict the CO2 emissions and fuel consumption of light-duty vehicles. However, it is suggested to apply a Convolutional Neural Network for the prediction, as it has been shown to predict stably with a relatively high accuracy of around 70%. Prediction results from the Machine Learning and Deep Learning models in this paper can serve as an index and a reference for relevant predictors, which different stakeholders can use in their upcoming actions. Moreover, the models can be applied to other air pollutants in vehicle exhaust, including CO, NOx, SO2, and PM.

Conclusions and Future Work
In this research, an observational and predictive analysis has been performed using data from the Government of Canada, which includes 4973 light-duty vehicles observed between 2017 and 2021, to provide a comparative view of various brands and vehicle types in terms of fuel consumption and CO2 emissions before making applicable recommendations. Building on significant efforts made in the past [10,19,27], this research analyzes different vehicle types and brands using vehicle measurements, providing a deeper understanding of the vehicle market and its environmental effects. The proposed vehicle features and recommended prediction models in this study can be further used as a reference for vehicle manufacturers and users to take relevant actions to reduce their environmental impacts.
Using descriptive and inferential statistical methods, it is observed that the average total fuel consumption of light-duty vehicles is 10.86 L/100 km, and the average CO2 emission is 251.44 g/km. Different brands and vehicle features have been included in a rigorous and comprehensive analysis. Based on the findings, relevant recommendations have been made. Over the study period, some vehicle brands have been working towards optimizing their products with environmental awareness (such as Honda), while others have been doing the opposite (including Bugatti).
Moreover, different machine learning and deep learning models have been built throughout this study for fuel consumption and CO2 emission prediction. Firstly, this study reveals that the Persistence model has outperformed the autoregression and optimized autoregression models for predictions from one input variable with vector autoregression. Additionally, the Univariate Polynomial Regression model (degree 5) attains a higher coefficient of determination than the same model with lower degrees and the Linear Regression model. Secondly, for estimating total fuel consumption and CO2 emissions of vehicles from multiple inputs, the Multiple Linear Regression and Multivariate Polynomial Regression have been demonstrated to be the best models, compared to Logarithmic Regression (with Log and Exponential Transformation). Finally, it should be noted that the Convolutional Neural Network is also promising for prediction in this field, with stable and high coverage of correctly predicted values.
Future research may be geared towards developing higher performance models for predicting fuel consumption and CO2 emissions. Moreover, a larger dataset with more vehicle features should be studied for building a predictive model in vehicle design. Based on that, APIs and applications can be designed and constructed for predictions. Finally, vehicle consumers and producers can adopt the recommendations from the findings of this study to design and implement appropriate action plans for reducing their environmental impacts.

Data Availability Statement:
The data analyzed in this paper can be found at https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64 (accessed on 30 November 2021).

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: