1. Introduction
The transportation system is a cornerstone of economic growth and societal development, enabling the movement of people, goods, and services across regions. However, it also significantly contributes to global energy consumption and greenhouse gas (GHG) emissions, posing serious environmental and public health challenges. In the European Union, road transport accounted for 72% of transportation-related GHG emissions in 2017, with passenger cars alone contributing 44% [1]. Similarly, in Beijing, passenger cars emitted 11.3 million tons of carbon dioxide (CO2) in 2012, constituting 75.5% of the city’s total CO2 emissions [2]. These figures underscore the urgent need for accurate predictive modeling of fuel consumption and vehicle emissions to inform policy and mitigate environmental impacts.
Vehicle emissions are influenced by a variety of factors, including vehicle type, engine characteristics, operational parameters (e.g., speed, acceleration, and driving patterns), and environmental conditions. Previous studies have demonstrated that vehicle speed, particularly instantaneous speed, plays a pivotal role in emissions modeling, as it captures dynamic changes in driving behavior [3]. Fuel type and properties, such as grade and chemical composition, also significantly affect emissions. While advancements in engine technology have led to more efficient engines, including diesel engines with reduced well-to-wheels emissions [4], the transportation sector remains a major contributor to poor air quality in urban areas.
Traditionally, emissions and fuel consumption have been modeled using empirical, statistical, and analytical methods. However, these approaches often face limitations in handling complex, multidimensional datasets. Machine learning (ML) techniques have emerged as a powerful alternative, offering the ability to model nonlinear relationships and analyze large datasets with high accuracy. ML approaches, such as artificial neural networks (ANN), support vector machines (SVM), and ensemble methods, have been successfully applied to emissions modeling. Recent studies have demonstrated their capability to predict various exhaust emissions, with strong correlations between predicted and measured values [5, 6].
Despite these advancements, several challenges remain. First, while most studies focus on individual engine types (either gasoline or diesel), the potential of mixed datasets to improve model generalizability is underexplored. Second, the reliance on vehicle speed as a primary input for predictive modeling requires extensive data collection infrastructure, which can be logistically complex and costly. GPS speed data offer a promising alternative, as they are widely accessible and require minimal hardware modifications. However, their reliability and accuracy in predictive modeling, particularly for mixed gasoline and diesel datasets, have not been fully evaluated.
This research aims to address these gaps by developing machine learning models that predict fuel consumption and emissions using vehicle speed and GPS speed data. Ensemble bagged and decision tree algorithms are employed to evaluate model performance, leveraging data collected from gasoline and diesel engine vehicles under real-world driving conditions. By comparing mixed and individual datasets, this study investigates the impact of dataset composition on predictive accuracy. The results provide actionable insights into the feasibility of integrating GPS-based models into intelligent transportation systems (ITS) for real-time monitoring and management of environmental impacts.
The findings of this research contribute to the ongoing effort to develop sustainable transportation solutions by enhancing the accuracy and applicability of emissions and fuel consumption models. Furthermore, the study underscores the importance of leveraging emerging technologies, such as machine learning and GPS data, to address critical challenges in urban transportation and environmental sustainability.
2. Literature Review
The accurate modeling of vehicle fuel consumption and emissions has been a major focus of research due to the transportation sector’s substantial contribution to greenhouse gas emissions and air pollution. Numerous studies have explored various modeling techniques and influencing factors to enhance prediction accuracy and support policymaking.
Earlier studies laid the foundations for current research on fuel consumption and emissions prediction. Vehicle emissions are influenced by several factors, including vehicle physical properties, operational characteristics, environmental conditions, and fuel properties. Sullivan et al. (2004) demonstrated that diesel engines generally produce lower emissions than gasoline engines on a well-to-wheels basis [4]. Meanwhile, studies such as those by Clark et al. (2011) and Zhang et al. (2013) have categorized these influencing factors, highlighting vehicle speed and acceleration as critical parameters for emissions modeling [7, 8]. Speed in particular has been extensively researched, with both instantaneous and averaged measurements proving vital for accurate predictions [3]. These early studies established the basic principles, methodologies, and models used in emission estimation by relating vehicle operating conditions, such as speed and driving behavior, to emissions, providing foundational insights for later research.
The evolution of emission modeling methodologies has seen the development of various approaches. Faris et al. (2011) and Wang and McGlinchy (2009) grouped emissions and fuel consumption models into five categories: scale input models (microscopic, macroscopic, mesoscopic), formulation models (analytical, empirical, statistical, graphical), dimension models, explanatory variable models, and state variable value models [9, 10]. Among these, microscopic models have been favored for their ability to capture real-time variations in vehicle behavior, but they often require substantial computational resources. These studies established key models and frameworks that continue to influence current methodologies.
Recent machine learning (ML) implementations, together with the use of GPS data, have improved the predictive accuracy of models for vehicle emissions, especially NOx and CO2, and for fuel consumption. These advances build on the earlier research findings described above. ML approaches have gained prominence in recent years for their ability to process large and complex datasets, offering significant advantages over traditional methods [11, 12, 13]. Techniques such as artificial neural networks (ANN), support vector machines (SVM), and ensemble methods have shown high predictive accuracy for vehicle emissions and fuel consumption [14]. Deep learning techniques have also been introduced to the field, though their application remains limited. Altug and Kucuk (2019) and Shin et al. (2021) explored deep learning for predicting tailpipe emissions and transient engine emissions, respectively [15, 16]. These approaches demonstrate potential for capturing complex relationships in emissions data but require extensive computational resources and large datasets.
Researchers in Ecuador developed a new approach to evaluating pollutant emissions from light vehicles [17]. They trained an artificial neural network on GPS-derived speed and slope information, together with vehicle-specific data (mass and engine capacity), to forecast CO2, CO, HC, and NOx emissions [17]. The models reached R2 values of 0.735 for CO2 predictions and 0.798 for NOx predictions, indicating that GPS data and ML techniques can be combined effectively for emissions estimation [17]. Liu et al. (2023) integrated the Motor Vehicle Emission Simulator (MOVES) model with ML methods to predict emissions, including NO, achieving strong correlations between predicted and measured values [6]. Ahmed et al. (2021) applied ANN to model the emissions of spark ignition engines using methanol-gasoline blends, reporting an accuracy of 99% [5]. These studies underscore the adaptability and reliability of ML approaches across various contexts.
More recent advancements have focused on real-world driving data and the integration of portable emission measurement systems (PEMS) for accurate data collection [18, 19]. One study developed deep neural network (DNN) models of light-duty vehicle emissions and fuel consumption [20]. PEMS were used for real-time data collection, recording speed, engine control parameters, and GPS data [20]. The DNNs captured complex nonlinear relationships between driving patterns and emission outputs, which improved accuracy [20]. Lee et al. (2021) used ANN to predict diesel vehicle emissions under real-world conditions, incorporating diverse input parameters such as vehicle-specific power, ambient temperature, and exhaust gas recirculation rates [21]. Cha et al. (2021) utilized regression models to predict emissions from diesel vehicles and found that instantaneous speed and acceleration were among the most influential predictors [22]. Another study estimated real-driving CO2 and NOx emissions and fuel consumption using ML methods [23]. Its models, trained on measured real-world driving data, produced reliable estimates of emissions and fuel use [23], demonstrating that ML methods can handle complex data and improve emission estimation accuracy under actual driving conditions [23].
Using gradient boosting regression (GBR) models, Wen et al. (2021) demonstrated the capability of ML to predict NOx and CO2 emissions and fuel consumption for diesel vehicles operating under different conditions [24]. The models were trained on PEMS data that included features such as mass air flow rate and exhaust flow rate [24], and they reached R2 values of up to 0.99 across various driving situations [24]. Alfaseeh et al. (2020) investigated how deep sequence learning models, specifically long short-term memory (LSTM) networks, can forecast greenhouse gas emissions on road networks [25]. Given time-series inputs such as speed and traffic density, these models forecast CO2-equivalent emissions at high temporal resolution [25], which supports eco-routing strategies and real-time emissions monitoring. Although ANN and SVM are commonly used, ensemble methods such as bagged and boosted trees have emerged as powerful tools for modeling emissions. Kocev et al. (2007) demonstrated the effectiveness of ensemble learning techniques in handling complex, high-dimensional datasets [26].
ML algorithms enable rapid vehicle-specific emissions evaluation when used with on-road remote sensing data. Xia et al. (2021) used an ensemble model that combines neural networks, extreme gradient boosting, and random forests to assess CO, HC, and NOx emissions from light-duty gasoline vehicles [27]. This methodology makes on-road supervision of vehicle emissions possible and offers a practical tool for policymakers and environmental agencies [27].
The use of GPS data for real-time speed tracking and the integration of vehicle-specific data have added new dimensions to prediction models. Furthermore, increases in computing power and the development of ML techniques have enabled researchers to handle large datasets more effectively and to generate more accurate, real-time emission predictions. GPS speed data have recently gained attention as an alternative to vehicle speed data in emissions modeling. Akkamis et al. (2021) highlighted the accuracy of GPS speed sensors, particularly in high-acceleration scenarios [28]. However, the reliability of GPS data for predictive modeling in mixed vehicle datasets is not well established, creating an opportunity for further exploration.
Foundational research from the last twenty years remains essential to modern investigations. Early studies established speed and acceleration as significant variables in emissions models, and these variables are now combined with advanced ML methods and GPS data. Sophisticated ML techniques thus build on traditional empirical models, which remain relevant to current efforts to predict emissions accurately. While significant progress has been made, gaps remain in the application of ML techniques to mixed gasoline and diesel datasets and the use of alternative input parameters such as GPS speed data. This study aims to address these gaps by leveraging ensemble bagged models and decision tree algorithms to evaluate the predictive accuracy of GPS-based models for fuel consumption and emissions. By combining datasets from gasoline and diesel engines, this research seeks to develop generalized models that enhance real-world applicability and provide actionable insights for sustainable transportation management.
4. Results
GPS speeds are commonly used in transportation research. GPS speed sensors are typically used to replace radar speed sensors since they deliver speed data in pulse signals [28]. According to Akkamis et al. (2021), GPS speed sensors with higher update frequencies give better accuracy, especially under higher acceleration [28]. One of the main contributions of this research is to investigate the efficiency of using GPS speed data in an ML model to predict fuel consumption and vehicle emissions. The percentage error of the GPS speed measurements was 1.4% when compared to vehicle speed data. We investigated how accurately ML can use these low-error GPS speed data to predict fuel consumption and emission factors. The compiled engine type dataset was used, and actual vehicle speed was compared with GPS speed. Moreover, this section discusses the difference between gasoline and diesel engines to explain the performance of the ML model when applied to these two engine types. Finally, the compiled diesel and gasoline dataset was compared with the separate datasets to quantify the difference between real-world conditions (a mix of vehicle engine types) and ideal conditions (one engine type).
In the research database (diesel and gasoline engines), the mean vehicle speed was 46.5 mph and the mean GPS speed was 47.2 mph. Across all monitored runs, the mean fuel rate was 0.0003 gal/s, and the mean emission factors were 2.93 g/s for CO2 and 0.00029 g/s for NOx.
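As a hedged illustration of how a percentage error between GPS speed and vehicle speed can be obtained, the sketch below shows two common definitions in Python. The source does not state which definition underlies the reported 1.4%, so the function names, the choice of vehicle speed as the reference, and the sample values are assumptions made for illustration only; they are not the study's data or procedure.

```python
def mean_abs_pct_error(reference, measured):
    """Mean of per-sample absolute percentage errors.

    Samples whose reference value is zero are skipped to avoid division by zero.
    """
    pairs = [(r, m) for r, m in zip(reference, measured) if r != 0]
    if not pairs:
        raise ValueError("need at least one non-zero reference value")
    return 100.0 * sum(abs(m - r) / abs(r) for r, m in pairs) / len(pairs)


def pct_error_of_means(reference, measured):
    """Percentage difference between the mean values of the two series."""
    ref_mean = sum(reference) / len(reference)
    meas_mean = sum(measured) / len(measured)
    return 100.0 * abs(meas_mean - ref_mean) / abs(ref_mean)


# Hypothetical speed samples in mph (assumed values, not the study's data).
vehicle_speed = [30.0, 45.0, 60.0]
gps_speed = [30.6, 45.9, 60.6]

per_sample = mean_abs_pct_error(vehicle_speed, gps_speed)
of_means = pct_error_of_means(vehicle_speed, gps_speed)
print(f"per-sample error: {per_sample:.2f}%, error of means: {of_means:.2f}%")
```

Applied to the two mean speeds reported above (46.5 and 47.2 mph), the second definition gives roughly 1.5%. The two definitions generally differ, which is why it matters which one is reported.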
Ensemble and decision tree models were used to predict the fuel rate, which measures instantaneous fuel consumption, and two exhaust pollutants: CO2 and NOx. The model results are presented in Table 2, and the regression scatter plot showing the relationship between predicted and measured values is shown in Figure 2. The results are presented in two categories. The first is an evaluation of modeling fuel consumption using the mixed diesel and gasoline dataset for the vehicle and GPS speed input classes. The second category evaluates the performance of the ensemble bagged algorithm on the single diesel and gasoline datasets for the vehicle and GPS speed input classes.
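To make the comparison between the two model families concrete, the following is a minimal, self-contained Python sketch of a single regression tree and a bagged ensemble of such trees, each using one speed input. It is not the study's implementation: the tree-growing rule, `min_leaf`, the number of trees, and the synthetic speed-to-fuel-rate data are all assumptions chosen only to illustrate the technique.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def _sse(values):
    """Sum of squared deviations from the mean."""
    m = _mean(values)
    return sum((v - m) ** 2 for v in values)


def _grow(pairs, min_leaf):
    """Recursively split (speed, target) pairs that are already sorted by speed."""
    targets = [y for _, y in pairs]
    best_cost, best_i = None, None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical speeds
        cost = _sse(targets[:i]) + _sse(targets[i:])
        if best_cost is None or cost < best_cost:
            best_cost, best_i = cost, i
    if best_i is None:
        return _mean(targets)  # leaf: predict the mean target
    threshold = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2
    return (threshold, _grow(pairs[:best_i], min_leaf), _grow(pairs[best_i:], min_leaf))


def fit_tree(speeds, targets, min_leaf=5):
    return _grow(sorted(zip(speeds, targets)), min_leaf)


def predict_tree(node, speed):
    while isinstance(node, tuple):  # internal nodes are (threshold, left, right)
        threshold, left, right = node
        node = left if speed <= threshold else right
    return node


def fit_bagged(speeds, targets, n_trees=15, min_leaf=5, seed=0):
    """Bagging: fit each tree on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(speeds)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_tree([speeds[i] for i in idx], [targets[i] for i in idx], min_leaf))
    return trees


def predict_bagged(trees, speed):
    """Average the predictions of all trees in the ensemble."""
    return _mean([predict_tree(t, speed) for t in trees])


def r2_score(measured, predicted):
    m = _mean(measured)
    ss_res = sum((a - p) ** 2 for a, p in zip(measured, predicted))
    ss_tot = sum((a - m) ** 2 for a in measured)
    return 1.0 - ss_res / ss_tot


# Synthetic data (NOT the study's data): fuel rate rising nonlinearly with speed, plus noise.
data_rng = random.Random(42)
speeds = [data_rng.uniform(0.0, 80.0) for _ in range(180)]
fuel = [1e-4 + 5e-6 * s + 3e-8 * s * s + data_rng.gauss(0.0, 2e-5) for s in speeds]
train_x, train_y = speeds[:120], fuel[:120]
test_x, test_y = speeds[120:], fuel[120:]

tree = fit_tree(train_x, train_y)
bagged = fit_bagged(train_x, train_y)
r2_tree = r2_score(test_y, [predict_tree(tree, s) for s in test_x])
r2_bagged = r2_score(test_y, [predict_bagged(bagged, s) for s in test_x])
print(f"R2 single tree: {r2_tree:.3f}, R2 bagged trees: {r2_bagged:.3f}")
```

On noisy data, averaging many trees fitted to bootstrap resamples usually smooths the step-like predictions of a single tree. Whether that improves R2 depends on the data, so the printed values are an illustration rather than a reproduction of Table 2.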
4.1. Modeling of Fuel Consumption
The modeling of fuel consumption has become an important consideration, not just for the implementation of a sustainable intelligent transportation system (ITS), but also for the economic justification of the amount of energy consumed. The model performance results are presented in Table 2.
Based on the value of the coefficient of determination (R2), the ensemble bagged model performance exceeded that of the decision tree model in both vehicle and GPS speed input categories.
The scatter plot of the predicted and measured fuel consumption presented in Figure 2 depicts the prediction accuracy of the models. The results show that the ensemble bagged model is more accurate than the tree algorithm: its data points lie closer to the line of equality than those of the tree model, which indicates better model performance.
Furthermore, the values of the RMSE, MAE, and MSE presented in Table 2 also indicate that the ensemble bagged model is more accurate. Although the normalized MSE value obtained for both models was 0.001, the lower RMSE and MAE values of 0.025 and 0.014 for the bagged model are consistent with its higher R2 values of 0.967 and 0.968 for fuel consumption prediction using vehicle speed and GPS parameters, respectively.
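For reference, the four evaluation metrics named above follow standard definitions, sketched below in Python. The function name and the sample values are illustrative assumptions, and the study's normalization step is not reproduced here.

```python
import math


def regression_metrics(measured, predicted):
    """Return MSE, RMSE, MAE, and R2 for paired measured and predicted values."""
    n = len(measured)
    errors = [p - m for m, p in zip(measured, predicted)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_measured = sum(measured) / n
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)  # total sum of squares
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical normalized values (not taken from the study's tables).
metrics = regression_metrics([0.0, 0.5, 1.0], [0.1, 0.5, 0.9])
print({name: round(value, 4) for name, value in metrics.items()})
```

Because RMSE is the square root of MSE, an RMSE of 0.025 corresponds to an MSE of about 0.0006 when both are computed on the same normalized values, which rounds to the reported 0.001 at three decimal places.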
A superimposed scatter plot of the vehicle and GPS speed models, presented in Figures 3 and 4, shows that the GPS input-based models are much closer to the line of equality. Also, the similar positions of the outliers in the two models’ plots strongly suggest the potential of using GPS data in the prediction of fuel consumption. The results further show that modeling fuel consumption with the gasoline engine dataset gives high accuracy for both vehicle speed and GPS parameter-based models. The coefficient of determination (R2) obtained is about 99% for the gasoline dataset. Furthermore, the poorer performance of the diesel engine dataset models could be attributed to either low data quality arising from machine error or the insufficiency of the selected input for predicting consumption in diesel engines.
4.2. Modeling of Exhaust CO2
The transportation industry accounts for large CO2 emissions, especially in urban areas, which not only increase the potential for global warming but also contribute to poor air quality. This research evaluated the model performance of ensemble bagged and decision tree models using vehicle and GPS parameters. Based on the data from Figure 3 and Table 3, GPS parameter-based models have higher accuracy than vehicle speed-based models in both the mixed and the single diesel and gasoline databases. The coefficients of determination obtained between the normalized predicted and measured values were 0.966 and 0.975 for the ensemble and tree models, respectively, in the GPS input-based model. The results further suggest that real-time monitoring of CO2 is feasible and can be a useful tool to help policymakers optimize the transportation system.
4.3. Modeling of Exhaust NOx
The model performance for exhaust NOx emissions shows a slightly different trend when compared to the CO2 models. The models show weaker performance with the GPS input parameter: the accuracy obtained with the ensemble bagged model was 95%, and it decreased further to 76% in the tree model. However, the high accuracy of the ensemble model also indicates that the GPS parameter can be used to predict exhaust NOx emissions. Table 4 further indicates that a combination of the evaluation metrics (RMSE, MAE, MSE, and R2) is necessary for validating the performance of a model.
Although the accuracy of the single diesel and gasoline database models in predicting NOx did not exceed 89%, as observed in Figure 4, the ensemble bagged method reached a surprisingly higher accuracy, up to 96%, in the mixed diesel/gasoline database.
5. Discussion
Rising greenhouse gas emissions and their environmental impacts have become a global concern, necessitating innovative approaches to quantify and mitigate emissions. Real-time estimation of emissions and fuel consumption is critical for developing efficient Intelligent Transportation Systems (ITS) and implementing environmentally and economically sustainable transportation solutions. This study addresses these needs by evaluating machine learning models based on GPS speed and vehicle speed inputs.
Recent work has focused on integrating real-time vehicle behavior data, including speed variations and acceleration patterns, to develop dynamic, location-specific emission forecasts. Despite these advancements, there are still significant challenges that warrant further research. For instance, the variability in GPS signal accuracy, the influence of different vehicle types, and the adaptation of predictive models to new emission standards and real-world driving conditions remain areas for improvement. Additionally, integrating machine learning and artificial intelligence with real-time GPS data presents opportunities to refine emission predictions and address emerging concerns, such as electric vehicles and new fuel technologies.
The findings indicate that GPS speed, as an alternative to vehicle speed, is a reliable and efficient parameter for predicting fuel consumption and exhaust emissions. The low percentage error (1.4%) between GPS speed and vehicle speed measurements demonstrates the feasibility of using GPS data for real-time monitoring. The ensemble bagged model exhibited superior performance compared to the decision tree model across all datasets and parameters, reinforcing its applicability for complex, high-dimensional data.
The study also explored the combined use of gasoline and diesel engine datasets. Interestingly, while the gasoline-only dataset achieved the highest accuracy in fuel consumption predictions, the combined dataset enhanced the predictive accuracy for emissions. These findings suggest that combining datasets can capture broader variations and improve model robustness, particularly for emission predictions.
The comparison of models trained on individual gasoline and diesel datasets versus the combined dataset underscores the importance of data diversity in machine learning applications. The high accuracy achieved with gasoline datasets could be attributed to the consistency of input parameters, whereas the slightly lower performance for diesel engines points to potential variability in data quality or inherent differences in engine behavior.
Figure 5, Figure 6, and Figure 7 present a detailed summary of the models’ comparison using radar charts. In these figures, the performance metrics were normalized to between 0 and 0.5 to scale their visual representation while ensuring that R2 remains distinguishable.
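The source does not specify how the metrics were normalized for the radar charts. One simple possibility, shown here purely as an assumption, is min-max rescaling of each error metric into the range 0 to 0.5. The function name and the sample RMSE values are hypothetical.

```python
def rescale(values, lo=0.0, hi=0.5):
    """Min-max rescale `values` linearly into the interval [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [lo] * len(values)  # constant input: map everything to the lower bound
    scale = (hi - lo) / (v_max - v_min)
    return [lo + (v - v_min) * scale for v in values]


# Hypothetical RMSE values for three models (not taken from the study's tables).
rmse_values = [0.025, 0.030, 0.050]
scaled = rescale(rmse_values)
print([round(v, 3) for v in scaled])
```

Keeping the error metrics within 0 to 0.5 leaves R2 values near 1 visually separate on the same axes, which matches the stated intent.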
A significant contribution of this study is the demonstration of GPS data as a viable alternative to vehicle speed data, particularly in scenarios where installing speed sensors may be impractical. The results pave the way for ITS solutions that can quantify real-time fuel consumption and emissions without the logistical and legal complexities associated with vehicle-integrated systems.
However, the scope of this study is limited to light-duty vehicles and specific exhaust pollutants (CO2 and NOx). Further research is required to expand these findings to heavy-duty vehicles, additional pollutants, and alternative machine learning techniques such as deep learning. Future investigations should also examine the impact of varying road conditions, vehicle types, and environmental factors to develop more generalized and scalable models.
This research represents a step forward in leveraging machine learning for sustainable transportation, offering practical insights for policymakers and transportation planners aiming to optimize energy use and minimize environmental impacts.
6. Conclusions
This study investigated the application of ensemble bagged and decision tree algorithms for predicting fuel consumption and exhaust emissions (CO2 and NOx) in gasoline and diesel engine vehicles using vehicle speed and GPS-based speed input parameters. The research also explored the impact of combining gasoline and diesel engine datasets to enhance model accuracy.
The results demonstrated that both ensemble bagged and decision tree models achieved high predictive accuracy, with performance exceeding 95% in fuel consumption predictions for both vehicle speed and GPS speed inputs. The ensemble bagged model consistently outperformed the decision tree model, showcasing its superior capability in handling complex datasets. Notably, GPS-based input models performed comparably to vehicle speed-based models, emphasizing the feasibility of GPS data for real-time modeling applications in intelligent transportation systems (ITS).
Moreover, the analysis revealed that the combination of gasoline and diesel datasets improved the predictive accuracy for emissions, with the mixed dataset performing better than individual engine datasets. However, the gasoline engine dataset exhibited the highest predictive accuracy, achieving 99% in both input variable categories.
This research highlights the potential of leveraging GPS speed data in machine learning models for real-time monitoring of fuel consumption and exhaust emissions. Such advancements could significantly aid policymakers in optimizing transportation systems and mitigating environmental impacts. Future work should focus on expanding the scope of this research to include additional exhaust pollutants, heavy-duty vehicles, and the integration of deep learning techniques to further enhance predictive capabilities and perform comparative analysis with other machine learning methods.