GPS Data and Machine Learning Tools, a Practical and Cost-Effective Combination for Estimating Light Vehicle Emissions

This paper focuses on the emissions of the three most sold categories of light vehicles: sedans, SUVs, and pickups. The research is carried out through an innovative methodology based on GPS and machine learning in real driving conditions. For this purpose, driving data from the three best-selling vehicles in Ecuador are acquired using a data logger with GPS included, and emissions are measured using a PEMS in six RDE tests with two standardized routes for each vehicle. The data obtained on Route 1 are used to estimate the gears used during driving using the K-means algorithm and classification trees. Then, the relative importance of driving variables is estimated using random forest techniques, followed by the training of ANNs to estimate CO2, CO, NOX, and HC. The data generated on Route 2 are used to validate the obtained ANNs. These models are fed with a dataset generated from 324, 300, and 316 km of random driving for each type of vehicle. The results of the model were compared with the IVE model and an OBD-based model, showing similar results without the need to mount the PEMS on the vehicles for long test drives. The generated model is robust to different traffic conditions as a result of its training and validation using a large amount of data obtained under completely random driving conditions.


Introduction
Vehicle emissions from internal combustion engines are the primary source of pollution in urban areas, negatively impacting air quality in cities [1]. Consequently, these pollutants need to be quantified [2]. Thus, vehicular emissions inventories serve as important tools for implementing and evaluating policies aimed at reducing the environmental impact of vehicular activity on the quality of life of the population [3]. The quality of emissions inventory results directly depends on the inputs and methodologies applied in their determination; therefore, various methods exist for estimating pollutants according to the realities of each population. Among the most commonly used alternatives are the International Vehicle Emissions (IVE) model developed in the United States by the Massachusetts Institute of Technology in collaboration with the International Council on Clean Transportation and the Computer Program to Calculate Emissions from Road Transport (COPERT) developed in the European Union by the Joint Research Center. These models estimate vehicular pollution emissions based on parameters such as emission factors, vehicular activity, and characteristics of the vehicle fleet. However, these parameters may not be equivalent to those in regions like Latin America, as variations in geographical and environmental conditions, vehicle technology, driving styles, and fuel quality can significantly impact vehicle emissions, as determined by [4], and may not be fully reflected in the IVE and COPERT calculations [5]. Therefore, different authors have developed methods to improve pollutant estimation by considering the specific conditions of each region or city. Costagliola et al. [6,7] found that pollutant emissions estimated using laboratory chassis dynamometer tests and adjusted driving cycles are lower than those determined in real driving cycles. Kurtyka et al. [8] and Mera et al. [9] reach similar conclusions, emphasizing that the differences in results between dynamometer tests and real driving emissions (RDEs) are due to traffic conditions and driving styles. Hence, they recommend evaluating pollutant emissions in real driving cycles.
Fontaras et al. [10] and Samaras et al. [11] determined that trips in private vehicles constitute the main cause of fuel waste and unnecessary emissions of pollutants, influenced by driver behavior, route selection, and traffic management, highlighting the importance of vehicle monitoring for large-scale pollutant estimation. Prakash and Bodisco [12] and Boulter et al. [13] determined that fuel consumption and pollutant emissions depend on vehicle-specific factors such as model, engine displacement, weight, fuel type, technological level, and mileage, as well as operational factors such as speed, acceleration, road gradient, ambient temperature, and especially the gear shifting strategy employed by the driver [14-16]. Rivera-Campoverde et al. [17] proposed a model based on machine learning and on-board diagnostics (OBD) for estimating emission factors of a single vehicle through real short-duration driving tests in Cuenca, Ecuador, thus avoiding long measurement campaigns and prolonged use of portable emissions measurement systems (PEMSs). Other authors, such as [18,19], proposed GPS-based models that consider real traffic conditions, obtaining good results with low implementation costs.
This article presents a novel method for estimating pollutant emissions from three different types of vehicles, using driving variables such as speed and gradient obtained through GPS, as well as characteristic parameters of each vehicle such as mass, engine displacement, and aerodynamic coefficients through the application of machine learning techniques. To achieve this, RDE tests were conducted on three routes, from which emissions, GPS, and OBD data were collected. With these data, the input variables of the model and their respective levels of importance were estimated, followed by the training of an artificial neural network (ANN) validated with data obtained from three different RDE tests not used for training, confirming the validity of the emissions estimator. Finally, this estimator was applied to a dataset of 324, 300, and 316 km of real driving data for each vehicle. The results were compared with those obtained from the IVE and OBD test models, showing similar outcomes.

Methodology for the Estimation of Emission Gases under Real Driving Conditions
Pollutant emissions must be measured under real driving conditions [20]. Within these results, various factors are considered, such as driving style, fuel type, geographic location, and environmental conditions in which vehicles are operated [11], which are currently not considered in the models used by the Mobility Company of the city of Cuenca (EMOV-EP).
To estimate pollutant emissions using a parametric model that considers the weight, engine displacement, and aerodynamic coefficients of the vehicle under real driving conditions, the following steps are proposed, as illustrated in Figure 1:
1. Acquisition of real driving and emission data on two routes based on [20] for each vehicle.
2. Estimation of the relative importance of each obtained variable.
3. Training and validation of the neural network with the most significant variables from Route 1.
4. Validation of the trained ANNs using data from Route 2.
5. Application of the random driving dataset to the validated ANNs.
6. Processing and presentation of results.

For data collection, the vehicles used are the best-selling ones in Ecuador in the sedan, SUV, and pickup categories according to [21]. The vehicles, whose characteristics are shown in Table 1, undergo all maintenance operations recommended by the manufacturer. Table 1 also lists the aerodynamic characteristics of each vehicle, such as the drag coefficient ($C_X$) and the frontal area ($A_f$). The portable emissions measurement system (PEMS) used is the Brain Bee AGS-688 gas analyzer, powered by a battery independent from the test vehicles, as established in [20]. Fuel consumption is measured using the AIC Fuel Flow Master 5004. The GPS used is incorporated within the Freematics ONE+ data logger, which stores latitude (Lat), longitude (Lon), altitude (Alt), and vehicle speed ($V_{GPS}$) data on an SD card in CSV format. In addition to GPS data, the device stores driving data from OBD such as vehicle speed ($V_{OBD}$). The obtained data are shown in Table 2.

Test Routes
To analyze the behavior of the test vehicles during the application of the RDE tests [20], two different routes were proposed: Route 1 and Route 2. The datasets of each vehicle obtained on Route 1 were divided into 70% for training, 15% for validation, and the remaining 15% for testing the ANNs. The datasets of each vehicle obtained on Route 2 were used for a double cross-validation of the trained ANNs. The data collection routes used in the various RDE tests are located in the city of Cuenca, Ecuador. Urban segments are located in the city center, rural segments on the North Pan-American Highway, and highway segments on the Cuenca-Azogues highway, as shown in Figure 2.
The tests were conducted without the presence of rain or strong winds, with the windows closed and without air-conditioning activated. The test vehicles carried two passengers and a full tank of fuel. According to the manufacturer's recommendations, 92-octane fuel was used. The characteristics of the routes in real driving conditions are shown in Table 3 and are validated according to the guidelines in [20].
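The 70/15/15 partition of the Route 1 data can be sketched as follows. This is a minimal illustration that assumes a random shuffle with a fixed seed; the function name `split_route1` and the seed are hypothetical and are not taken from the study, which does not state how the samples were assigned.

```python
import random


def split_route1(samples, seed=0):
    """Split samples into 70% training, 15% validation, 15% testing."""
    shuffled = list(samples)
    # ASSUMPTION: samples are assigned at random; the seed only makes
    # this sketch reproducible.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = (70 * n) // 100  # integer arithmetic avoids rounding surprises
    n_val = (15 * n) // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

The Route 2 data would stay outside this split entirely, so that it can serve as an independent check of the trained networks.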

Estimation of Pollutants
Based on the volumetric concentrations of pollutants in the exhaust gases measured by the PEMS, the mass flow rates of each pollutant were estimated using the procedure described in [20]. The exhaust mass flow rate $\dot{m}_{ex}$ [g/s] was estimated from the mass flow rate of air $\dot{m}_{in}$, which was estimated from parameters obtained from OBD, and the fuel flow $\dot{m}_{f}$, measured by the flow meter located in the fuel line.
The emissions of pollutant j measured on a dry basis, $C_{dry,j}$, were corrected to a wet basis, $C_{wet,j}$, using the correction factor $k_w$, which depends on the molar ratio of hydrogen $\alpha$ and the sum of the concentrations of CO2 and CO on a dry basis, $C_{CO_2} + C_{CO}$.
The instantaneous mass emissions of each pollutant $\dot{m}_{j,i}$ [g/s] are obtained from the instantaneous concentration of each gas $c_j$ and the ratio between the density of each component and the overall density of the exhaust, $\mu_j$. According to [20], the values of $\mu_j$ are as follows: $\mu_{CO_2} = 0.001518$, $\mu_{CO} = 0.000966$, $\mu_{HC} = 0.000499$, $\mu_{NO_X} = 0.001587$. The instantaneous emissions of pollutants obtained during real driving tests are shown in Figure 3.
The emissions of each pollutant $m_j$ (g) in the driving cycle are equal to the sum of the n elements of their instantaneous emissions over time for a sampling time $\Delta t$ equal to 0.1 s:

$m_j = \sum_{i=1}^{n} \dot{m}_{j,i} \, \Delta t$  (5)

The emission factors $EF_{j,k}$ of each pollutant (g/km) are determined by the following equation:

$EF_{j,k} = \dfrac{m_{j,k}}{s_k}$  (6)

where $m_{j,k}$ is the mass of pollutant j and $s_k$ is the distance traveled in section k of the RDE test, where k takes the values of u, r, m for the urban, rural, and highway sections, respectively. The emission factors of each vehicle per section are shown in Figure 4. Applying the total emissions generated for each pollutant and the total distance traveled during the RDE test to Equation (6) yields the average emission factors for each vehicle, which are shown in Table 4.
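The chain from instantaneous concentrations to emission factors can be sketched as below. The sketch assumes that the instantaneous mass emission is the product of the concentration, the density ratio $\mu_j$ and the exhaust mass flow, in the units prescribed by [20]; that product form, the function names and the sample values are assumptions made for illustration, while the $\mu_j$ values and the 0.1 s sampling time come from the text.

```python
# Density ratios mu_j quoted in the text from [20].
MU = {"CO2": 0.001518, "CO": 0.000966, "HC": 0.000499, "NOX": 0.001587}
DT = 0.1  # sampling time in seconds, as stated in the text


def instantaneous_mass(conc, m_ex, pollutant):
    """ASSUMED form: m_dot_j,i = c_j,i * mu_j * m_dot_ex,i (units per [20])."""
    return [c * MU[pollutant] * q for c, q in zip(conc, m_ex)]


def cycle_mass(m_dot, dt=DT):
    """Total mass over the cycle: sum of instantaneous emissions times dt."""
    return sum(m_dot) * dt


def emission_factor(mass_g, distance_km):
    """Emission factor in g/km for one section (or for the whole test)."""
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    return mass_g / distance_km
```

Calling `emission_factor` with the section mass and section distance gives the per-section factor, and calling it with the totals for the whole test gives the average factor.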

Predictor Estimation
Among the most influential variables in pollutant emissions, characteristics inherent to individual vehicles stand out, such as engine displacement. This is because larger engines burn more fuel per cycle, resulting in a greater generation of CO2, CO, HC, and NOX [22]. It is important to consider that the specific influence of engine displacement on emissions may vary depending on the engine design, technology, and implemented emissions control.
Another variable analyzed in pollutant emissions is the vehicle's weight [23], as it influences the rolling resistance force $F_r$, which is shown in Equation (7), and depends on the coefficients of static adherence $f = 0.015$ and dynamic adherence $f_0 = 0.01$, as well as affecting the gravitational resistance force $F_g$ shown in Equation (8).
The aerodynamic resistance $F_a$ is one of the major contributors to the fuel consumption and pollutant emissions of a vehicle, especially when traveling at high speeds [24]. It is calculated using Equation (9) [25], where the value of air density $\rho$ is equal to 0.89 kg/m³.
The longitudinal acceleration of the vehicle is determined by Equation (10), while the forces occurring during driving are applied as shown in Figure 5 and are related using Equation (11), where $F_T$ represents the tractive force and $F_F$ represents the braking force, and they are mutually exclusive.
For the training of machine learning architectures, parameters $P_1$, $P_2$, and $P_3$ are considered, which refer to the engine displacement, vehicle weight, and its aerodynamic characteristics ($C_X$, $A_f$), respectively.
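The resistive forces described above can be approximated from speed, road angle and vehicle parameters as in the following sketch. Only the coefficients f and f0 and the air density are taken from the text; the speed-dependent rolling resistance form, the sign convention of the force balance and all function names are assumptions, since Equations (7) to (11) are not reproduced here and may differ.

```python
import math

G = 9.81          # gravitational acceleration, m/s^2
RHO = 0.89        # air density quoted in the text, kg/m^3
F_STATIC = 0.015  # coefficient f quoted in the text
F_DYNAMIC = 0.01  # coefficient f0 quoted in the text


def rolling_resistance(mass, v, theta=0.0):
    # ASSUMPTION: coefficient f + f0 * (v_kmh / 100)^2; the paper's
    # Equation (7) may use a different speed dependence.
    v_kmh = v * 3.6
    coeff = F_STATIC + F_DYNAMIC * (v_kmh / 100.0) ** 2
    return mass * G * math.cos(theta) * coeff


def grade_resistance(mass, theta):
    # Component of the weight along a road inclined by theta (radians).
    return mass * G * math.sin(theta)


def aero_resistance(v, cx, af, rho=RHO):
    # Standard drag expression with v in m/s and af in m^2.
    return 0.5 * rho * cx * af * v ** 2


def traction_and_braking(mass, v, accel, theta, cx, af):
    # ASSUMED balance: m*a = F_T - F_F - F_r - F_g - F_a, with F_T and
    # F_F mutually exclusive as stated in the text.
    net = (mass * accel + rolling_resistance(mass, v, theta)
           + grade_resistance(mass, theta) + aero_resistance(v, cx, af))
    return (net, 0.0) if net >= 0 else (0.0, -net)
```

With speed and altitude sampled by GPS, `accel` and `theta` would be obtained by differencing consecutive samples, so no OBD signal is needed to evaluate these predictors.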

Estimation of the Selected Gear
The test vehicles are equipped with manual transmission, and like 69% of the vehicles sold in Ecuador [21], they do not have sensors to determine the gear selected by the driver; therefore, it is necessary to determine this information from the OBD data using machine learning according to the process shown in [17]. The K-means algorithm is applied to the data acquired in the RDE test to cluster the vector r, which is calculated using Equation (12):

$r = \dfrac{VSS}{RPM}$  (12)

where VSS is the vehicle speed and RPM is the engine speed obtained from the OBD. The algorithm generates a label for each of the 7, 7, and 6 groups obtained from their centroids [26]. The generated groups correspond to each of the 6, 6, and 5 gears plus the neutral position of the sedan, SUV, and pickup vehicles, respectively. With the obtained label, a classification tree (CT) is trained that is applicable to all sampled driving cycles, as shown in Figure 6.
The values of $V_{OBD}$ and RPM are directly obtained through the OBD, so they cannot be used to train the GPS-based model. The gear used by the driver cannot be directly determined by $V_{GPS}$ since the gears selected do not depend exclusively on the driving speed. Given that gear usage during driving is random [27], supervised learning is employed, where the forces acting on the vehicle's movement are used as predictors for classification trees, and the gear used by the driver is the output, whose labels were obtained from OBD data, making the training vector $I = [V_{GPS}, a_X, F_r, F_g, F_a]$ [19]. From the training performed, three classification trees are obtained with 7, 7, and 6 splits to determine the gear of the sedan, SUV, and pickup vehicles, respectively; their training results are shown in the confusion matrices in Figure 7. These hyperparameters were determined based on the appropriate configuration of the maximum tree, which is quite simple, making pruning unnecessary. Cross-validation of the obtained trees is performed by randomly splitting the training data into several mutually exclusive folds. In each fold, a portion of the data is used for training and another portion for testing [28]. The data are divided into 5 folds, with each fold divided into 70% of the data for training and 30% for testing, resulting in an average test accuracy rate of 99.5%. The highest accuracy rates occur in neutral, 5th, and 6th gears, while in 3rd and 4th gears, the model's efficiency decreases because the vehicle's performance under these conditions is very similar.
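The gear-labelling step can be illustrated with a one-dimensional K-means on the ratio between vehicle speed and engine speed. The sketch below is a plain-Python stand-in, not the implementation used in the study: the quantile-based initialization, the function names and the sample values are assumptions, and the neutral position is left out for brevity.

```python
def speed_rpm_ratio(vss_kmh, rpm):
    # Ratio used for clustering; guarded against a stopped engine.
    return vss_kmh / rpm if rpm > 0 else 0.0


def nearest(value, centroids):
    # Index of the centroid closest to value.
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


def kmeans_1d(values, k, max_iter=100):
    """Cluster scalar values into k groups; labels are ordered by centroid."""
    ordered = sorted(values)
    # ASSUMPTION: deterministic start from evenly spaced quantiles.
    centroids = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    for _ in range(max_iter):
        labels = [nearest(v, centroids) for v in values]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    # Relabel so that label 0 is the smallest ratio (lowest gear).
    order = sorted(range(k), key=lambda j: centroids[j])
    rank = {j: r for r, j in enumerate(order)}
    labels = [rank[nearest(v, centroids)] for v in values]
    return sorted(centroids), labels
```

The resulting labels would then serve as the target of the classification tree, whose predictors are the GPS-derived quantities rather than the OBD signals.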

Estimation of the Relative Importance of Each Predictor
Predictive models based on machine learning methods such as random forest (RF) suffer from bias and variance issues. Simple models have low variance and high bias, whereas complex models reduce bias but increase variance due to overfitting [29]. Therefore, the training process of ANNs is optimized by prioritizing the use of the most important predictors determined by the RF technique [30], which coincides with the selection according to the Gini criterion. RF relies on multiple classification and regression trees (CART) to mitigate dimensionality problems in predicting variables, thereby enhancing the accuracy and stability of the model obtained by averaging the results of individual CART models [31]. This approach is applied to datasets where not all variables are considered, as they are randomly chosen in each CART [32].
For variable selection with RF, the data obtained from the RDE of Route 1 for each test vehicle were considered. The inputs included all vehicle operating parameters obtained through GPS, while the outputs consisted of the resulting pollutant emissions. To reduce the variance contributed by the predictors to the model, a very effective technique called "bagging" was employed. This involves combining results from different CARTs obtained using different subsets of predictors from the same population [31]. For this purpose, continuous variables must be transformed into categorical variables through level discrimination [17]. The number of levels was set to 7, 110, 144, 144, 144, 144, 3, 3, and 3 for the variables $G$, $V_{GPS}$, $a_x$, $F_g$, $F_r$, $F_a$, $P_1$, $P_2$, and $P_3$, respectively. The outcome of the most influential predictors is illustrated in Figure 8. The $R^2$ factor estimates the quality of the fit that RF has achieved to determine the importance of the variables in each of the outputs [33]. It is determined by Equation (13), where $Y_i$ is the vector of n predictions, $\hat{Y}_i$ is the vector of true values, and $\bar{Y}$ is their mean value.
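Two of the operations mentioned above, level discrimination of a continuous variable and the $R^2$ quality measure, can be sketched as follows. Equal-width binning is an assumption, since the text only gives the number of levels per variable, and $R^2$ is written in its usual form of one minus the ratio of residual to total sum of squares; the function names are illustrative.

```python
def discretize(values, n_levels):
    """Map continuous values to integer levels 0 .. n_levels - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    # ASSUMPTION: equal-width levels between the observed extremes.
    width = (hi - lo) / n_levels
    return [min(int((v - lo) / width), n_levels - 1) for v in values]


def r_squared(true, pred):
    """Coefficient of determination of predictions against true values."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot
```

For example, `discretize(speeds, 110)` would produce the 110 speed levels quoted in the text, provided the binning rule matches the one actually used.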

Training of the Neural Network with the Most Significant Variables
The data obtained on Route 1 of the RDE test for each vehicle were used to train one ANN for each pollutant, with their respective input vectors given in Equations (14)-(17). The networks were configured with 4 neurons in the input layer, 10 in the hidden layer, and 1 in the output layer, as determined in [34]. The dataset from Route 1 was divided into 70% for training, 15% for validation, and the remaining 15% for testing. The Levenberg-Marquardt backpropagation algorithm was used for network training, employing backpropagation to increase the learning speed [35,36]. The training characteristics of the ANNs obtained for estimating CO2, CO, NOX, and HC are shown in Table 5. The cost values (mean squared error, MSE) in training, validation, and testing can be compared there, and the indicator's value in the test dataset is lower than in training. The MSE is calculated using Equation (18), where $Y_i$ is the vector of n predictions and $\hat{Y}_i$ is the vector of true values [33]. The networks for estimating CO2, CO, NOX, and HC were trained in 221, 344, 101, and 17 epochs, respectively, due to early stopping, ensuring good performance of the networks in the training, validation, and testing stages. The number of epochs is relatively low for estimating HC, as generalization is quickly reached, avoiding network overfitting.
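The 4-10-1 topology and the MSE cost can be illustrated with a small forward pass. This sketch only shows the network shape and the cost function: the tanh hidden activation, the linear output, the random initialization and the function names are assumptions, and the Levenberg-Marquardt training itself is not implemented here.

```python
import math
import random


def init_network(n_in=4, n_hidden=10, seed=0):
    """Random weights for a network with one hidden layer and one output."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    return w1, b1, w2, b2


def forward(x, net):
    """Emission estimate for one input vector x of length n_in."""
    w1, b1, w2, b2 = net
    # ASSUMPTION: tanh hidden units and a linear output neuron.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def mse(pred, true):
    """Mean squared error between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)
```

Early stopping would monitor `mse` on the validation subset after each epoch and keep the weights from the epoch where it was lowest.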

Validation of the Neural Networks
The obtained networks were applied using the data collected on Route 2 of the RDE test for the three vehicles as inputs to compare the results with the data measured by the PEMS. It was observed that the fit is very satisfactory according to the scatter plots and error distribution diagrams shown in Figure 9. The model errors exhibit a nearly normal symmetric behavior around 0, with no offsets in the estimation of each contaminant [37]. Moreover, they behave completely randomly, thus ruling out the influence of other variables not considered in the ANNs' training.

Results
To assess the performance of the parametric model based on GPS for emission estimation, its results are compared to those obtained by applying the IVE model and the OBD-based estimation model [17].


CO 2 Emissions
The emission of CO 2 depends on the average driving speed.In Figure 10, the results obtained for the three analyzed vehicles are shown; in all three cases, the CO

CO Emissions
The emissions of CO shown in Figure 11 are inversely proportional to the average driving speed. The maximum emissions values are 18.04, 17.65, and 27.65 g/km, achieved when driving at the minimum average speed using first gear. As the driving speed increases, the minimum CO emissions are achieved, with values of 2.70, 2.95, and 5.95 g/km when using the fourth gear in the sedan, SUV, and pickup vehicles, respectively. When using gears higher than fourth gear, the emissions slightly increase, highlighting the importance of efficient driving and proper gear usage to reduce pollutant emissions. For the sedan and SUV vehicles, the results of the proposed model and the OBD-based model are very similar. However, there is a difference in the results for the pickup vehicle; this is because these vehicles are used as light-duty vehicles [38], which increases the engine load and consequently CO emissions due to incomplete fuel combustion [39].

HC Emissions
The HC emissions determined by the proposed model are very similar to those estimated by the OBD model in the sedan and SUV vehicles, with differences observed in the pickup category, as explained in Section 3.2. As shown in Figure 12, in all three vehicles, the emission factor is high at both low and high driving speeds, reaching a minimum value of 0.0235, 0.0343, and 0.0573 g/km at 58.98, 51.28, and 51.98 km/h for the sedan, SUV, and pickup vehicles, respectively. Beyond this speed, HC emissions increase again. This occurs because at low speeds, the loading and RPM conditions are not optimal for generating efficient and complete combustion, while at high speeds, the loading and temperature conditions also affect combustion efficiency [39]. However, the behaviors are similar in all three test vehicles, indicating that the model is effective.

NOX Emissions
The emissions of NOx represented in Figure 13 show that the proposed model and the IVE model maintain the same behavior in the sedan vehicle, with the maximum emissions being 0.6907 g/km in first gear at a speed of 9.95 km/h. After this point, the NOx emissions decrease as the average driving speed increases because, at lower speeds, the engine tends to experience a higher load, which is a crucial factor for NOx emissions. In the SUV and pickup vehicles, the maximum emissions of 1.094 and 0.958 g/km occur in second gear at an average speed of 18.94 and 23.79 km/h, respectively; this is because this gear is used to gain speed after starting, resulting in an increase in temperature and pressure in the combustion chamber in light-duty vehicles [23,40]. This demonstrates that the proposed model is capable of replicating results from a reference model, thus supporting its effectiveness and validity.
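The per-gear figures quoted above (an emission factor in g/km paired with an average speed in km/h) can be reproduced from gear-labelled driving samples by summing mass, distance, and time within each gear. The following is a minimal sketch under stated assumptions: the function name `factors_by_gear`, the tuple layout, and the sample values are hypothetical and are not taken from the paper's dataset.

```python
from collections import defaultdict


def factors_by_gear(samples):
    """Aggregate gear-labelled samples into per-gear averages.

    samples: iterable of (gear, distance_km, duration_h, mass_g) tuples.
    This layout is an assumption made for illustration only.
    """
    # Running totals per gear: [distance_km, duration_h, mass_g]
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])
    for gear, distance_km, duration_h, mass_g in samples:
        entry = totals[gear]
        entry[0] += distance_km
        entry[1] += duration_h
        entry[2] += mass_g

    result = {}
    for gear, (dist, hours, mass) in sorted(totals.items()):
        if dist <= 0 or hours <= 0:
            continue  # skip gears with no usable distance or time
        result[gear] = {
            "avg_speed_kmh": dist / hours,  # total distance over total time
            "ef_g_per_km": mass / dist,     # total mass over total distance
        }
    return result


# Hypothetical samples, not measurements from the study.
samples = [
    (1, 0.5, 0.05, 120.0),
    (1, 0.5, 0.05, 130.0),
    (5, 4.0, 0.05, 160.0),
]
per_gear = factors_by_gear(samples)
for gear, values in per_gear.items():
    print(gear, round(values["avg_speed_kmh"], 2), round(values["ef_g_per_km"], 2))
```

Dividing summed mass by summed distance, rather than averaging per-sample ratios, keeps short low-speed samples from dominating the per-gear factor.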

Discussion
To evaluate the performance of the proposed model, its results are compared with those obtained using the RDE test. The average emission factors for each model, determined from the total pollutant emissions and total distance traveled, are shown in Table 6. It is noteworthy that there is a close resemblance between the results of the RDE and GPS models; small differences arise because the relationship between the urban, rural, and highway segments in the RDE driving cycle differs from what occurs during random driving, which provided the data used for the GPS model estimation. In contrast, the values estimated by the IVE model are higher than those of the other models analyzed, as indicated in Section 1 [5]. The main difference lies in the CO2 emissions factor, which, as already discussed, is strongly influenced by low driving speeds in urban areas. The emissions estimated by the proposed model show minimal deviations from the RDE results, with −3.97% in CO emissions for the sedan vehicle and −15.56% in HC emissions and −3.57% in NOX emissions for the SUV vehicle. The largest deviations occur in the estimation of emissions for the pickup vehicle due to the specific use of these types of vehicles [38].

Table 6 shows the average emission factor values for the three models analyzed.

The emissions of CO2, CO, and NOX exhibit similar behavior concerning speed, and this is attributed to the gear shifts of the vehicle according to the driving speed. At lower speeds, lower gears (first, second, and third) are engaged, requiring the engine to operate at higher speeds, thereby increasing air and fuel consumption and, consequently, emissions. Conversely, at higher speeds, higher gears (fourth, fifth, and sixth) are utilized, reducing the engine's rotation speed and thus fuel consumption and emissions generated [17].
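The average emission factor used in this comparison (total pollutant mass divided by total distance traveled) and the percentage deviation of one model from the RDE reference can be sketched as follows. The function names and the numbers are illustrative assumptions; they are not values from Table 6.

```python
def emission_factor(mass_g, distance_km):
    """Average emission factor in g/km: total mass over total distance."""
    total_distance = sum(distance_km)
    if total_distance <= 0:
        raise ValueError("total distance must be positive")
    return sum(mass_g) / total_distance


def percent_deviation(estimate, reference):
    """Signed deviation of an estimate from a reference value, in percent."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return 100.0 * (estimate - reference) / reference


# Hypothetical per-segment totals (g and km), not data from the study.
rde_ef = emission_factor([50.0, 30.0, 20.0], [0.2, 0.3, 0.5])
gps_ef = emission_factor([45.0, 31.0, 20.0], [0.2, 0.3, 0.5])
deviation = percent_deviation(gps_ef, rde_ef)
print(round(rde_ef, 2), round(gps_ef, 2), round(deviation, 2))
```

A negative deviation means the model under test reports a lower emission factor than the reference, which is the sign convention assumed here for figures such as −3.97%.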

Conclusions
This article proposes a novel approach for estimating pollutant emissions from the most representative light vehicles circulating in Ecuador based on GPS data and applying machine learning to a large dataset. An approach was developed that initially employs a highly effective classifier to assess the gears selected by the driver. This classifier was built by obtaining labels through K-means clustering and subsequent training of classification trees. Errors manifest in the brief intervals that occur during gear transitions. Pollutant emissions calculations were performed by determining the importance of predictors in the data collected from two RDE test routes using RF. Subsequently, four ANNs were trained, which demonstrated high determination coefficients R² of 0.735, 0.861, 0.892, and 0.798 for the estimation of CO2, CO, HC, and NOX, respectively, and adequate error behavior, validating the method used.
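For reference, the determination coefficient R² reported for each ANN is conventionally computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that standard definition; the paper does not state its exact formula, and the function name and sample values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left unexplained by the model.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. estimated emission rates, not data from the study.
measured = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(measured, estimated), 3))
```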
In urban environments, average driving speeds are reduced, leading to the predominant use of the first, second, and third gears, resulting in a consequent increase in pollutant emission factors. In this context, the proposed model demonstrates greater robustness to various traffic conditions and driving styles in urban areas. This is because the model is based on random driving data covering 324, 300, and 316 km for the sedan, SUV, and pickup vehicles, respectively, compared to the 96.99, 81.88, and 87.21 km of the RDE test and the results of the IVE model. As the average driving speed increases, the results of the proposed model and the RDE test become more similar due to the decreased influence of traffic on vehicle performance and the smaller number of transient events in driving.
The obtained model is characterized by estimating emissions at a microscopic level with high reliability and low cost, due to the current availability of GPS receivers in a variety of portable devices. It presents advantages over existing models such as the IVE model, as it considers traffic conditions, the physical states of roads, and all interactions and dynamics between vehicles and their surroundings. Additionally, it considers special environmental conditions such as mountainous terrain and altitude above sea level, as well as the specific environmental conditions of each region, such as temperature, humidity, atmospheric pressure, and solar radiation.
The obtained model offers economic and practical advantages in its application compared to other models, given the ease of generating applications for installation on portable devices. Furthermore, it shows highly satisfactory performance, as despite its limitations, it provides excellent results in pollutant estimation without the need for connection to expensive equipment for long periods of time. This work presents several limitations related to vehicle longevity, driving styles, cold operation, and circulation on slopes; under these operating conditions, engine control systems tend to employ special strategies that directly influence emission behavior, so they should be considered for future developments. It is essential to replicate the proposed methodology in models of vehicles with a greater presence and activity in the automotive fleet, aiming to refine the results of vehicle emissions inventories.

3. Training and validation of the neural network with the most significant variables from Route 1.
4. Validation of the trained ANNs using data from Route 2.
5. Application of the random driving dataset to the validated ANNs.
6. Processing and presentation of results.

Figure 4. Emission factors of each vehicle per section of the RDE test.

Figure 6. Obtaining labels through K-means and CT training.

CO2 emissions are inversely proportional to the average driving speed, and the results of the IVE model are higher than those obtained by the other models, in accordance with what was shown in Section 1 [5]. The highest emissions are 249.91, 324.55, and 670.61 g/km, achieved at 9.95, 8.65, and 12.95 km/h using the first gear, while the lowest emissions are 35.1, 45.58, and 80.98 g/km, achieved at 78.15, 77.48, and 64.15 km/h using the highest gear in the sedan, SUV, and pickup vehicles, respectively. If a comparison is made among the three test vehicles, it can be observed that the highest emission values are found in the pickup, followed by the SUV and sedan; these values are proportional to their weight and aerodynamics, among other factors. It is worth noting the close similarity between the results generated by the proposed model and the OBD-based model, as both are based on a large amount of data collected under real driving conditions.

Table 1. Characteristics of the test vehicles.

Figure 1. Methodology and proposed procedure.

Table 3. Characteristics of the RDE tests.

Table 4. Average emission factors in RDE.