Estimation of Pollutant Emissions in Real Driving Conditions Based on Data from OBD and Machine Learning

This article proposes a methodology for estimating emissions in real driving conditions based on on-board diagnostics (OBD) data and machine learning, since no models were found that estimate pollutants without large measurement campaigns. For this purpose, driving data are obtained by means of a data logger, and emissions are measured with a portable emissions measurement system (PEMS) in a real driving emissions (RDE) test. The data obtained are used to train artificial neural networks (ANNs) that estimate emissions, after the relative importance of the variables has been estimated through random forest techniques. Then, by applying the K-means algorithm, labels are obtained to train a classification tree and thereby determine the gear selected by the driver. These models were fed with a data set covering 1218.19 km of driving. The results were compared with those obtained by applying the International Vehicle Emissions (IVE) model and with the results of the RDE test, and were found to be similar. The main contribution of this article is that the generated model is more robust across different traffic conditions and gives good results over the whole speed interval, with small differences at low average driving speeds, because more than half of the vehicle's travel occurs in urban areas under completely random driving conditions. These results can be useful for the estimation of emission factors, with potential application in vehicular homologation processes and in the estimation of vehicular emission inventories.


Introduction
Internal combustion engines (ICE) of automobiles are a major source of pollution in urban areas and contribute significantly to the deterioration of air quality in cities [1]. This is a serious problem given that, according to the United Nations, 54.5% of the world's population lived in urban areas in 2016 [2]. Private vehicle trips are the main cause of fuel wastage and unnecessary CO2 emissions. They show inefficiency in three domains: driver behavior, route selection and traffic management [3], in which parameters such as deficient deceleration, incorrect gear and engine speed selection, excessive speed and acceleration, congestion, poorly synchronized traffic signals, inefficient route choice and lack of knowledge and motivation stand out [4,5].
CO2 emissions are certified under standardized conditions, which are very different from real driving conditions, so when not enough data are available, simulation can be used [6]. Several stochastic models determine the influence of various variables on fuel consumption during road driving, such as speed and the presence of road features.
Computerized models for estimating emissions require databases with characteristics of the vehicle fleet, fuels, environmental conditions and geographic location [8]. Currently, the Cuenca Mobility Company (EMOV-EP) estimates the emissions inventory based on the MOBILE6-Mexico model [22], which considers only vehicles manufactured in the USA, without including those manufactured in the European Union and Asia [23], and whose database considers types of fuels and environmental and traffic conditions different from those of the city of Cuenca. Therefore, the proposed methodology is novel since, as far as the authors know, it would be the first contribution in Ecuador for the estimation of polluting emissions of one of the most common vehicles in real driving and environmental conditions.
To estimate exhaust gas emissions using real driving parameters, the following steps, which make up the new methodology and are represented in Figure 1, are proposed:
1. Real driving and emission data collection;
   a. Estimation of the selected gear during driving;
   b. Estimation of the relative importance of each measured variable;
2. Training and validation of the neural network with the most significant variables of Route 1, and validation of the resulting ANN by applying it to Route 2;
3. Application of the data set of 1218.9 km to the validated ANN;
4. Processing and presentation of results.
The procedure for each of the steps proposed in this methodology is detailed below.

Test Vehicle
The vehicle used in the route tests is a 2018 Kia Sportage, the best-selling SUV in Ecuador according to the Automobile Company Association of Ecuador, 2018 [24]. The vehicle has a DOHC 2.0 L engine, a 6-speed manual transmission and 18,720 km of travelled distance according to the odometer, and had received all the maintenance operations recommended by the manufacturer at the beginning of the measurement campaign. [20]. The equipment had a calibration certificate under ISO/IEC 17025 using span gas according to ISO 6145, valid at the time of sampling.

Data Logger
Operational parameters of the vehicle were obtained through OBD, together with GPS information, using a Freematics ONE+ data logger at a frequency of 15.15 Hz and stored on a micro-SD card. Fuel consumption was measured using an AIC Fuel Flow Master 5004. The operating and driving parameters are shown in Table 1.
The data logger recorded the information in CSV format, generating a separate file for each driving cycle. Each file was vectorised to obtain a time series matrix. The Savitzky-Golay algorithm was subsequently applied to each variable in order to eliminate outliers and soften the discretisation of the measured data [25]. The PEMS and the data logger had different sampling frequencies; therefore, a re-sampling algorithm was created to obtain vectors that are compatible in size and synchronised in time. ANNs were used for this purpose: they increased the number of PEMS samples to match the number of data logger records, as shown in Figure 2.
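The smoothing and alignment steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the ANN-based up-sampling is replaced here by plain linear interpolation, the Savitzky-Golay window is fixed at 5 points with a quadratic fit, and the sample values and names (`savgol5`, `resample_linear`, `t_pems`, `co2_pems`) are hypothetical.

```python
from bisect import bisect_left

# 5-point quadratic Savitzky-Golay smoothing weights (they sum to 35).
SG5 = (-3.0, 12.0, 17.0, 12.0, -3.0)

def savgol5(y):
    """Smooth y with a 5-point quadratic Savitzky-Golay window.
    The two samples at each edge are copied unchanged."""
    out = list(y)
    for i in range(2, len(y) - 2):
        out[i] = sum(w * y[i + k - 2] for k, w in enumerate(SG5)) / 35.0
    return out

def resample_linear(t_src, y_src, t_dst):
    """Linearly interpolate (t_src, y_src) onto t_dst.
    Times outside the source range are clamped to the nearest endpoint."""
    out = []
    for t in t_dst:
        j = bisect_left(t_src, t)
        if j == 0:
            out.append(y_src[0])
        elif j == len(t_src):
            out.append(y_src[-1])
        else:
            t0, t1 = t_src[j - 1], t_src[j]
            w = (t - t0) / (t1 - t0)
            out.append(y_src[j - 1] + w * (y_src[j] - y_src[j - 1]))
    return out

# Hypothetical example: a slow analyser trace aligned to a faster logger clock.
t_pems = [0.0, 1.0, 2.0]          # seconds
co2_pems = [10.0, 12.0, 11.0]     # arbitrary units
t_logger = [k / 10 for k in range(21)]
co2_on_logger = resample_linear(t_pems, co2_pems, t_logger)
```

Linear interpolation is only a baseline for checking that both records end up with the same length and time base; it does not reproduce whatever dynamics the ANN-based re-sampling may capture.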

Test Routes
In order to analyse the performance of the test vehicle during the real driving emissions (RDE) test [20], two different routes were proposed: Route 1 and Route 2. The data set obtained on Route 1 was divided into 70% for training, 15% for validation and the remaining 15% for ANN testing. The data set obtained on Route 2 was used for a double cross-validation of the fitted ANN. The route chosen for RDE data collection is in the city of Cuenca, Ecuador, South America; its urban section lies in the city centre, its rural section on the Panamerica Norte road, and its motorway section on the Cuenca-Azogues motorway, as shown in Figure 3.
The ambient temperature during the test was 14 °C, with no rain or strong winds; the weight of the vehicle, including two passengers and a full fuel tank, was 1719.5 kg. The vehicle was driven with the windows closed, without activating the air conditioning and under minimal traffic conditions. Fuel (92 octane) was used according to the recommendations of the manufacturer. The RDE trip characteristics are shown in Table 2.
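The 70/15/15 partition of the Route 1 data can be sketched as follows. The text does not state how samples were assigned to each subset, so the random shuffling, the fixed seed and the function name `split_70_15_15` are assumptions made only for illustration.

```python
import random

def split_70_15_15(n_samples, seed=0):
    """Return index lists for training, validation and testing.
    Integer arithmetic keeps the subset sizes exact; any remainder
    goes to the test subset."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)   # assumed: random assignment of samples
    n_train = n_samples * 70 // 100
    n_val = n_samples * 15 // 100
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_70_15_15(1000)
```

Route 2 is kept entirely outside this split, which is what allows it to serve as an independent check of the fitted networks.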

Estimation of the Selected Gear
The test vehicle, like most manual transmission vehicles, does not have a selected gear sensor; therefore, the gear must be determined from the data obtained through OBD. A state-of-the-art review evidences the lack of an automatic method to infer the gear used at every moment of the trip; existing approaches [11,26] estimate the selected gear from the engine speed and wheel speed obtained from CAN bus data by identifying the RPM/u ratio within previously determined intervals, and consider values that fall outside those intervals to be gear changes. Therefore, this paper presents a methodology that, based on the vehicle speed and the engine speed and by applying machine learning, determines the gear with a high degree of certainty (99.5% accuracy). The K-means algorithm was applied to the data obtained in the RDE test, specifically to the vector $r_i = VSS_i / RPM_i$, which generated a label for each of the 7 groups obtained from their centroids [27]; the groups correspond to the 6 vehicle gears and to the neutral position. A classification tree (CT) was then trained with the labels obtained so that it could be applied to all sampled driving cycles; since the use of gears in a driving cycle is random, supervised learning is needed [28]. The generated tree had 7 splits and a 99.5% effectiveness rate. From it, the matrix $G_i = [G_0, G_1, G_2, G_3, G_4, G_5, G_6]$ was obtained, whose elements take the value 1 according to the gear selected in sample i. The labels obtained and the CT results are detailed in Figure 4.
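A minimal sketch of the gear-labelling idea follows, under several assumptions: the speed-to-RPM ratios are synthetic, the clustering is a plain one-dimensional Lloyd iteration, and the trained classification tree is replaced by a nearest-centroid rule, which on a single feature amounts to thresholds between neighbouring centroids. The names `kmeans_1d` and `gear_one_hot` are hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm in one dimension. Centroids start at evenly
    spaced quantiles of the sorted data and are returned in ascending order."""
    s = sorted(values)
    cents = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - cents[j]))
            groups[nearest].append(v)
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, cents)]
        if new == cents:
            break
        cents = new
    return sorted(cents)

def gear_one_hot(ratio, cents):
    """One-hot vector [G0..G6]; index 0 is read as neutral because its
    ratio is close to zero, and higher gears have higher ratios."""
    label = min(range(len(cents)), key=lambda j: abs(ratio - cents[j]))
    return [1 if j == label else 0 for j in range(len(cents))]

# Synthetic VSS/RPM ratios (hypothetical values): neutral plus six gears.
nominal = [0.0, 0.007, 0.013, 0.020, 0.027, 0.034, 0.041]
ratios = [r * (1 + 0.01 * ((i % 5) - 2)) for r in nominal for i in range(20)]
centroids = kmeans_1d(ratios, k=7)
g_vector = gear_one_hot(0.0205, centroids)
```

On real data the ratio is noisy during clutch engagement and gear changes, which is consistent with the remark in the conclusions that the largest classification errors occur in the transitions between gears.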

Pollutant Estimation
From the volumetric concentrations of pollutants in the exhaust gases, the mass flow rate of each pollutant is determined using the procedure described in [20]. The exhaust mass flow rate $\dot{m}_{exh}$ [g/s] is estimated from Equation (1):

$$\dot{m}_{exh} = \dot{m}_{in} + \dot{m}_{f} \qquad (1)$$

where $\dot{m}_{in}$ is the air mass flow estimated from the parameters obtained from OBD, and $\dot{m}_{f}$ is the fuel flow measured by the rotary piston flowmeter located in the fuel line. Emissions are measured on a dry basis and must therefore be corrected by Equations (2) and (3), where $C_{wet,j}$ is the concentration on a wet basis of pollutant j in volume; $C_{dry,j}$ is the concentration of the pollutant on a dry basis; $k_w$ is the correction factor from dry to wet basis; $\alpha$ is the molar ratio of hydrogen, and $C_{CO_2}$ and $C_{CO}$ are the concentrations on a dry basis of CO2 and CO, respectively.
The instantaneous mass emissions of each pollutant $\dot{m}_{j,i}$ [g/s] are obtained by Equation (4), where i is the measuring number; $c_{j,i}$ is the instantaneous concentration of the gas component in [ppm], and $\mu_j$ is the ratio between the density of each component and the overall exhaust density. In [20] they are determined to be $\mu_{CO_2}$ = 0.001518, $\mu_{CO}$ = 0.000966, $\mu_{NO_x}$ = 0.001587 and $\mu_{HC}$ = 0.000499. The instantaneous emission values of the vehicle can be obtained based on this estimation, as shown in Figure 5.
The emission of each pollutant $m_j$ [g] in the driving cycle is equal to the summation of its instantaneous emissions over time, as shown in Equation (5):

$$m_j = \sum_{i=1}^{n} \dot{m}_{j,i}\,\Delta t \qquad (5)$$

where $\dot{m}_{j,i}$ is the instantaneous mass flow of pollutant j; n is the number of samples in the data set, and $\Delta t$ is the sampling time, which is equal to 0.1 s. The cumulative emissions are detailed in Figure 6.
The emission factors $F_{j,k}$ of each pollutant [g/km] in section k of the RDE were determined by Equation (6):

$$F_{j,k} = \frac{m_{j,k}}{s_k} \qquad (6)$$

where $m_{j,k}$ is the mass of pollutant j, and $s_k$ is the travelled distance in section k of the RDE. k takes the values u, r and m for the urban, rural and motorway sections, respectively. The results obtained are shown in Table 3. The mass flow of each pollutant $\dot{m}_{j,i,G}$, the total mass per trip $m_{j,G}$ and the total travelled distance $s_{j,G}$ are also estimated for each selected gear G.
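The chain from concentrations to emission factors can be sketched in a few lines. Equation (1) is implemented as stated; the instantaneous mass is assumed to take the product form $\dot{m}_{j,i} = \mu_j \, c_{j,i} \, \dot{m}_{exh,i}$ suggested by the definitions of $\mu_j$ and $c_{j,i}$, and the dry-to-wet correction of Equations (2) and (3) is left out. Whether the exhaust flow enters this product in g/s or kg/s depends on the convention behind the $\mu_j$ values and should be checked against [20]. All function names and numeric inputs are hypothetical.

```python
# Density ratios quoted in the text, citing [20].
MU = {"CO2": 0.001518, "CO": 0.000966, "NOx": 0.001587, "HC": 0.000499}
DT = 0.1  # sampling time in seconds, as stated in the text

def exhaust_flow(m_air, m_fuel):
    """Equation (1): exhaust flow as the sum of air and fuel flows."""
    return m_air + m_fuel

def instantaneous_mass(pollutant, conc_ppm, m_exh):
    """Assumed form of Equation (4): density ratio x concentration x flow."""
    return MU[pollutant] * conc_ppm * m_exh

def cycle_mass(m_dot_series, dt=DT):
    """Sum of the instantaneous mass flows times the sampling interval."""
    return sum(m_dot_series) * dt

def emission_factor(mass_g, distance_km):
    """Mass emitted in a section divided by the distance travelled in it."""
    return mass_g / distance_km

# Hypothetical sample: flows and concentration chosen only for illustration.
m_exh = exhaust_flow(0.020, 0.001)
co2_rate = instantaneous_mass("CO2", 100000.0, m_exh)
co2_mass = cycle_mass([co2_rate] * 100)      # 100 samples, i.e. 10 s
co2_factor = emission_factor(co2_mass, 0.1)  # over a hypothetical 0.1 km
```

The same three functions apply per gear by restricting the sample series to the samples in which the corresponding element of the gear vector equals 1.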

Estimation of the Relative Importance of Each Measured Variable
To optimise the training process of the ANNs, the most representative or influential variables were prioritised based on the predictor importance provided by the random forest (RF) technique, which agreed with the selection according to the Gini criterion. RF is based on multiple classification and regression trees (CART) to reduce dimensionality problems in the prediction of variables. It improves the accuracy and stability of the model by averaging the results of the individual CART models, each applied to a data set in which not all the variables are considered because they are randomly chosen in each CART [26].
For the selection of variables with RF, the data obtained in the RDE test on Route 1 are used, with all the operating parameters of the vehicle as inputs and the pollutant emissions produced as outputs. The most influential predictors are shown in Figure 7. The most influential variables in pollutant emissions are TPS, MAP, RPM, VSS and GEAR, leaving aside factors such as IAT, ECT and O2, with the importance cut-off value fixed at 5. Acceleration ($a_x$) is one of the least influential variables in a direct way, which can be explained by its correlation with VSS and GEAR [16].

Training and Validation of the Neural Network with the Most Significant Variables of Route 1
The data obtained on Route 1 of the RDE test are used to train one ANN per pollutant, each with 4 neurons in the input layer, 10 in the hidden layer and 1 in the output layer. The input vector of each network is formed from the most significant variables identified in the previous section.
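The 4-10-1 topology can be made concrete with a forward pass written out by hand. This is only a sketch of the network shape: the tanh hidden activation, the linear output, the random initial weights and the names `init_mlp` and `forward` are assumptions, since the text gives neither the activation functions nor the training algorithm.

```python
import math
import random

def init_mlp(n_in=4, n_hidden=10, seed=0):
    """Create untrained parameters for a 4-10-1 network (hypothetical init)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    return w1, b1, w2, b2

def forward(x, params):
    """One forward pass: tanh hidden layer (assumed), linear output neuron."""
    w1, b1, w2, b2 = params
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

params = init_mlp()
# Four scaled inputs, standing in for four of the selected OBD variables.
estimate = forward([0.2, 0.5, 0.3, 0.4], params)
```

One such network would be fitted per pollutant, so that four independent sets of parameters produce the CO2, CO, NOx and HC estimates.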

Validation of the ANN with Route 2 Data
The data obtained on Route 2 of the RDE test are applied as inputs to the generated networks, and the fit is highly satisfactory according to the scatter and distribution diagrams of the errors. The residuals of the model show a symmetric, quasi-normal behaviour around 0, with no offsets in the estimation of any of the pollutants. The residuals behave completely randomly, so the influence of variables not considered in the training of the ANNs is ruled out, as shown in Figure 8.
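The checks described above, residuals centred on zero and a high coefficient of determination, can be computed as in the following sketch. The measured and estimated values are hypothetical, and the function names are illustrative.

```python
from statistics import mean

def residuals(measured, estimated):
    """Pointwise difference between measured and estimated emissions."""
    return [m - e for m, e in zip(measured, estimated)]

def r_squared(measured, estimated):
    """Coefficient of determination of the estimates against the measurements."""
    m_bar = mean(measured)
    ss_res = sum((m - e) ** 2 for m, e in zip(measured, estimated))
    ss_tot = sum((m - m_bar) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Hypothetical Route 2 samples for one pollutant.
measured = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.1, 3.9]
res = residuals(measured, estimated)
offset = mean(res)                    # an offset near 0 indicates no bias
r2 = r_squared(measured, estimated)
```

A mean residual close to zero corresponds to the absence of offsets noted in the text; the shape of the residual distribution still has to be inspected separately, for example with a histogram.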

Double Validation of the ANN. Data Set of 1218.9 km
The 1218.9 km data set was obtained randomly in real driving conditions. The data logger was kept connected to the vehicle for one month, during which three drivers used the vehicle without any prior driving instruction, to ensure that the data obtained were as realistic as possible. The driving cycles generated were random, without urban, rural or motorway route planning.

Processing and Presentation of Results
From the total travelled distance, 295 files are obtained, one for each driving cycle, which is defined as the distance travelled by the vehicle from the moment the engine is started until the engine speed is below 50 RPM and the vehicle speed is equal to 0 km/h [20]. Likewise, each cycle is subdivided into movement areas and stop areas, and a driving micro-cycle is defined as the distance travelled from one stop area to the beginning of the next one, according to [29]; a total of 2785 files are generated under these conditions. A matrix $Mc_{n,m}$ is stored for each driving micro-cycle, where n is the number of the cycle from which micro-cycle m was obtained. This matrix contains all the operating and driving parameters shown in Table 1, the selected gear, and the instantaneous CO2, CO, NOx and HC emission values [g/s] calculated through the ANNs obtained and validated in Sections 2.4 and 2.5.
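The subdivision of a driving cycle into micro-cycles can be sketched as a scan over the speed trace. The reading adopted here is an assumption: a micro-cycle starts where a stop area begins and ends just before the next stop area begins, and the stop threshold is taken as the 1 km/h value the article applies to stops. The function name and the speed trace are hypothetical.

```python
STOP_KMH = 1.0  # speeds below this value are treated as a stop

def micro_cycles(speed_kmh):
    """Return (start, end) index pairs, with end exclusive.
    A new micro-cycle begins at every transition from moving to stopped."""
    bounds = [0]
    for i in range(1, len(speed_kmh)):
        was_moving = speed_kmh[i - 1] >= STOP_KMH
        now_stopped = speed_kmh[i] < STOP_KMH
        if was_moving and now_stopped:
            bounds.append(i)
    bounds.append(len(speed_kmh))
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]

# Hypothetical speed trace in km/h: stop, drive, stop, drive, final stop.
trace = [0, 0, 5, 20, 10, 0, 0, 15, 30, 0]
segments = micro_cycles(trace)
```

With this reading, a cycle that ends at rest produces a final segment containing only stop samples; whether such a segment counts as a micro-cycle depends on the definition in [29] and may need filtering.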
The emission of each pollutant, travelled distance, average speed and time spent on the route travelled are estimated in each micro-cycle matrix per each selected gear.
The environmental conditions do not vary greatly throughout the sampling period, since the city of Cuenca is located in the equatorial zone, where the climate is practically constant; therefore, their influence on the results obtained is discarded.

Results and Discussion
The data obtained over 1218.9 km of random travel through the urban, rural and motorway areas of the city of Cuenca, in a total of 47.06 h, are applied to the generated models, producing a data set of 2,505,459 × 18 values, whose results are shown in Table 4. The results allow vehicle performance to be evaluated in urban, rural and motorway driving. Stops are considered as periods in which the vehicle speed is less than 1 km/h, as specified in [20]. Idle time comprises 14.26% of the total running time; the emissions generated during stops are CO2 = 9039.2 g, CO = 99.91 g, NOx = 3.54 g and HC = 0.9398 g, at generation rates of 374.04 mg/s, 4.13 mg/s, 0.146 mg/s and 0.039 mg/s, respectively, as shown in Figure 8. Relative to the total generated during the whole analysed period, idling emissions correspond to 7.35% of CO2, 1.51% of CO, 1.85% of HC and 0.38% of NOx. These results do not consider special engine operation during a cold start, which requires specific study in future work; in this case, the increase in emissions at low temperatures is due to the increase in engine speed and does not consider the enrichment of the mixture, as shown in Figure 9. During real driving, the emissions generated depend on the parameters specified in Section 2.3; these results are therefore influenced by the different operating conditions of each trip [17] and reflect real congestion conditions, which [1,30] identify as very important for estimation in models based on average speed.
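The idle emission rates quoted above follow from the idle masses and the idle time, and the arithmetic can be checked directly. The sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Figures taken from the text.
total_time_s = 47.06 * 3600          # total running time in seconds
idle_share = 0.1426                  # idle time as a fraction of running time
idle_mass_g = {"CO2": 9039.2, "CO": 99.91, "NOx": 3.54, "HC": 0.9398}

idle_time_s = total_time_s * idle_share

# Average generation rate during stops, in mg/s.
idle_rate_mg_s = {gas: 1000.0 * g / idle_time_s for gas, g in idle_mass_g.items()}
```

The computed rates agree with the reported 374.04, 4.13, 0.146 and 0.039 mg/s to within the rounding of the idle-time percentage.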
Figure 10 shows that the 1st, 2nd and 3rd gears are mainly used during start-up and at low average speeds, over short distances travelled mostly in urban areas and very rarely in rural and motorway areas. The emission factors of CO2, CO, NOx and HC vary in proportion to the average vehicle speed during the period in which these gears are used, indicating that the lower the speed at which the gear change is made, the lower the pollution generated by the vehicle. For example, at an average speed of 12.96 km/h, the emission factors of CO2, CO, NOx and HC may go from [554.17, 46.11, 3.21, 0.141] to [121.98, 7.43, 1.612, 0.076], respectively, when changing from first to second gear. From 23.14 km/h onwards (the average speed in urban areas), CO2, CO and NOx emissions decrease when the average speed drops on changing to a higher gear, while HC emissions increase when upshifts are made and the average speed increases.
Several studies have highlighted the gap between the emissions produced in real driving and both those determined in certification tests [3] and those estimated by different models [31]. The differences in the estimation observed when using the IVE model are due to factors such as vehicle characteristics, among which manufacturing standard, legislation, gas treatment and feed system technologies, trip characteristics, fuel and driving, and weather conditions stand out [32]. For the estimation of emission factors with the IVE model, the average speed values for each gear, shown in Table 4, are used.
The proposed model determines the emission factor by relating the total amount of pollutant generated to the travelled distance, using Equations (9) and (10), according to the average driving speed per gear. Figure 11 shows the emission factors obtained from the IVE model, from the RDE test and from the model based on OBD data (OBDM). During urban driving, the average driving speed in the RDE test is 23.14 km/h, a value strongly influenced by the distances travelled at relatively high speeds in urban areas. Emissions generated at low driving speeds therefore become less representative, so that the emission factors estimated through IVE and OBDM at low driving speeds are higher than those determined by the RDE test. On this basis, the CO2 and NOx emissions determined by the three models are highly similar. The HC emissions determined by RDE and OBDM have highly similar values and behaviours, both lower than those estimated by IVE. The CO estimated by RDE and OBDM grows with increasing driving speed, contrary to what is determined by IVE.
The average emission factors for each model, determined from the total emission of the pollutant and the total travelled distance, are shown in Table 5, in which the RDE and OBDM results are very similar. The values estimated by IVE are higher than those of the other models analysed. The main difference is in the CO2 emission factor, which, as already analysed, is strongly influenced by low driving speeds in urban areas. The results obtained from RDE and OBDM are very similar because both are based on measurements in real driving conditions. The RDE test requires the data to be taken in travel proportions close to 34%, 33% and 33% in urban, rural and motorway driving, respectively, compared with the 58.27%, 29.26% and 12.29% that fed the OBDM model, as shown in Table 6. As a result, there is a greater amount of data in the urban area, where the CO2 emission is higher (Figure 11), and less data on the motorway, where emissions are lower, which raises the average emission value with respect to that obtained by RDE. The idle time values are similar, so they do not contribute to the difference between the models.
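The effect of the travel proportions on the average emission factor can be illustrated with a distance-weighted mean. The shares are those quoted in the text; the per-section CO2 factors are hypothetical values chosen only so that urban driving emits more per kilometre than rural or motorway driving, and the function name is illustrative.

```python
def distance_weighted_factor(shares_pct, factors_g_km):
    """Average emission factor as the share-weighted mean of the
    per-section factors. Shares are normalised by their sum, so they
    need not add up to exactly 100."""
    total = sum(shares_pct.values())
    return sum(shares_pct[k] * factors_g_km[k] for k in shares_pct) / total

rde_shares = {"urban": 34.0, "rural": 33.0, "motorway": 33.0}
obdm_shares = {"urban": 58.27, "rural": 29.26, "motorway": 12.29}

# Hypothetical per-section CO2 factors in g/km, for illustration only.
hypothetical_co2 = {"urban": 300.0, "rural": 180.0, "motorway": 160.0}

rde_avg = distance_weighted_factor(rde_shares, hypothetical_co2)
obdm_avg = distance_weighted_factor(obdm_shares, hypothetical_co2)
```

With identical per-section factors, the larger urban share alone gives a higher average, which is the mechanism the text uses to explain the higher OBDM value.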

Conclusions
This article proposes a method for the estimation of pollutant emissions by applying machine learning to a large set of OBD data. A classifier was first obtained to identify the gear selected by the driver, by generating labels with K-means and subsequently training a classification tree, with an effectiveness of 99.5%. The largest errors occur during the brief transitions between gears. Pollutant emissions were calculated with the most important predictors by training 4 ANNs on the data from measurement campaigns on two routes, carried out with measuring devices in the RDE test. The coefficients of determination R2 of the 4 ANNs are 0.985, 0.982, 0.999 and 0.982 for the estimation of CO2, CO, HC and NOx, respectively, which, together with the analysis of the residuals, highlight the strength of the statistical modelling.
Vehicle stops comprise 14.26% of the total driving time; the emissions generated in this operating condition correspond to 7.35% of CO2, 1.51% of CO, 1.85% of HC and 0.38% of NOx of the total emissions generated over the entire travelled distance. These amounts may vary during cold operation, a problem that has not been addressed in this research and requires future work.
Average driving speeds in urban driving are low, leading to a predominant use of the 1st, 2nd and 3rd gears and a consequent increase in pollutant emission factors. In this respect, the proposed model is more robust to different driving conditions and driving styles in urban areas, as it is based on the results of 712.39 km of random driving, compared with the 21.63 km of the RDE test and the results of the IVE model.
When the average driving speed increases, the OBDM and RDE test results are highly similar, due to the lower influence of traffic on vehicle performance and the smaller number of transient driving events.
The model obtained is more robust across different driving conditions and shows better results at low average driving speeds than the IVE and RDE models; therefore, it is recommended for the calculation of emission factors and the estimation of vehicular emission inventories.
In future developments, the model obtained can be adjusted to different parameters such as vehicle age, driving style, road gradient, weather conditions and cold operation, given that under these operating conditions the engine control system adopts special strategies that directly affect the emissions generated. The proposed methodology should be replicated in the vehicle models with the greatest presence and activity in the vehicle fleet of the city, in order to adjust the results of vehicular emission inventories.