1. Introduction
Environmental pollution is a global problem that affects the health of the world’s population; to face it, it is important to understand the causes and sources of high pollution levels. Three main types of pollution must be considered: air, water, and soil. Air pollution involves the presence of harmful chemical particles in the air at concentrations high enough to be detrimental to plant, animal, and human life. The presence of compounds that degrade air quality has contributed to climate change and ozone depletion, which negatively impact life on Earth. Moreover, fossil fuels release harmful atmospheric pollutants even before being burned. Human energy demand has been met largely through the combustion of oil, coal, and gas; however, these practices are among the main contributors to the current global warming crisis. The burning of these fossil fuels produces a range of primary and secondary pollutants, such as suspended particles, sulfur dioxide (SO2), carbon dioxide (CO2), carbon monoxide (CO), hydrocarbons, organic compounds, chemicals, and nitrogen oxides (NOx). These emissions contain the major greenhouse gases, such as CO2, methane (CH4), NOx, and fluorinated gases. Therefore, air pollution derived from these activities not only poses a threat to air quality but also partially contributes to climate change and global warming [
1]. Specifically, CO is formed when the combustion of fossil fuels is incomplete, so the main sources of CO in the world are energy producers and consumers, such as industry, commerce, and transportation, which primarily use coal, oil, and natural gas. The Stated Policies Scenario (STEPS) forecasts a CO decrease of about 73% by 2030; however, this decrease would be insufficient to achieve global climate goals if the consumption of these fossil fuels remains high [
2]. In an increasingly interconnected world that is dependent on mobility, the reduction in CO emissions produced by internal combustion engine vehicles (ICEVs) represents an environmental challenge [
3]. Although ICEVs are vital for progress and convenience, they are also an insidious source of pollution that endangers the quality of life and the well-being of the planet [
4]. In recent years, emissions resulting from ICEVs, such as CO, NOx,
, and
, have increased due to the high demand of mobility, especially
in 2022 [
5]. It is also important to mention that, in compression ignition engines, the in-cylinder temperature significantly influences the emission of toxic gases, since diesel engines undergo significant thermal processes during combustion. Accordingly, determining the heat released during operation makes it possible to check the combustion of the air–fuel mixture in the engine. The point at which 50% of the fuel dose has burned gives a relative measure of the combustion phase, that is, the angular position at 50% of the heat production. This indicator could be used to control combustion in the diesel engine, reducing the emission of toxic components. For example, by dividing the fuel injection into two parts, the first at 10° before top dead center of compression (BTDC) and the second at 50° BTDC, unburned methane (CH4) emissions and
emissions are reduced by 60% and 63%, respectively [
6]. In modern vehicles, poor maintenance of the cooling system can result in a high level of pollutant emissions such as HC, CO,
, and NOx, for example. A practical study demonstrated that the type of coolant, thermostat, and fan duty cycle are influential variables in controlling engine temperature and consequently uncontrolled toxic emissions. Removing the thermostat from the cooling system reduces HC emissions, but the engine’s performance would be affected initially until it reaches the optimal operating temperature; meanwhile, CO will be emitted at high levels. On the other hand, if the thermostat fully opens when exceeding the optimal temperature, it reduces the fan’s on-time and HC emissions. To achieve minimal HC emissions, it is necessary for the system to operate with a coolant mixture of 84.84% coolant, and 15.15% water. Minimal HC emissions occur when
levels are at a maximum of 15.43% [
7]. During engine warm-up, CO emissions constitute the largest share (up to 50%) of the annual total emissions. This influence was analyzed based on data from Poland’s pollutant emissions inventory for the years 1990–2017. Volatile organic compounds rank next, while the contribution of NOx is the lowest (less than 5%). As a result of the cold-start emissive behavior of internal combustion engines (ICEs), CO and volatile organic compounds’ emissions show a considerably greater impact on pollutant emissions compared to
, NOx, and particulate matter [
8]. Despite the relevance of this issue, many regions of the world lack specific vehicular emissions models for transportation, which makes it more difficult to implement sustainability strategies. In addition, the lack of mathematical models to assist in diagnosing failures of the sensors that manage emissions control in ICEs is a further barrier to effective solutions for reducing the carbon footprint and mitigating the effects of vehicular emissions.
In this context, several studies have already been reported; however, most of them model NOx and CO2, and only a few have addressed the generation of CO. For instance, [
9] presents a case study assessing NOx emissions from a coal-fired power plant and compares ten dynamic algorithms whose performance is assessed through the Root Mean Square Error (RMSE); the study highlights the effectiveness of certain methodologies for multi-step future horizon prediction, providing insights applicable to other dynamic systems. Different factors contribute to emission generation, but passenger cars contribute significantly to CO2 emissions in the European Union (EU); therefore, efforts to reduce CO2 emissions have included material changes in vehicle construction, such as replacing steel with lighter materials like aluminum and magnesium. In this regard, mathematical models have facilitated these changes, aiding in the reduction of CO2 emissions from passenger cars [
10]. Likewise, mathematical and geometric models have been employed to study the absorption process in gasoline engine hydrocarbon traps. These models, which incorporate mass conservation, momentum conservation, and energy conservation equations, enable the analysis and improvement of hydrocarbon trap performance in reducing emissions during cold starts [
11]. In addition, a novel multivariate grey model with time delay was proposed to measure the cumulative impact of CO2 emissions from China’s transportation sector [
12]; the model uses a Gaussian formula for discretization and particle swarm optimization for weight coefficient determination, and it outperformed competing models and offered insights for emission mitigation strategies. Auto-regressive (AR) mathematical models have been applied in fields such as economics, marketing, and political science, among others; although they pose challenges and constraints, their proper implementation is a suitable way to estimate and model vehicular emissions. In fact, no reported works based on AR models focus on CO emissions modeling, although other applications have been addressed; for example, [
13] carries out a simulation and modeling of a two-level DC/DC power converter using an AR system identification technique where Auto-Regressive with Exogenous inputs (ARX), auto-regressive moving average with exogenous inputs (ARMAX), and output error (OE) model structures are used to generate a mathematical model of the DC/DC converter. The result shows that the ARX model structure produced the best model with 94.03%, compared to ARMAX and OE with 93.70% and 92.25%.
In the transportation sector, the Auto-Regressive Exogenous (ARX) model has been used for identifying the dynamic model of a quarter-car passive suspension system using real-time test data. Input and output data of a vehicle are recorded during driving on a road surface. The results show that the best ARX model for the vehicle’s passive suspension system fits with 90.65% accuracy, meeting system identification requirements and being acceptable for use in automotive suspension system dynamics analysis [
14]. The transportation sector plays a fundamental role in pollutant emissions, given rapid economic growth and the increasing number of vehicles worldwide. The Vector Autoregression (VAR) model allows the dynamic relationships between economic variables and CO2 emissions in China’s transportation sector to be captured more accurately. Using time series data, the causes of and potential for reducing CO2 emissions in China’s transportation sector were explored, taking into account dynamic changes within the VAR model. The results provide a solid basis for identifying the main causes of CO2 emissions in this sector and proposing effective mitigation measures [
15]. A general linear and nonlinear auto-regressive model with exogenous inputs (GNARX) for NOx prediction uses a recursive least squares algorithm with forgetting factor to estimate model parameters, and a new optimization algorithm based on simulated annealing is developed to identify the model structure. The method is first used to complete model simulation, and then engineering data are used to validate its effectiveness and superiority compared to other methods. Based on grey relationship analysis, the main factors influencing NOx formation, such as net engine torque, turbo speed, and accelerator pedal position, are determined as inputs to model diesel engine NOx emission. The results show that the modeling and prediction accuracy of the GNARX model is higher than that of other models, indicating that the GNARX model is feasible for predicting NOx emission [
16]. A different approach can be observed in the car monitoring model to study the emissions of CO, hydrocarbons (HCs), and nitrogen oxide (NOx) gases from each vehicle using the signal light effects of the traffic light as an exogenous variable. The model collects experimental data, uses a simple ordinary differential equation to describe how the variable changes over time, and then employs the numerical method of the Euler Forward Difference Scheme (EFDS) to discretize the equation. Numerical results show that fuel consumption and emissions from each vehicle are influenced by the traffic light signal, which can help drivers adjust their driving micro-behavior to reduce fuel consumption and emissions [
17]. In some research studies, in addition to using regression methods with exogenous variables, artificial intelligence techniques such as artificial neural networks, deep learning, machine learning, genetic algorithms, and others have been employed to develop models for vehicle pollutant emissions. One approach developed to calculate the temporal emissions of NOx from a Euro IV diesel bus involves the use of the Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) technique in conjunction with a Long Short-Term Memory (LSTM) neural network. This method utilizes CEEMDAN to mitigate the non-stationarity and variability of emission data by dividing them into multiple sub-series with different frequencies. Subsequently, a predictive model is established for each sub-series using an LSTM neural network, and the results of each sub-series prediction are aggregated to obtain the final prediction. In general terms, the suggested hybrid model has the capability to provide more reliable and accurate predictions about instantaneous NOx emissions from diesel vehicles. This could establish a foundation for considering the replacement of physical NOx sensors with this model as a prediction basis [
18]. In the past, some emission models have used artificial intelligence, such as the multilayer perceptron (MLP) method, for predicting fuel consumption, as it provides accurate classification results despite the complex properties of different types of inputs. The model considered external environmental factor parameters, vehicle manipulation, and driver driving habits as input variables. In combination with sensitivity analysis, it was found that the use of MLP better classified the given dataset and that the architectures were able to learn powerful features [
19]. There are also other fields of application for autoregression techniques, such as in the field of mechanics, where autoregression theories with complete mathematical foundations are introduced for the first time, from which we can obtain a reference for the use of exogenous variables. The methodology with exogenous terms: Stationary Subspaces-Vector auto-regressive with exogenous (SSVARX) aims to address the lack of in-depth research on degradation trend estimation (DTE) of rotating machinery using autoregression theories. SSVARX stands out by transforming non-stationary vibration signals into degradation indicators with weakly stationary characteristics and performing degradation trend estimations. This approach demonstrates high precision and computational speed in bearing data, highlighting its superiority compared to other existing health prognosis techniques [
20]. Previous work in the field of vehicle emission modeling has demonstrated significant advancements; however, traditional approaches have mainly explored NOx and CO2 emissions, neglecting the detailed analysis of CO emissions and of exogenous variables such as temperature sensor signals and throttle position, which leaves a gap in the comprehensive understanding of vehicular pollutant gases. Additionally, the methodologies used, such as ARX, ARMAX, and VAR models and artificial intelligence techniques, can pose a high computational burden. These strategies often focus on steady-state system operation, limiting their applicability in transient situations, such as cold starts. Nevertheless, these studies have laid the groundwork for understanding the complex interactions between variables and emissions, providing valuable predictive tools. A more holistic approach is therefore needed to address both CO emissions and transient system states for more effective and accurate management of vehicular emissions.
Therefore, this paper addresses the need to reduce the CO emissions produced by ICEVs. It proposes a mathematical model based on Vector Auto-Regressive with Exogenous variables (VARX) that predicts the CO percentages produced by an ICE during its start-up. The main contribution of this proposal is a strategy for diagnosing excessive CO emissions caused by changes in the engine temperature, which is measured by the engine coolant temperature (ECT) sensor, together with its novel implementation as a diagnosis tool in automotive maintenance programs to detect the unexpected generation of CO emissions without relying on gas analyzer equipment. The proposed method is evaluated on a real dataset acquired from different experiments, and the obtained results demonstrate its effectiveness, offering an innovative tool capable of predicting CO emissions as a percentage based on ECT measurements regardless of whether the ICE operates in a transient or steady-state regime.
The rest of the paper is organized as follows: Section 2 presents the Theoretical Background, Section 3 describes the proposed Methodology, Section 4 presents the Experimental Setup, Section 5 reports the Results and Discussion, and Section 6 gives the Conclusions.
5. Results and Discussion
This work proposes a mathematical model, based on linear autoregression with exogenous variables, capable of predicting the behavior of CO emissions depending on one or more future samples of the engine temperature. In the proposed methodology, the ECT signal and CO emissions are acquired from different experimental tests and stored on a personal computer; the data are then analyzed and processed in GNU Octave. As mentioned above, the acquired signals are divided into training and validation data. In each of these two groups, the ECT signal is defined as the exogenous variable (the input to the algorithm) and the CO emissions signal as the output variable. A preliminary analysis of the ECT and CO signals in the cold start test at idle speed is then performed in order to understand the behavior of both variables. Thereby,
Figure 4a,b show the graphical representation of the signals used during the training and validation procedure, respectively. The behavior of CO% observed in
Figure 4a depicts a downward trend as samples are acquired: the CO% decreases over time as the ECT voltage decreases, while the engine warms up to its normal operating temperature, between 95 °C and 100 °C. The ECT sensor is of thermistor type, so this behavior is normal for automotive applications; the reference voltage signal becomes lower as the temperature increases. Notably, after the 30th second the CO stabilizes, indicating that both the emissions control and the mechanical parts of the engine have settled and begun to operate normally. On the other hand,
Figure 4b shows the acquired signals used to validate the obtained model. The same downward trend in the ECT voltage over time is observed, but the CO percentage behaves differently: it does not stabilize completely across the samples. This is a normal effect that may be due to internal mechanical conditions of the engine, such as wear on rings or valves, which do not always work ideally during cold starts. Nonetheless, the CO percentage in both experimental tests tends to fall below 0.5% as the engine temperature reaches its optimal levels. It should be highlighted that the signals presented in Figure 4a,b were acquired in different experiments; accordingly, in the rest of the manuscript the terms training data and validation data refer to the data acquired for training and for validation, respectively.
A digital low-pass Finite Impulse Response (FIR) filter was applied to the training data of Figure 4a; Figure 5a,b show the original signal and the filtered signal. The impulse response is designed to attenuate or eliminate the high-frequency components of the ECT signal shown in Figure 5a, while allowing the low-frequency components to pass and responding only to a finite number of input samples. In this case, the desired frequency response is specified, and the window design algorithm is then used to determine the filter coefficients that meet those specifications, achieving the satisfactory result shown in Figure 5b. However, since this filter may not be fully effective in reducing outliers, a moving average filter is also applied. Combining the moving average with the low-pass filter provided a robust and effective strategy to attenuate outliers and improve the reliability of the signal of interest, as observed in Figure 6.
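The two filtering stages described above can be illustrated with a minimal Python sketch. It is not the GNU Octave code used in this work: the Hamming window, the number of taps, the normalized cutoff, and the function names `fir_lowpass_coeffs`, `fir_filter`, and `moving_average` are all assumptions chosen for illustration.

```python
import math

def fir_lowpass_coeffs(num_taps, cutoff):
    """Windowed-sinc low-pass design with a Hamming window (assumed window).
    cutoff is a fraction of the sampling rate, 0 < cutoff < 0.5."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        k = n - mid
        # Ideal low-pass impulse response; the k == 0 limit is handled explicitly.
        if k == 0:
            ideal = 2.0 * cutoff
        else:
            ideal = math.sin(2.0 * math.pi * cutoff * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC

def fir_filter(signal, taps):
    """Causal FIR filtering; samples before the start are taken as signal[0]."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, h in enumerate(taps):
            acc += h * signal[max(i - j, 0)]
        out.append(acc)
    return out

def moving_average(signal, width):
    """Trailing moving average over at most `width` samples."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

# Hypothetical ECT-like trace with one outlier, smoothed by both stages.
raw = [3.0] * 20 + [4.5] + [3.0] * 20
smoothed = moving_average(fir_filter(raw, fir_lowpass_coeffs(21, 0.1)), 5)
```

Because a finite number of taps is used, the output at each sample depends only on a finite number of input samples, which is the property the text refers to; the moving average then spreads any residual spike over its window.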
Moreover, the Spearman correlation coefficient is estimated to evaluate the correlation between the acquired signals in the training data; this metric assesses the monotonic relationship between variables. The estimated Spearman coefficient is about 0.48629, which indicates a moderate positive correlation between the assessed variables (ECT and CO): as one variable increases, the other tends to increase as well, and vice versa. Because it is based on ranks, the Spearman coefficient is less susceptible to anomalous influences and may be preferable in scenarios with nonlinear data or outliers. To obtain a visual representation, a scatter plot of the ECT sensor voltage against the CO% from the cold start test is shown in Figure 7. In this scatter plot, it can be observed that when the ECT voltage remains in the range of 0 to 2.8 V, the CO% stays at a minimum. However, as the voltage increases from 2.8 to 3.6 V, the CO% increases considerably, in an approximately exponential manner. In practical terms, this means that the CO% remains high while the engine temperature is low, since the ECT voltage decreases as the engine temperature increases.
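As an illustration of how a rank-based coefficient of this kind can be computed, the following Python sketch ranks both signals (averaging tied ranks) and applies the Pearson formula to the ranks. The function names and the sample values are hypothetical and are not taken from the experimental dataset.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical ECT voltages (V) and CO percentages, for illustration only.
ect_volts = [3.6, 3.4, 3.1, 2.9, 2.6, 2.2, 1.8]
co_percent = [2.1, 1.4, 0.9, 0.5, 0.4, 0.45, 0.4]
rho = spearman(ect_volts, co_percent)
```

Since only the ordering of the samples matters, a single extreme CO reading changes the result far less than it would change a Pearson coefficient computed on the raw values.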
Accordingly, once the signals are processed, the development of the CO-ECT VARX emissions model is initiated as described in Section 3.3. For this purpose, the exogenous variable x(t) and the target variable y(t), which are the ECT voltage and CO% signals, respectively, are loaded into the program developed in GNU Octave. Seven models are then obtained by selecting different orders for the delays of the input and output variables. For each execution, the matrix of past observations corresponding to the training data, called the design matrix, is generated. Coefficients and constants are estimated using the least squares technique, and their components are separated. With the resulting VARX model and its parameters, a graphical representation and a specific linear mathematical model were developed for use in future sample projections.
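The design-matrix and least-squares steps described above can be sketched in Python for the single-output case. This is only an illustrative sketch under stated assumptions: the lag orders `p` (output) and `q` (input), the function names, and the choice of solving the normal equations by Gaussian elimination are hypothetical and do not reproduce the GNU Octave program of this work.

```python
import math

def build_design(y, x, p, q):
    """Rows [1, y(t-1..t-p), x(t-1..t-q)] and the matching targets y(t)."""
    start = max(p, q)
    rows, targets = [], []
    for t in range(start, len(y)):
        row = [1.0]
        row += [y[t - n] for n in range(1, p + 1)]
        row += [x[t - n] for n in range(1, q + 1)]
        rows.append(row)
        targets.append(y[t])
    return rows, targets

def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a square system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - s) / m[r][r]
    return sol

def fit_arx(y, x, p, q):
    """Least-squares estimate [constant, a_1..a_p, b_1..b_q] via normal equations."""
    rows, targets = build_design(y, x, p, q)
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear(xtx, xty)

# Synthetic demo: data generated by a known first-order model (not real CO/ECT data).
x_demo = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(200)]
y_demo = [0.0]
for t in range(1, 200):
    y_demo.append(0.1 + 0.5 * y_demo[t - 1] + 0.3 * x_demo[t - 1])
coef_demo = fit_arx(y_demo, x_demo, 1, 1)
```

Repeating the fit for several `(p, q)` pairs and scoring each one with the chosen criterion mirrors the model-order selection described in the text.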
Figure 8a,b display the graph of the VARX model and a test prediction with the best of the models tested. The prediction in Figure 8a is obtained with a 4-2 order model, whereas the prediction in Figure 8b is obtained with a 2-1 order model. For both predictions, the Akaike criterion resulted in about −3739.3, while the error metric values were about 0.0061061 and 0.0074449, respectively. On the other hand, Figure 9 presents the graph of the best of the three VARX models under the error metric criterion, which yielded a value of 0.0053971. As explained in Section 3.4, this criterion is a standard statistical metric for measuring model performance, so it can be inferred that the CO-ECT VARX 6-3 model performs better than the CO-ECT VARX 4-2 and CO-ECT VARX 2-1 models.
On the other hand, according to the general Equation (3) of a VARX model presented in
Section 2.3, the programming of proposed modeling also generates the mathematical model in its sixth-order polynomial form of delays as observed in Equation (12). This model can represent the behavior of CO in a generic ICE during the transient cold start phase, within the temperature ranges of thermostat opening and closing as well as within the following:
where Y(t) represents the percentage of CO at time t, and −0.0023455 is a constant. The coefficients of the delayed terms y(t − n) and x(t − n) are the autoregression coefficients of CO and ECT, respectively, indicating how past values influence Y(t), and ε(t) is the error term that captures the variability not explained by past values. The value of the Akaike information criterion (AIC), which was −3739.3, suggests an adequate model in terms of fit and simplicity compared to the alternative models calculated using the same autoregression strategy.
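For reference, the structure implied by these definitions can be written generically as follows. This is a sketch only: the coefficient symbols a_n and b_n, the constant symbol c, and the lag orders p and q are notation assumed here for illustration, and no estimated coefficient values are implied other than the constant quoted in the text.

```latex
Y(t) = c + \sum_{n=1}^{p} a_n \, y(t-n) + \sum_{n=1}^{q} b_n \, x(t-n) + \varepsilon(t),
\qquad c = -0.0023455
```

Here a_n weights the past CO values y(t − n), b_n weights the past ECT values x(t − n), and ε(t) is the error term.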
To evaluate the performance of the proposed model and compare it with other approaches, the obtained model is compared with a similar model obtained through simulation in GNU Octave, using the same initial delay criteria. Figure 10 shows the resulting simulation model, which is very similar to the one obtained with the methodology proposed in this work. For this model, the FED was 96.4% with the training dataset, indicating that the model explains approximately 96.4% of the variability in the training data. Such a high value suggests that the model fits the training data well and can capture the relationships between the input and output variables. However, evaluating the same simulation model with the Akaike criterion gives a value of around −10.07, which indicates that the model obtained through the proposed methodology is better than that of the simulation, since, according to this criterion, a lower (more negative) value indicates a better model. Accordingly, the use of this simulation tool has significant disadvantages compared to the model developed with the proposed methodology: (i) The functionalities available in GNU Octave are limited to the algorithms and methods included in the platform, which can restrict the flexibility and customization capability of the model. (ii) It may be subject to software version limitations, which can lead to additional costs and long-term compatibility issues. (iii) Depending on the level of detail provided by GNU Octave, users may have a less profound understanding of the structure and internal functioning of the model, which can hinder the interpretation of the results. (iv) There may be potential limitations in the metrics used to evaluate the model, either due to incompatibility with standard usage metrics or due to a lack of understanding of them. As in the previous case, the simulation model obtained in GNU Octave has its respective sixth-order polynomial form, as shown in Equation (13), according to the equation of the VARX models presented in the theoretical section.
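To make the AIC comparison concrete, the following Python sketch uses one common least-squares form of the criterion, AIC = n·ln(RSS/n) + 2k. Whether this is the exact expression used in Section 3.4 or by the simulation tool is an assumption; the function name `aic_least_squares` is hypothetical. Note that AIC values are only directly comparable when they are computed with the same formula on the same number of samples.

```python
import math

def aic_least_squares(residuals, num_params):
    """AIC = n * ln(RSS / n) + 2k (assumed form for a least-squares fit).

    residuals  : prediction errors y(t) - y_hat(t) on the evaluation data
    num_params : number of estimated coefficients k, including the constant
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * num_params

# Hypothetical residuals: the model with smaller errors gets the lower AIC.
aic_tight = aic_least_squares([0.01] * 500, 10)
aic_loose = aic_least_squares([0.05] * 500, 10)
```

The 2k term penalizes additional coefficients, so a higher-order model is preferred only if it reduces the residual sum of squares enough to offset the penalty.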
To quantify the accuracy of the obtained models, the error was estimated using a metric that is commonly employed in predictive model evaluation. The performance of the model obtained with the proposed methodology (Equation (12)) was compared to that of the model obtained by simulation (Equation (13)) using this error metric, as detailed in Section 3.4 “model evaluation”. This criterion allowed the performance of both models to be analyzed using the corresponding training data for each. As a result, a value of 0.0053971 was obtained for the developed VARX model, while a value of 0.0010756 was recorded for the simulation model. This evaluation suggests that the simulation model might be better; however, in addition to the disadvantages of the simulation-based model mentioned earlier, the model using the proposed technique has the following advantages: (i) There is complete control over all stages of the process, from variable selection to model specification and interpretation of results. (ii) There is flexibility to adjust and modify the model according to specific problem needs and data characteristics, allowing for a more precise adaptation to study-specific conditions. (iii) With the suggested methodology, deeper insight into the model’s structure and functioning is gained, facilitating result interpretation and identification of potential issues or limitations. (iv) The user is not limited by the functionalities available in a specific software tool, allowing for the implementation of customized techniques as needed. Additionally, Table 3 shows the comparative metric results of the alternative models using the proposed method and the model obtained by simulation. The values in Table 3 demonstrate a better performance of the developed VARX 6-3 model compared to the simulated ARX 6-3 model under the AIC criterion.
To validate the methodology of this work, the data from a new set of 500 samples of CO and ECT from the ICE obtained from training experimental data of the cold start idle speed test are used as input variables. The graph shown in
Figure 11 represents the response of the CO VARX model obtained using the proposed methodology to a new set of data. This figure demonstrates the model’s ability to respond to unknown data. The difference in signal patterns between the training and validation experimental data is a normal condition, as emissions may increase slightly over time during ICE operation at idle speed due to various causes. These include the temperature-induced deformation of mechanical components such as piston rings or valves, which can affect cylinder sealing and combustion efficiency, and differences in the performance of ignition system components, such as spark plugs, coils, or wires, during the ICE cold start test.
Although
Figure 11 serves as a reference to observe the behavior of the proposed model with unknown values, it is necessary to analyze the information provided by the additional metric proposed in the Methodology section. The negative FED in the training dataset (−73.2%) suggests that the model has a narrow-range fit, meaning it may have captured too many idiosyncrasies in the training data. The negative value is not physically significant in this context; it only indicates overfitting. On the other hand, the positive FED in the validation dataset (90.6%) indicates that the model fits well with new data. This suggests that the model generalizes well to data it has not seen before, which is a positive sign that the model may have good predictive ability on new and unknown data. To test the CO-ECT ARX model obtained through simulation and provide a second comparative criterion with respect to the VARX 6-3 CO-ECT model obtained with the proposed method, new data obtained from experiment two of the same cold start test were used.
Figure 12 depicts the graph illustrating the behavior of the simulated alternative model with a set of new data. The FED of −121.5% indicates that the simulation model fails to explain any variability with the new data for validation. This could suggest that this model does not generalize well to unknown data and fails to capture the underlying relationships between the variables in the validation dataset. Overall, such a significant discrepancy between the performance of the simulation model with the training data and the validation data signifies that it is overfitting to the specific details of the training data and cannot generalize correctly to new data.
The VARX 6-3 CO-ECT model developed using the suggested methodology has a negative FED in the training dataset, indicating a narrow-range fit. Nevertheless, its ability to generalize and fit well to new data makes it more reliable than the ARX 6-3 model obtained through polynomial regression estimation using GNU Octave, as shown in
Table 4. On the other hand, the ARX 6-3 model obtained through simulation has a high FED with the training data, suggesting a good fit to the data; however, in the validation dataset, it is negative, indicating that its performance is lower than that of the other model in question. The conclusion is that performance on unknown data is crucial in model evaluation, as successful predictions in previously unknown situations are required.
In order to further validate the proposed method, a comparative analysis was conducted with other established models commonly employed in similar research studies. For this purpose,
Table 5 has been included, showcasing the performance metrics of the proposed model alongside those of Long Short-Term Memory, Stationary Subspaces-Vector auto-regressive with exogenous, and Feedforward neural network models. By juxtaposing the results obtained from these various methodologies, a thorough evaluation of the effectiveness and robustness of the proposed approach is facilitated. This comparative analysis not only underscores the strengths of our method, but also provides valuable insights into its performance relative to other state-of-the-art models in the field.