1. Introduction
With global urbanization accelerating, growing transportation demand has driven a continuous rise in vehicle populations, making mobile sources a leading contributor to atmospheric pollution [1,2] and raising widespread concern over sustainable development. Against this backdrop, countries around the world have been continuously updating emission regulations [3,4,5]. China has also proposed its “dual carbon” goals (carbon peaking by 2030 and carbon neutrality by 2060) [6] in an effort to achieve effective management and control of road emissions. Meanwhile, the rapid development of technologies such as the Internet of Vehicles, autonomous driving, and virtual calibration [7,8,9] has generated massive volumes of vehicle operation data [10]. These data not only provide a foundation for precise emission accounting but also place higher demands on the accuracy and real-time performance of emission accounting models [11].
In response to increasingly stringent emission regulations, vehicle emission reduction technology continues to advance. In terms of physical emission reduction, Hassan et al. [12] developed a nanocomposite-based sandwich filter with excellent adsorption performance, achieving removal rates of 88.5 ± 2.2%, 99.2 ± 0.5%, and 80.0 ± 1.8% for CO, SO2, and NOx, respectively. In addition, to achieve more intelligent emission management, an environmentally friendly filter combined with an Internet of Things (IoT) monitoring system [13] has been applied to vehicle emission reduction; it achieved a CO2 removal rate of 70 ± 3.4% while providing real-time online monitoring of emission data through a cloud platform. Complementing these physical measures, accurate emission accounting models are equally crucial for vehicle emission control. Current research on road vehicle emission accounting methods mainly follows two directions. On the one hand, various advanced algorithms, including machine learning, are employed for high-fidelity modeling. Wen et al. [14] proposed an expertise-guided NOx emission modeling method for hybrid electric vehicles (HEVs), peak-valley enhanced Gaussian process regression (PV-GPR), to accurately capture the mapping characteristics. PV-GPR achieved a root mean square error (RMSE) of 0.49, significantly outperforming both feedforward neural networks (RMSE = 0.84) and cascade neural networks (RMSE = 1.01), and it maintained an RMSE of 0.96 even with 50% missing data. Wang et al. [15] proposed a learning model based on a BP-Adaboost algorithm combined with a transfer learning strategy, constructing and training a real-world NOx emission sub-model for hybrid diesel vehicles under real-world driving conditions. Validation showed a coefficient of determination of 0.854 between the predicted instantaneous NOx emissions and the measured data. On the other hand, correction factors are used to calibrate emission data, thereby enhancing prediction accuracy. Lee et al. [16] proposed a method to derive emission factors across various road conditions by introducing correction factors based on relative differences in fuel consumption, informed by existing laboratory data on atmospheric pollutant emissions. This approach enhanced the credibility of emission factor estimations. Similarly, Wang et al. [17] addressed the significant deviation between calculated and measured highway vehicle emissions. By adjusting emission factors using measured fuel consumption data, the study reduced the bias of calculated CO and NOx emissions to within 5% of measured values.
During real-world driving, vehicle emissions are influenced by a combination of factors [18], including internal observed variables of the vehicle (such as engine speed and intake manifold absolute pressure) and external observed variables (such as driving speed, road gradient, and ambient temperature) [19]. Therefore, to construct a more effective emission accounting model, the impact of these multiple variables must be fully considered. However, directly processing high-dimensional variables not only incurs high computational costs but may also introduce redundant information, affecting both the accuracy and speed of the model. How to reduce the dimensionality of high-dimensional data effectively while retaining key information is therefore an important issue in constructing a precise and efficient emission model. To extract the essence of vast, high-dimensional datasets, previous studies have explored various feature parameter identification, correlation analysis, and dimensionality reduction methods to capture the most representative information for emission modeling [20]. Wang et al. [21] proposed a feature engineering approach that used gray correlation analysis and principal component analysis (PCA) to process 16 initial feature parameters, quantifying their correlation with NOx emissions and analyzing the correlation coefficients to eliminate redundant parameters, thereby facilitating rapid model convergence during training. Balogun et al. [22] addressed NO2 pollution prediction based on IoT emission sensors by proposing a hybrid model that integrated the Boruta algorithm with grid search optimization. The method selected key features from 14 IoT sensors, effectively eliminating low-relevance features and reducing data dimensionality. Mohammad et al. [23] applied data-driven dimensionality reduction to 37 variables, employing supervised learning methods (such as lasso regression and linear support vector machines) and unsupervised techniques (including PCA and factor analysis) to select and extract relevant features. These features were then used to model NOx, CO, HC, and soot emissions in order to assess the impact of dimensionality reduction on model accuracy.
However, the conventional dimensionality reduction methods described above typically require recalculation when handling incremental data, leading to high computational costs. Meanwhile, although data-driven predictive models are widely used, they often lack physical interpretability. It is therefore of great significance to construct a framework that achieves rapid incremental dimensionality reduction and builds the emission model on a physically meaningful formula. Hybrid electric vehicles (HEVs), as a transitional product between conventional internal combustion engine vehicles and fully electric vehicles, can deliver substantial fuel savings [24] while alleviating the range anxiety inherent to pure electric vehicles, and have thus become a major focus of current research. This study therefore focuses on real driving emission (RDE) data for HEVs. Specifically, t-distributed stochastic neighbor embedding (t-SNE) first projects the multivariate inputs into a compact representation, preserving the most prominent features while substantially cutting computational overhead. Subsequently, using these low-dimensional embeddings, a dictionary learning-based incremental dimensionality reduction approach constructs both a high-dimensional dictionary and a low-dimensional dictionary, enabling rapid and precise reduction of incoming data. Using straightforward matrix operations with these dictionaries, each incoming data point is embedded swiftly. Considering the long computation time of dictionary learning training, this study introduces the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), along with parameter optimization, to improve computational efficiency while maintaining the accuracy of dictionary learning. Meanwhile, the t-SNE embeddings are used to train a SuperLearner regression model so that, when new data are presented, the tailored correction factor is generated immediately through simple matrix calculation, simplifying the path to accurate, real-time NOx emission factor (EF_NOx) accounting.
However, a critical gap remains in the current literature: existing methods often fail to balance computational efficiency with prediction accuracy when processing the high-dimensional and nonlinear operating data of HEVs. To address this gap, the main contributions of this study can be summarized as follows: 1. By integrating dictionary learning with the SuperLearner model, a high-precision correction method for the NOx emission factor was established. 2. An incremental dimensionality reduction strategy was developed, which allowed for the fast processing of large-scale driving data through simple matrix operations, avoiding time-consuming retraining. 3. The method was tested using independent RDE datasets, demonstrating superior accuracy and generalization capability.
This study is structured as follows: Section 2 presents the methodologies employed in this research, including t-SNE, dictionary learning, and SuperLearner. Section 3 describes the analysis process and results for constructing the refined emission accounting model. Finally, conclusions are provided in Section 4.
2. Materials and Methods
To better illustrate the system proposed for HEV NOx emissions, which couples dictionary learning-based incremental dimensionality reduction with a SuperLearner regression model, Figure 1 summarizes the main steps of the method framework. The raw HEV data obtained through real driving emission (RDE) tests were first processed with data quality control. Based on the RDE test data, a formula for calculating the NOx emission rate was constructed from engine operational and emission data recorded by a portable emission measurement system (PEMS) from HORIBA (Japan). Because there was always a certain bias between the calculated and measured values of the NOx emission rate, a correction factor optimization method was proposed in this study to calibrate the calculated values against the measured NOx amount. Owing to the large number of feature variables obtained from RDE tests, dimensionality reduction was necessary to extract the information most relevant to emissions. Therefore, to enable fast, accurate NOx emission factor prediction, this study first applied t-SNE to embed the processed high-dimensional feature matrix into a low-dimensional space. Dictionary learning then established a mapping between the high- and low-dimensional dictionaries, allowing each new data point to be reduced in dimension via straightforward matrix operations. Additionally, SuperLearner was used to construct a model linking the low-dimensional parameters to the correction factors. For upcoming incremental data, the low-dimensional embeddings were derived through the dictionary learning-based incremental reduction method and then substituted into the constructed correction factor model to calculate the correction factors. These correction factors were used to improve the accuracy of the NOx emission rate calculations at high speed.
2.1. Description of Dataset with Quality Control
The dataset for this study was collected during RDE tests of ten HEVs by the China Automotive Engineering Research Institute Co., Ltd. in Chongqing, China. All ten HEVs are hybrid passenger cars with gasoline engines. The rated power of the selected vehicles is presented in Table 1. These vehicles cover representative low, medium, and high power ranges. This balance in the dataset helps the model learn and capture the differentiated features of vehicles with varying power performance.
The sample data included vehicle information, location information, vehicle speed, collection time, and the corresponding engine operational and emission data. The tests were conducted under three distinct driving conditions: urban (v ≤ 60 km/h), suburban (60 km/h < v ≤ 90 km/h), and highway (v > 90 km/h); an example is shown in Figure 2.
Initial processing revealed that certain key parameters were invalid. Therefore, quality control was first carried out to check the validity of the real-time measured data, ensuring that all values were nonnegative and that the dataset was reliable for practical use. The quality control process involved two steps: handling invalid values, such as setting negative vehicle speeds as missing and removing them, and conducting validity checks, for instance, correcting negative NOx emission values to zero to align with physical reality.
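The two quality control steps and the speed-based driving conditions can be sketched as follows. This is a minimal illustration only: the record layout and the key names `speed_kmh` and `nox_ppm` are hypothetical and are not taken from the test data.

```python
import math


def quality_control(records):
    """Apply the two quality-control steps to a list of per-sample records.

    Each record is a dict with the hypothetical keys 'speed_kmh' and 'nox_ppm'.
    """
    cleaned = []
    for rec in records:
        speed = rec["speed_kmh"]
        # Step 1: a negative (or missing) speed is invalid, so the record is removed.
        if speed is None or math.isnan(speed) or speed < 0:
            continue
        # Step 2: a negative NOx reading is not physical, so it is corrected to zero.
        cleaned.append({**rec, "nox_ppm": max(rec["nox_ppm"], 0.0)})
    return cleaned


def driving_condition(speed_kmh):
    """Classify a speed sample with the thresholds quoted in the text."""
    if speed_kmh <= 60:
        return "urban"
    if speed_kmh <= 90:
        return "suburban"
    return "highway"


samples = [
    {"speed_kmh": 35.0, "nox_ppm": 12.0},
    {"speed_kmh": -1.0, "nox_ppm": 8.0},   # invalid speed: removed
    {"speed_kmh": 95.0, "nox_ppm": -0.4},  # negative NOx: set to zero
]
cleaned = quality_control(samples)
labels = [driving_condition(r["speed_kmh"]) for r in cleaned]
```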
2.2. NOx Emission Factor Formulation
A method was proposed to calculate the instantaneous NOx emission factor based on the real-time collected volumetric NOx concentration in the exhaust, the engine intake air rate, and the fuel flow rate [25,26]. The basic formula was expressed in Formula (1). Furthermore, because the engine intake air rate and the fuel flow rate were not always both available, the averaged air–fuel ratio was used instead in Formulas (2) and (3) if one of them was invalid:
- (1) When the engine intake air rate Q_aM and the engine fuel flow rate Q_fv were both valid,
where EF_NOx was the instantaneous NOx emission factor of the vehicle (g/s); C_NOx(down) was the SCR downstream NOx concentration (ppm); Q_aM was the engine intake air rate (kg/h); Q_fv was the engine fuel flow rate (L/h); ρ_fuel was the gasoline fuel density, set at 0.73 kg/L; M was the molecular weight of NOx, taken as that of NO2 (46 g/mol) because a significant proportion of NO is rapidly converted into NO2 under ambient conditions; (1 − losses), where losses was the loss of air at the valve, was an instantaneous coefficient obtained by verifying the calculation results against PEMS data; and ρ_exhaust was the molar mass of the exhaust, assumed to be roughly equivalent to that of air and set at 29 g/mol.
- (2) When the engine intake air rate Q_aM was valid but the engine fuel flow rate Q_fv was invalid,
where α was the air–fuel ratio for gasoline engines.
- (3) When the engine fuel flow rate Q_fv was valid but the engine intake air rate Q_aM was invalid,
However, analysis of the available RDE test data revealed some limitations in the above formulas. Firstly, using a fixed α in Formulas (2) and (3) may bias the calculation results because it neglects the instantaneous variation in α with engine output power. Secondly, valid dynamic intake air flow data were required to calculate the instantaneous α. Therefore, from the viewpoint of the engine operating principle, the formula for EF_NOx adopted in the present study replaced the intake air rate with the difference between the engine exhaust rate and the engine fuel rate, as shown in Equation (4),
where Q_exhaust was the engine exhaust rate (kg/h) and Q_fuel was the engine fuel rate (kg/h).
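To make the roles of the quantities defined in this section concrete, the following sketch computes an emission factor under an assumed mass-balance form: the exhaust mass flow is converted to a molar flow, multiplied by the NOx mole fraction, and converted back to a mass flow of NOx. This form, the function names, and the default air–fuel ratio of 14.7 are assumptions made for illustration; they are not a reproduction of Formulas (1) to (4).

```python
M_NOX = 46.0        # g/mol, NOx treated as NO2 (value from the text)
M_EXHAUST = 29.0    # g/mol, exhaust treated as air (value from the text)
RHO_FUEL = 0.73     # kg/L, gasoline density (value from the text)


def exhaust_flow(q_am=None, q_fv=None, alpha=14.7):
    """Exhaust mass flow (kg/h) for the three validity cases.

    q_am is the intake air rate (kg/h) and q_fv the fuel flow rate (L/h);
    None marks an invalid signal. alpha = 14.7 is an assumed stoichiometric
    air-fuel ratio, not a value taken from the text.
    """
    if q_am is not None and q_fv is not None:   # case (1): both valid
        return q_am + q_fv * RHO_FUEL
    if q_am is not None:                        # case (2): fuel flow invalid
        return q_am * (1.0 + 1.0 / alpha)
    if q_fv is not None:                        # case (3): intake air invalid
        return q_fv * RHO_FUEL * (1.0 + alpha)
    raise ValueError("neither intake air rate nor fuel flow rate is valid")


def ef_nox(c_nox_ppm, exhaust_kg_h, losses=0.0):
    """Assumed mass-balance estimate of the NOx emission factor in g/s."""
    exhaust_mol_s = exhaust_kg_h * 1000.0 / 3600.0 / M_EXHAUST  # kg/h -> g/s -> mol/s
    return c_nox_ppm * 1e-6 * exhaust_mol_s * M_NOX * (1.0 - losses)


# Illustrative inputs: 50 ppm NOx, 100 kg/h intake air, 9 L/h fuel.
ef_both = ef_nox(50.0, exhaust_flow(q_am=100.0, q_fv=9.0))
ef_air_only = ef_nox(50.0, exhaust_flow(q_am=100.0))
```

Under the same assumption, the substitution described for Equation (4) would correspond to supplying the difference between the exhaust rate and the fuel rate in place of the intake air rate.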
2.3. t-Distributed Stochastic Neighbor Embedding (t-SNE)
Because the RDE tests in this study encompassed numerous parameters of interest, it was essential to employ a suitable dimensionality reduction technique to extract representative features and streamline subsequent computations. Given the nonlinear nature of the data, t-SNE was selected as the foundational dimensionality reduction method for the offline training phase, laying the groundwork for subsequent dictionary learning. t-SNE is a typical nonlinear dimensionality reduction method that effectively maps high-dimensional data into a low-dimensional space while preserving local structures and revealing the global structure of the data. In this study, the low-dimensional representations generated by t-SNE were used as training targets to construct both the high-dimensional and low-dimensional dictionaries. This strategy leverages t-SNE’s advantages in feature extraction while overcoming, through the dictionary learning framework, its inherent inability to embed unseen data.
t-SNE works by transforming the Euclidean distances between data points in the high-dimensional space into probability distributions that assign higher probabilities to similar points, and then constructing another set of probability distributions in the low-dimensional space. The core idea is to minimize the Kullback–Leibler divergence between the high-dimensional and low-dimensional probability distributions, thereby achieving dimensionality reduction. To improve robustness to outliers, t-SNE employs a symmetrized joint probability distribution, which not only simplifies the gradient computation but also accelerates the optimization process. Meanwhile, instead of using the conventional Gaussian distribution in the low-dimensional space, t-SNE uses a Student-t distribution with one degree of freedom. This heavy-tailed distribution effectively alleviates the “crowding problem” that often occurs when mapping high-dimensional data to a low-dimensional space, thereby better reflecting the global structure [27].
In summary, t-SNE effectively balances local and global data structures, making it one of the most widely used dimensionality reduction methods.
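The quantities described above can be written out in a short NumPy sketch: Gaussian-based joint probabilities in the high-dimensional space, Student-t joint probabilities in the low-dimensional space, and the Kullback–Leibler divergence between them. It is a simplified illustration, assuming a single fixed bandwidth `sigma` instead of the per-point perplexity search used by full t-SNE, and it evaluates the objective without optimizing the embedding.

```python
import numpy as np


def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.maximum(d2, 0.0)


def high_dim_affinities(X, sigma=1.0):
    """Symmetrized joint probabilities P built from Gaussian conditionals."""
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # a point is not its own neighbour
    cond = np.exp(logits - logits.max(axis=1, keepdims=True))
    cond /= cond.sum(axis=1, keepdims=True)       # conditional p(j|i), rows sum to 1
    return (cond + cond.T) / (2.0 * X.shape[0])   # joint p_ij, sums to 1 overall


def low_dim_affinities(Y):
    """Joint probabilities Q from a Student-t kernel with one degree of freedom."""
    w = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(w, 0.0)
    return w / w.sum()


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the objective that t-SNE minimizes over the embedding Y."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))   # synthetic stand-in for the feature matrix
Y = rng.normal(size=(30, 2))   # an arbitrary, unoptimized 2-D embedding
P, Q = high_dim_affinities(X), low_dim_affinities(Y)
kl = kl_divergence(P, Q)
```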
2.4. Dictionary Learning-Based Incremental Dimensionality Reduction
The goal of dictionary learning is to reconstruct the original data as a linear combination of a small number of dictionary atoms, effectively eliminating redundant information in the dataset [28]. The general model of dictionary learning can be expressed as follows:
where X represents the multivariate dataset, D_H is the data dictionary, C denotes the sparse representation coefficients, and ε refers to the relative error.
Suppose Y_1 was the low-dimensional dataset obtained from X through dimensionality reduction, and D_L was the dictionary for the low-dimensional data. For a given data point x_i, let c_i denote its encoding in the dictionary D_H, which corresponded to the i-th column of the encoding matrix. First, t-SNE was applied to X to obtain the corresponding low-dimensional dataset Y_1. Next, K items were randomly selected from X to initialize the high-dimensional dictionary D_H [29]. After constructing the initial D_H and encoding matrix C, an iterative optimization process was employed to refine these parameters. This iterative procedure continued until the reconstruction error fell below a predefined threshold or the maximum number of iterations was reached. The detailed computational process is illustrated in Figure 3.
Through dictionary learning and the iterative optimization process, the solution for the low-dimensional data dictionary D_L can be given as
When new data were encountered, their encoding matrix on the high-dimensional dictionary was first obtained. Then, by coupling with the low-dimensional dictionary, rapid dimensionality reduction of the new data was achieved. The dimensionality reduction formula is shown in Equation (7).
Dictionary learning encapsulates data characteristics by extracting a set of representative “atoms” from the original database, enabling a compact representation of the entire data. For new incremental data, only a sparse code computed using the pre-learned dictionary is required, without the need to rebuild the full dictionary each time. This approach significantly reduces iteration time and overall computational cost, offering an efficient solution for incremental data dimensionality reduction.
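The incremental procedure can be sketched in NumPy as follows. Several choices here are assumptions made for illustration rather than details confirmed by the text: samples are stored as columns, the sparse codes are obtained from an l1-penalized least-squares problem solved with FISTA, the dictionary update is a least-squares step, D_L is fitted by least squares so that Y_1 ≈ D_L C, and a random linear projection stands in for the t-SNE embedding.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fista_sparse_code(D, X, lam=0.1, n_iter=200):
    """Minimize 0.5*||X - D C||_F^2 + lam*||C||_1 over C with FISTA."""
    L = np.linalg.norm(D, 2) ** 2                 # Lipschitz constant of the gradient
    C = np.zeros((D.shape[1], X.shape[1]))
    Z, t = C.copy(), 1.0
    for _ in range(n_iter):
        C_next = soft_threshold(Z - D.T @ (D @ Z - X) / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Z = C_next + ((t - 1.0) / t_next) * (C_next - C)   # momentum step
        C, t = C_next, t_next
    return C


def normalize_columns(D):
    return D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)


def learn_dictionaries(X, Y1, K, n_outer=10, lam=0.1, seed=0):
    """Alternate sparse coding and dictionary updates, then fit D_L."""
    rng = np.random.default_rng(seed)
    # Initialize D_H with K randomly selected samples, as described in the text.
    D_H = normalize_columns(X[:, rng.choice(X.shape[1], size=K, replace=False)])
    for _ in range(n_outer):
        C = fista_sparse_code(D_H, X, lam)
        D_H = normalize_columns(X @ np.linalg.pinv(C))   # assumed least-squares update
    C = fista_sparse_code(D_H, X, lam)
    D_L = Y1 @ np.linalg.pinv(C)                         # assumed fit of Y1 ≈ D_L C
    return D_H, D_L


def reduce_new(D_H, D_L, X_new, lam=0.1):
    """Embed incoming samples: sparse-code on D_H, then map through D_L."""
    return D_L @ fista_sparse_code(D_H, X_new, lam)


rng = np.random.default_rng(1)
X = rng.normal(size=(8, 200))        # synthetic features, one sample per column
Y1 = rng.normal(size=(2, 8)) @ X     # stand-in for the t-SNE embedding of X
D_H, D_L = learn_dictionaries(X, Y1, K=20)
Y_new = reduce_new(D_H, D_L, rng.normal(size=(8, 5)))
```

Once D_H and D_L are stored, embedding new samples requires only the sparse coding step and one matrix product, which is the source of the speed advantage over rerunning the full embedding.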
2.5. SuperLearner Regression Method
In this study, t-SNE was first applied to reduce the dimensionality of the original high-dimensional dataset. The resulting low-dimensional embeddings were then used both to obtain a low-dimensional dictionary through dictionary learning and as inputs to a regression model for predicting the correction factor of the NOx emission factor.
In predictive modeling problems, selecting appropriate machine learning algorithms and hyper-parameters for a specific dataset is very important. The SuperLearner algorithm mitigates the problem of selecting a single optimal learner by allowing multiple learners to be considered together [30]. Initially, a k-fold split of the data is pre-defined in SuperLearner, and the various algorithms and configurations are evaluated on the same split. The out-of-fold predictions from all models are then retained and used to train a meta-learner, which learns how best to combine the predictions [31].
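The stacking procedure described above can be sketched in NumPy as follows. The base learners (ridge regression and k-nearest-neighbour averaging), the unconstrained least-squares meta-learner, and the synthetic data are all assumptions chosen to keep the example self-contained; they are not the learners used in this study.

```python
import numpy as np


def ridge(lam):
    """Base learner: ridge regression (the intercept is penalized too, for brevity)."""
    def fit(X, y):
        A = np.column_stack([np.ones(len(X)), X])
        w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
        return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ w
    return fit


def knn(k):
    """Base learner: mean target of the k nearest training points."""
    def fit(X, y):
        def predict(Xn):
            d2 = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
            return y[np.argsort(d2, axis=1)[:, :k]].mean(axis=1)
        return predict
    return fit


def super_learner(X, y, learners, n_folds=5, seed=0):
    """Fit base learners on a shared k-fold split and stack their predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)  # one fixed split for all learners
    Z = np.zeros((len(X), len(learners)))                     # out-of-fold predictions
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        for j, fit in enumerate(learners):
            Z[test_idx, j] = fit(X[train_idx], y[train_idx])(X[test_idx])
    weights, *_ = np.linalg.lstsq(Z, y, rcond=None)           # meta-learner on out-of-fold predictions
    fitted = [fit(X, y) for fit in learners]                  # refit every learner on all data
    return lambda Xn: np.column_stack([f(Xn) for f in fitted]) @ weights


rng = np.random.default_rng(0)
X = rng.normal(size=(240, 2))                                 # stand-in for 2-D embeddings
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=240)
predict = super_learner(X[:200], y[:200], [ridge(1.0), knn(3), knn(15)])
mse = float(np.mean((predict(X[200:]) - y[200:]) ** 2))
```

Because the meta-learner only sees out-of-fold predictions, its weights reflect how each base learner performs on data it was not trained on.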
2.6. Data Splitting and Validation Protocol
To evaluate the proposed framework effectively and prevent data leakage, this study adopted a vehicle-based dataset split. During the training phase, RDE test data from 10 HEVs were used for dictionary learning and regression model training. During the validation phase, independent datasets from 3 additional HEVs were used to validate model performance. This split ensured that the validation vehicle data were completely independent of the training phase, so that the evaluation results reflect the model’s ability to predict emissions from unseen vehicles under real conditions.
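A vehicle-based split of this kind can be sketched as follows. The record layout, the key `vehicle_id`, and the vehicle labels are hypothetical; the point is only that every record of a held-out vehicle goes to validation, so no vehicle contributes to both sets.

```python
def split_by_vehicle(records, validation_ids):
    """Hold out every record of the listed vehicles; the rest are used for training."""
    held_out = set(validation_ids)
    train = [r for r in records if r["vehicle_id"] not in held_out]
    validation = [r for r in records if r["vehicle_id"] in held_out]
    return train, validation


# Thirteen hypothetical vehicles with three samples each.
records = [{"vehicle_id": f"HEV{v:02d}", "t": t} for v in range(1, 14) for t in range(3)]
train, validation = split_by_vehicle(records, ["HEV11", "HEV12", "HEV13"])

train_vehicles = {r["vehicle_id"] for r in train}
validation_vehicles = {r["vehicle_id"] for r in validation}
```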