Unsupervised Machine Learning to Detect Impending Anomalies in Testing of Fuel Economy and Emissions of Light-Duty Vehicles

This work focused on demonstrating the capability of unsupervised machine learning techniques in detecting impending anomalies by extracting hidden trends in the datasets of fuel economy and emissions of light-duty vehicles (LDVs), which consist of cars and light-duty trucks. This case study used the vehicles’ fuel economy and emissions testing datasets for vehicle model years 2015 to 2023 with a total of 34,602 data samples on LDVs of major vehicle manufacturers. Three unsupervised techniques were used: principal components analysis (PCA), K-Means clustering, and self-organizing maps (SOM). Results show that there are clusters of data that exhibit trends not represented by the dataset as a whole. CO vs. Fuel Economy has a negative correlation in the whole dataset (r = −0.355 for LDVs model year 2022), but it has positive correlations in certain sample clusters (e.g., LDVs model year 2022: r = +0.62 in a K-Means cluster where the slope is around 0.347 g-CO/mi/MPG). A time series analysis of the results of clustering indicates that Test Procedure and Fuel Type, specifically Test Procedure 11 and Fuel Type 26 as defined by the US EPA, could be the contributors to the positive correlation of CO and Fuel Economy. This detected peculiar trend of CO-vs.-Fuel Economy is an impending anomaly, as the use of Fuel 26 in emissions testing with Test Procedure 11 of US-EPA has been increasing through the years. With the finding that the clustered data samples with positive CO-vs.-Fuel Economy correlation all came from vehicle manufacturers that independently conduct the standard testing procedures and not from US-EPA testing centers, it was concluded that the chemistry of using Fuel 26 in performing Test Procedure 11 should be re-evaluated by US-EPA.


Introduction
Through the years, gas-fueled vehicles have been transformed by various means to improve performance in terms of fuel economy and emissions. With the guidance of regulatory agencies such as the US Environmental Protection Agency (US-EPA), better performance in terms of lower emissions and higher fuel economy to meet regulations has been the goal of vehicle manufacturers. Among the various types of transportation vehicles, the light-duty vehicle (LDV) category, which consists of cars and light-duty trucks [1], has the highest annual sales worldwide, amounting to 70-90 million vehicles per year from 2010 to 2020 [2]. Hence, LDVs have been the subject of early implementation testing of various regulatory procedures and standards for transportation vehicles. LDVs were the only vehicles covered when stringent emission standards were implemented by US-EPA in the 1960s-1970s; on-board diagnostics were first implemented only on LDVs in the 1990s; and a mandatory LDV manufacturer in-use testing program was implemented in

Data Source
The dataset used in this work was composed of the US-EPA datasets on vehicles used for testing fuel economy and emissions for LDVs with model years 2015 through 2023 [16]. The datasets were the combined results from vehicle testing done at the EPA's National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and from testing results from vehicle manufacturers that independently implemented the standard testing procedures and submitted their own test data to US-EPA. The whole dataset can be accessed from the GitHub repository for this work [19]: https://github.com/dhanfort/Cars22-FEandEmissions.git (accessed on 9 April 2022), and is also free to download from the US-EPA webpage [16]. Note that the LDVs model year 2023 dataset accounted only for the early reporting data; US-EPA typically releases an updated version covering the second half of the year.

Data Preprocessing
The raw datasets were preprocessed to extract only the key variables for unsupervised learning. An overview of the key variables and their definitions is shown in Table 1. The dataset was also checked for any emissions levels exceeding the limits set by US-EPA [1], and it was found that all samples were below the emission limits for light-duty vehicles and light-duty trucks.

Data Analysis
The dataset of vehicles with model year 2022 was first used to show the detailed analysis tools and discussions of results derived from unsupervised machine learning analytics (Figure 1). This particular vehicle model year was also the source of the most peculiar trends in clusters of emission results, as shown in the results and discussion sections. Then, the datasets of other model years, 2015 to 2023, were added to conduct a more comprehensive analysis of the peculiar trends observed in the 2022 dataset. In general, the number of observations in each working dataset for the various vehicle model years was as follows: 4390 for 2015; 4116 for 2016; 4077 for 2017; 4164 for 2018; 4061 for 2019; 3815 for 2020; 3549 for 2021; 3580 for 2022; and 2850 for 2023. In total, there were 34,602 observations from nine vehicle model years. The number of observations in any model year was well above the suggested minimum number of samples for clustering analysis [20].
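As a quick arithmetic check, the per-year observation counts quoted above can be summed to confirm the stated total. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# Observation counts per vehicle model year, as quoted in the text.
counts = {
    2015: 4390, 2016: 4116, 2017: 4077, 2018: 4164, 2019: 4061,
    2020: 3815, 2021: 3549, 2022: 3580, 2023: 2850,
}

total = sum(counts.values())
print(f"total observations: {total}")

# Share of each model year in the pooled dataset.
for year, n in counts.items():
    print(f"{year}: {n:5d} ({100 * n / total:.1f}%)")
```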

K-Means Implementation
The K-Means clustering algorithm was implemented in the SAS-JMP software, which uses the FASTCLUS procedure [21]. The optimal number of clusters was determined via the Cubic Clustering Criterion (CCC) fit statistic, which takes larger values for better-fitting models. Cluster counts from 2 to 10 were tested iteratively, and the optimal number of K-Means clusters was indicated by the maximum CCC.
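The sweep over candidate cluster counts can be sketched in Python as follows. This is not the SAS-JMP implementation: the data are synthetic, the K-Means routine is a plain Lloyd's algorithm with a deterministic farthest-point initialisation, and the Calinski-Harabasz pseudo-F statistic is swapped in for the CCC purely as a stand-in criterion that is also maximised at the preferred number of clusters. All function and variable names are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        # Distance of every sample to its nearest already-chosen seed.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def pseudo_f(X, labels):
    """Calinski-Harabasz pseudo-F (a stand-in for CCC; larger is better)."""
    n = len(X)
    ids = np.unique(labels)
    k = len(ids)
    within = sum(
        ((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum() for j in ids
    )
    total = ((X - X.mean(axis=0)) ** 2).sum()
    return ((total - within) / (k - 1)) / (within / (n - k))

# SYNTHETIC data: three well-separated groups (not the EPA dataset).
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(60, 2)) for c in blob_centers])

# Test 2 to 10 clusters and keep the count with the maximum criterion.
scores = {k: pseudo_f(X, kmeans(X, k)) for k in range(2, 11)}
best_k = max(scores, key=scores.get)
print(best_k)
```

With real measurements, the variables would typically be standardised before clustering so that no single unit of measure dominates the Euclidean distances.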

SOM Implementation
The SOM algorithm available via the MATLAB R2013a add-in SOM Toolbox [17] was used instead of the SOM in SAS-JMP, as the former gives the user more control over model configuration than the latter. A rectangular topographic map with a hexagonal lattice of 15 neurons by 12 neurons constituted the SOM model. This map size meets the size requirement for SOM to have enough datapoints that can hit most of the map neurons (best matching units, or BMUs). The rectangular map was chosen over the cylindrical and toroid maps that were also available from the Toolbox due to its lower quantization error during preliminary testing of the SOM models. The other details of the rectangular SOM and its coded implementation in MATLAB can be checked in the supplementary materials in the online repository for this work [19]. The optimal number of clusters from the trained SOM was determined using the Davies-Bouldin Index (DBI), which reaches its minimum value at the optimal number of clusters [17]. Cluster counts from 2 to 10 were tested iteratively, and the optimal number of rectangular SOM clusters was indicated by the minimum DBI.
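To make the selection rule concrete, the following sketch computes the Davies-Bouldin Index for a labelled set of vectors (for a SOM, these could be the codebook vectors of the trained map) and picks the labelling with the smallest index. It is a minimal NumPy illustration rather than the SOM Toolbox routine; the toy vectors and candidate labellings are hypothetical.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin Index: mean over clusters of the worst
    (scatter_i + scatter_j) / centroid_distance_ij ratio. Lower is better."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == j].mean(axis=0) for j in ids])
    # Scatter: mean distance of cluster members to their own centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == j] - c, axis=1).mean()
        for j, c in zip(ids, centroids)
    ])
    k = len(ids)
    worst = np.zeros(k)
    for i in range(k):
        worst[i] = max(
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        )
    return worst.mean()

# HYPOTHETICAL vectors: three tight unit squares placed far apart.
square = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
X = np.vstack([square, square + [10.0, 0.0], square + [0.0, 10.0]])

candidates = {
    2: np.array([0] * 4 + [1] * 8),      # two distant groups merged
    3: np.repeat([0, 1, 2], 4),          # one label per group
}
dbi = {k: davies_bouldin(X, lab) for k, lab in candidates.items()}
best_k = min(dbi, key=dbi.get)
print(dbi, best_k)
```

In a full sweep, each candidate labelling would come from clustering the trained map into 2 to 10 groups, and the count with the minimum index would be retained.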

Linear Discriminant Analysis
To evaluate the performance of the clusters from K-Means and SOM as separating planes of the dataset, linear discriminant analysis (LDA) in SAS-JMP [14] was done on all the working variables against the cluster assignments. This analysis was the only supervised learning step in the data analysis. Canonical plots were created to visually show the clusters in terms of the canonical variables. The prediction rates of the clusters were also determined. The receiver operating characteristic (ROC) curve of each cluster on the training data was also calculated to determine the trade-off between the sensitivity and (1-specificity) across a series of cut-off points through the clusters.
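The area under the ROC curve for one cluster against all others equals the probability that a randomly chosen member of that cluster receives a higher score than a randomly chosen non-member. The sketch below computes it directly from that pairwise definition. The scores are hypothetical stand-ins for, e.g., the posterior probabilities a discriminant model assigns to one cluster; this is not the SAS-JMP calculation.

```python
def auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a win."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# HYPOTHETICAL scores for membership in one cluster (one-vs-rest).
scores = [0.95, 0.90, 0.80, 0.40, 0.30, 0.10]
in_cluster = [True, True, False, True, False, False]

print(auc(scores, in_cluster))                    # 8 of 9 pairs ranked correctly
print(auc([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))  # perfect separation
print(auc([0.5, 0.5, 0.5, 0.5], [True, True, False, False]))  # uninformative scores
```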

PCA Implementation
PCA was implemented via the SAS-JMP software [14]. The eigenvalues of the principal components (PCs) were evaluated to determine the data variance captured by the first two PCs (PC1 and PC2), which were then used as projection axes on two-dimensional loadings plots to render the trends of all the variables relative to each other.
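The eigenvalue evaluation described here can be sketched with NumPy as an eigen-decomposition of the correlation matrix. This is an illustration on synthetic data, assuming PCA on correlations (standardised variables); it is not the SAS-JMP implementation, and the function name is hypothetical.

```python
import numpy as np

def pca_correlation(X):
    """PCA via eigen-decomposition of the correlation matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardise columns
    R = (Z.T @ Z) / (len(X) - 1)                       # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # ascending order
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()                # variance share per PC
    scores = Z @ eigvecs                               # sample coordinates on PCs
    return eigvals, eigvecs, explained, scores

# SYNTHETIC data: four variables driven by two underlying factors.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.5 * rng.normal(size=200),
    base[:, 1],
    -base[:, 1] + 0.5 * rng.normal(size=200),
])

eigvals, eigvecs, explained, scores = pca_correlation(X)
print("variance captured by PC1 + PC2:", explained[:2].sum())
```

The cumulative share `explained[:2].sum()` is the quantity used to judge whether a two-dimensional projection is adequate.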

Results and Discussion
Data re-projection onto the first few principal components (PCs) was done to reduce the dimensionality of the multivariate dataset, hence simplifying the comparison of variable trends. With this technique, variable trends in the whole (unsegmented) dataset were compared with the segmented dataset resulting from K-Means and SOM clustering. Then, statistical testing of model fits on the whole dataset and pertinent clusters was done to test the significance of parameter statistics.

Clean Technol. 2023, 5, FOR PEER REVIEW 5

Whole Dataset Fuel Economy and Emissions
When looking at the whole dataset projected onto the first few PCs (Figure 2), it was found that the first two PCs were enough to capture most of the variability in the dataset. That is, the eigenvalues of the first two PCs were higher than the remaining eigenvalues (Figure 2A,B). The combined PC1 and PC2 projections explain 60.8% of the data variability. The score plot of the samples projected on PC1 and PC2 (Figure 2C) shows that some samples are far from the centroid, which indicates the possibility of a distinct cluster of outliers [10]. These outliers increase the variance of the data, which must be reduced to minimize uncertainties in statistical parameters [14]. Variance reduction can be done through data segmentation such as K-Means and SOM clustering. The loadings plot (Figure 2D) shows the direction (from the center (0, 0)) of the variables relative to each other. For example, Fuel Economy points opposite to all emissions variables (THC, CO, CO2, and NOx), which means there is an inverse relationship between Fuel Economy and the emissions variables. As two variables approach an orthogonal relation, for example THC with either PWR or Axle Ratio, their correlation becomes negligible. Displacement and CO2 emission almost coincide, which indicates direct proportionality.
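The reading of the loadings plot rests on a simple identity: when each eigenvector is scaled by the square root of its eigenvalue, the inner products of the resulting loading vectors reproduce the correlation matrix exactly, and the first two columns give the approximation that a 2D loadings plot displays. The sketch below illustrates this with synthetic variables (not the EPA data); opposite directions correspond to negative correlation and near-orthogonal directions to negligible correlation.

```python
import numpy as np

# SYNTHETIC variables: v1 and v2 strongly opposed, v3 unrelated to both.
rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)
X = np.column_stack([
    a,
    -a + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: column j of the eigenvector matrix scaled by sqrt(lambda_j).
loadings = eigvecs * np.sqrt(eigvals)

R_full = loadings @ loadings.T                # all PCs: reproduces R
R_2pc = loadings[:, :2] @ loadings[:, :2].T   # what a PC1-PC2 plot conveys

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print("corr(v1, v2):", R[0, 1], " 2-PC approximation:", R_2pc[0, 1])
print("cosine between v1 and v2 loading vectors:", cosine(loadings[0, :2], loadings[1, :2]))
```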

Clustered Dataset Fuel Economy and Emissions
K-Means and SOM clustering algorithms were implemented and tested to segment the dataset into groups of similar attributes. The optimal number of clusters was determined, then the clusters with distinct trends were examined further to elucidate variable trends.

K-Means Clustering
The results of K-Means clustering are shown in Figures 3 and 4. The optimal number of clusters, which was found to be three, resulted in distinct segmentation of the whole dataset. A visual representation of the segmentation is evident in Figure 3, which shows the projection of the cluster-coded samples onto the first two PCs (Figure 3A) and onto the first three PCs (Figure 3B). The results of LDA on these three clusters (Figure 4) confirmed that the three clusters achieve a very high area under the ROC curve (AUC), which is in the range 0.9949-0.9999. AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, with AUC = 1.0 being a perfect classifier and AUC = 0.5 being a uniformly random classifier [14]. Hence, the three clusters from the K-Means clustering meet the requirements for a set of good segmentation planes for the dataset. The classification scores of these three clusters are summarized in Table 2. The prediction rates of the three clusters are close to one, with an overall percent misclassification of only 3.22%. The trends within each of these clusters were then examined using PCA (Figure 5).
As an unsupervised learning technique, K-Means is not guaranteed to produce clusters containing equal numbers of samples. Clusters 1 and 2 were assigned 1357 and 2144 samples, respectively, whereas Cluster 3 was assigned 79 samples (out of the 3580 total samples). When applying PCA in each cluster, Clusters 1 and 2 demonstrated variable trends similar to the whole dataset (see supplementary material for these results), whereas Cluster 3 showed a peculiar trend (Figure 5). For Cluster 3, the CO levels were directly proportional to the Fuel Economy levels (Figure 5D), which had the opposite relation when considering the whole dataset (Figure 2D). The Axle Ratio was also directly proportional to the CO levels in K-Means Cluster 3, but it did not have any effect on CO levels on average in the whole dataset (Figure 2D). The same trend exists for Axle Ratio and THC, with Axle Ratio inversely proportional to the THC levels in K-Means Cluster 3, but not having any effect on THC levels on average in the whole dataset (Figure 2D).

Self-Organizing Maps Clustering
The results of SOM clustering are shown in Figures 6 and 7. A unique set of results from SOM are the projections of the variables onto component planes (Figure 6B-I) after the SOM model training on the dataset. These are visual renderings of the relative levels of the variables at a specific neuron position (a unit cell on the map). The U-matrix, or unified distance matrix (Figure 6A), represents the Euclidean distance between the codebook vectors of neighboring neurons, and the high values in the map indicate regions of samples of distinct attribute levels [17]. The optimal number of clusters for the model year 2022 dataset was seven, which resulted in the minimum DBI during the SOM clustering (Figure 6J). Unlike K-Means, which showed very distinct segmentation of the dataset on a 2D canonical plot, SOM produced a set of clusters that have some overlaps when projected onto the first two canonical variables in the LDA (Figure 7A). This is expected, as the seven SOM clusters outnumber the two canonical variables used in the plot. Nonetheless, the AUCs of the seven SOM clusters were in the range of 0.9664-0.9977 (Figure 7B), which are still high AUC values [10]. The classification scores of these seven SOM clusters, which are in the prediction rate range of 0.794-0.955 (Table 3), are not as high as those of K-Means and have an overall percent misclassification of 14.72%.
The segmentation of samples in each SOM cluster was examined for any commonality with the K-Means clusters. The samples captured by SOM Cluster 1 are a subset of K-Means Cluster 3. That is, of the 78 samples in K-Means Cluster 3, 66 were also assigned to SOM Cluster 1 (Figure 8E). This indicates that these two clusters covered almost the same segment of the dataset. This can be visually verified in Figures 4A and 7A, which show these clusters as extreme groups in the first two canonical variables in LDA. The prediction rate of SOM Cluster 1 is 0.955, which is the second highest rate among the SOM clusters (Table 3).

Performance of K-Means and SOM Clustering
The data segmentation results of the K-Means and the SOM algorithms are not exactly the same, but they both capture clusters of samples, K-Means Cluster 3 and SOM Cluster 1, that exhibit similar trends. These clusters show some variable trends that the whole dataset does not represent. These findings demonstrate the capability of K-Means and SOM in extracting hidden trends in the bigger dataset. Of the 3580 total working samples, 78 (or 2.18% of 3580) are assigned to K-Means Cluster 3, and 66 (or 1.84% of 3580) are assigned to SOM Cluster 1. This is the kind of data mining problem where the clustering algorithms K-Means and SOM are needed: detecting outliers that make up a small percentage of the data, but that can have attributes with correlations different from the whole dataset [10,11]. The LDA results (Figures 4 and 7; Tables 2 and 3) also confirm this consistency of clustering results for K-Means Cluster 3 and SOM Cluster 1. Between the two clustering techniques, however, K-Means has a lower cluster misclassification rate of 3.32% (Table 2) than that of SOM at 14.72% (Table 3).
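The cluster shares quoted above follow directly from the sample counts, and the subset relation between the two clusters can be expressed as a set comparison. In the sketch below, the counts are those stated in the text, while the sample identifiers are hypothetical placeholders standing in for row indices of the 2022 dataset.

```python
total = 3580
n_kmeans_cluster3 = 78
n_som_cluster1 = 66

share_kmeans = 100 * n_kmeans_cluster3 / total
share_som = 100 * n_som_cluster1 / total
print(f"K-Means Cluster 3: {share_kmeans:.2f}% of samples")
print(f"SOM Cluster 1:     {share_som:.2f}% of samples")

# HYPOTHETICAL sample IDs; real IDs would be row indices in the dataset.
kmeans_ids = set(range(78))
som_ids = set(range(66))

overlap = kmeans_ids & som_ids
print("SOM Cluster 1 is a subset of K-Means Cluster 3:", som_ids <= kmeans_ids)
print(f"fraction of K-Means Cluster 3 also in SOM Cluster 1: {len(overlap) / len(kmeans_ids):.3f}")
```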

Bivariate Analysis on CO vs. Fuel Economy
Among the various clusters determined, Cluster 3 from K-Means and Cluster 1 from SOM showed peculiar trends regarding the relation of CO emissions to Fuel Economy. A bivariate analysis of this relation was done to determine summary statistics and model fitting. Figure 9 shows a summary of the results for the whole dataset samples (Figure 9A), K-Means Cluster 3 samples (Figure 9B), and SOM Cluster 1 samples (Figure 9C). These results show that the correlation of CO emissions and Fuel Economy is statistically different between the calculation on the whole dataset and the calculation on the clusters. The direction of proportionality changes from a negative correlation, r = −0.355 (Figure 9A), on the whole dataset to a positive correlation on the clusters: r = +0.62 on K-Means Cluster 3 (Figure 9B) and r = +0.491 on SOM Cluster 1 (Figure 9C). Model fitting on the data was also done to test the null hypothesis that the slopes of proportionality are zero, which would mean no functional linear relation between CO and Fuel Economy. Rejecting this null hypothesis would mean the slopes are statistically different from zero, which makes it unlikely that random error alone produced the correlations. The calculated slopes of the linear models fitted to the cluster data are different from zero; they are 0.347 CO (g/mi)/MPG and 0.298 CO (g/mi)/MPG for K-Means Cluster 3 and SOM Cluster 1, respectively, with [Prob>|t|] less than 0.001, which is significant at the 5% level (Figure 9B,C). In addition, the lack-of-fit test was done to confirm the fitting performance of the models; it tested whether the lack-of-fit error is zero (that is, negligible relative to the pure error) [14]. With F-statistic probabilities [Prob>F] = 0.2349 and 0.4215 for K-Means Cluster 3 and SOM Cluster 1, respectively, the lack-of-fit test cannot reject the null hypothesis at a 5% significance level, which means that the lack-of-fit error is statistically zero. This confirms the inference that the linear models for CO vs. Fuel Economy statistically fit the data in each cluster.
These results mean that when looking at the whole dataset, the statistical inference is that CO emission decreases with increasing Fuel Economy. On the other hand, when looking at the two clusters, K-Means Cluster 3 and SOM Cluster 1, the statistical inference is that CO emission increases with increasing Fuel Economy.
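The reversal of sign between the whole dataset and a small cluster can be reproduced on synthetic data. The sketch below computes the Pearson correlation, the least-squares slope, and the t-statistic of the slope for a pooled sample and for an embedded subgroup. All numbers are invented for illustration and are not the EPA measurements; the variable names are assumptions.

```python
import numpy as np

def fit_line(x, y):
    """Return Pearson r, OLS slope, and the slope's t-statistic."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    # Standard error of the slope from the residual variance.
    se = np.sqrt(resid @ resid / (n - 2) / ((x - x.mean()) ** 2).sum())
    return r, slope, slope / se

rng = np.random.default_rng(2)

# SYNTHETIC majority: CO falls as fuel economy rises.
mpg_main = np.linspace(15, 45, 200)
co_main = 1.5 - 0.03 * mpg_main + rng.normal(0, 0.05, 200)

# SYNTHETIC small subgroup: CO rises with fuel economy.
mpg_sub = np.linspace(20, 24, 10)
co_sub = 0.2 + 0.3 * (mpg_sub - 20) + rng.normal(0, 0.05, 10)

mpg_all = np.concatenate([mpg_main, mpg_sub])
co_all = np.concatenate([co_main, co_sub])

r_whole, slope_whole, t_whole = fit_line(mpg_all, co_all)
r_sub, slope_sub, t_sub = fit_line(mpg_sub, co_sub)
print(f"whole:    r = {r_whole:+.3f}, slope = {slope_whole:+.3f}")
print(f"subgroup: r = {r_sub:+.3f}, slope = {slope_sub:+.3f}, t = {t_sub:.1f}")
```

A subgroup this small barely changes the pooled statistics, which is why segmentation is needed before its opposite trend becomes visible. The p-value for the slope would follow from the t-statistic with n − 2 degrees of freedom.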

Other Notable Variable Correlations
There are other notable trends seen on the whole dataset and on the K-Means Cluster 3 and SOM Cluster 1 as shown by the multivariate correlation graphs in Figure 10.

Unsupervised Learning Uncovers an Impending Anomaly
With the clusters K-Means Cluster 3 and SOM Cluster 1 for LDVs model year 2022 showing peculiar trends relative to the whole dataset, the distributions of clustered data based on the categorical variables in the dataset were evaluated. These categorical variables were not used in the unsupervised clustering of the dataset done in the preceding discussions, but their dominance in K-Means Cluster 3 and SOM Cluster 1 could help explain the peculiar CO-vs.-Fuel Economy correlations. Two categorical variables were found to have dominant levels in K-Means Cluster 3 and SOM Cluster 1: Test Procedure and Fuel Type. Figure 11 shows a distribution analysis of Test Procedure and Fuel Type in the whole 2022 dataset and in K-Means Cluster 3. Note that because SOM Cluster 1 is a subset of K-Means Cluster 3, and because K-Means has a lower misclassification rate (3.32% in Table 2) than SOM, the distribution analysis compared only K-Means Cluster 3 against the whole dataset. The categorical level codes for the Test Procedure and Fuel Type are based on the US-EPA coding described in Tables 4 and 5.
To perform a more comprehensive evaluation of the influence of Test Procedure and Fuel Type, the datasets for other vehicle model years 2015, 2016, 2017, 2018, 2019, 2020, 2021, and 2023 were also analyzed in the same data analytics workflow applied to model year 2022 as depicted in Figure 1. This allowed for a trends analysis that leveraged the strength of the K-Means algorithm to test the hypothesis that increasing the use of Test Procedure 11 and Fuel 26 results in a higher tendency to have positive correlation of CO vs. Fuel Economy. This concept is based on the fact that segmentation of a dataset via K-Means results in more distinct clusters that exhibit particular trends as the number of observations increases [10,20]. That is, higher percentages of a particular test or fuel type implemented on LDVs should result in higher chances of their being clustered together due to the higher dominance of their influence on the features used in clustering. If the influence of a test procedure or a fuel type is not peculiar, then it should be clustered by K-Means with the majority of the dataset amid the increasing percentage of its count in the dataset; otherwise, its peculiar influence would stand out with its increasing count in the dataset. Figure 12 shows the graphical results of this analysis.
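The distribution comparison described above amounts to computing the share of each categorical level in the cluster and in the whole dataset, then checking which levels are over-represented in the cluster. The sketch below uses invented counts, with the label "26" standing in for one fuel-type code and "other" for all remaining codes; it does not reproduce the values behind Figure 11 or Figure 12.

```python
from collections import Counter

def level_shares(values):
    """Fraction of samples at each level of a categorical variable."""
    counts = Counter(values)
    n = len(values)
    return {level: count / n for level, count in counts.items()}

# HYPOTHETICAL fuel-type labels (not the actual dataset counts).
whole_fuel = ["26"] * 20 + ["other"] * 80
cluster_fuel = ["26"] * 9 + ["other"] * 1

whole = level_shares(whole_fuel)
cluster = level_shares(cluster_fuel)

# Enrichment > 1 means the level is over-represented in the cluster.
enrichment = {lvl: cluster.get(lvl, 0.0) / whole[lvl] for lvl in whole}
for lvl in whole:
    print(f"{lvl:>5}: whole {whole[lvl]:.2f}, cluster {cluster.get(lvl, 0.0):.2f}, "
          f"enrichment {enrichment[lvl]:.2f}")
```

Repeating the same calculation per model year would give the kind of time series used to examine whether a level's growing share coincides with its dominance in the peculiar cluster.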
The use of Fuel 26 as part of US-EPA's emissions testing standard fuel set has been increasing through the years (Figure 12B). The use of Fuel 26 in Test Procedure 11 has also been increasing through the years (Figure 12C). Applying the concept that the increasing sample size of a particular treatment can affect the clustering of samples with peculiar feature trends [20], it can be inferred that Fuel 26 and Test Procedure 11 may have been the factors behind the positive correlation of CO and Fuel Economy (Figure 12A). Isolating the effects of each factor may be difficult because Fuel 26 has been increasingly used in Test Procedure 11 in recent years. Also worth noting is that the originators of the data samples in the clusters with positive correlation of CO vs. Fuel Economy are the vehicle manufacturers (see captions of Figures 5 and 8) and not a US-EPA testing center. This was found in both the LDVs model years 2022 and 2023. Considering that manufacturers follow established test procedures and use test fuel standards independently of each other and of US-EPA, the independence of sampling was guaranteed in the clustered datasets. This also eliminates the possible issue of US-EPA testing centers being factors in the peculiar trends. This leads the inquiry to the chemistry of Fuel 26 being used in Test Procedure 11.

CO vs. Fuel Economy Anomaly in the Big Picture of LDVs Market
The fuel economy of vehicles has been a common parameter used in the valuation of vehicle performance, not just in the US, but also worldwide [24,25]. Hence, vehicle manufacturers have been aiming to constantly improve fuel economy ratings. Part of having these vehicles be available for purchase by consumers, however, is the need for certifications issued by regulatory agencies such as US-EPA that consider emissions performance in addition to fuel economy ratings. Emissions performance has been the center of legislative and regulatory issues; for example, the state of California in the US has been imposing stricter emission standards relative to US-EPA standards [26,27]. A higher fuel economy rating does not necessarily mean a good emissions rating, as shown in this work ( Figure 9). However, emissions ratings are the result not solely of vehicle engine design, but also of the implemented testing procedures and the test fuels used in testing, as shown in this work (Figure 11, Tables 4 and 5). Therefore, it is necessary for regulatory agencies to make sure the test procedures and test fuels are appropriate, especially for LDVs that are averaging sales of around 70-90 million vehicles per year [2]. Amid the efforts for the massive use of electric vehicles, the use of petrol-based vehicles is still the largest fraction of transportation worldwide, especially the LDVs category [28]. Gasoline and diesel are still the major energy sources, but new technologies are diffusing into the LDV sector in response to fuel efficiency and emissions standards [29]. The exact causal relations of these two factors cannot be determined using the datasets in this work, as strong correlation is not sufficient to model any mechanistic relations of variables. However, the correlations can be used to warrant some actions by US-EPA, such as re-evaluating the chemistry of Fuel 26 when it is used in Test Procedure 11. 
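The year-by-year share of tests that pair Fuel 26 with Test Procedure 11 (the quantity tracked in Figure 12C) can be tallied with a short sketch such as the one below. It is only an illustration under stated assumptions: the record layout `(model_year, test_procedure, fuel_type)`, the function name `yearly_share`, and the toy records are hypothetical, and the code `0` is a placeholder for any other procedure or fuel, not a US-EPA code.

```python
from collections import Counter


def yearly_share(records, test_procedure, fuel_type):
    """Fraction of each model year's records that use the given
    test procedure together with the given fuel type."""
    totals = Counter()
    hits = Counter()
    for year, proc, fuel in records:
        totals[year] += 1
        if proc == test_procedure and fuel == fuel_type:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy records (assumed, not taken from the US-EPA datasets).
# 11 and 26 are the codes discussed in the text; 0 stands for "any other".
toy_records = [
    (2021, 11, 26), (2021, 11, 0), (2021, 0, 0), (2021, 0, 0),
    (2022, 11, 26), (2022, 11, 26), (2022, 0, 0), (2022, 0, 0),
    (2023, 11, 26), (2023, 11, 26), (2023, 11, 26), (2023, 0, 0),
]

shares = yearly_share(toy_records, test_procedure=11, fuel_type=26)
for year, share in shares.items():
    print(year, f"{share:.2f}")
```

A share that rises across model years, alongside cluster-level correlations that turn positive, is the pattern the text interprets as an impending anomaly. The tally itself says nothing about causation.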
Fuel 26 and Test Procedure 11 supposedly simulate a cold start of a vehicle [16]. Previous work investigated cold-start emissions, compared them against the standard limits of California's LDVs surveillance program, found that the cold-start emissions in the actual setup were lower than the levels predicted by the standard model, and concluded that the importance of cold-start emissions may be overstated in emission inventories [22]. This also raises the question of how accurately the use of Fuel 26 with Test Procedure 11 models actual driving conditions, and some of the literature [23] has already demonstrated techniques for such an inquiry. Such evaluation may elucidate necessary adjustments to the established testing procedures and standard fuels [7,8].

CO vs. Fuel Economy Anomaly in the Big Picture of LDVs Market
The fuel economy of vehicles has been a common parameter used in the valuation of vehicle performance, not just in the US, but also worldwide [24,25]. Hence, vehicle manufacturers have been aiming to constantly improve fuel economy ratings. Making these vehicles available for purchase by consumers, however, requires certifications issued by regulatory agencies such as US-EPA that consider emissions performance in addition to fuel economy ratings. Emissions performance has been the center of legislative and regulatory issues; for example, the state of California in the US has been imposing stricter emission standards relative to US-EPA standards [26,27]. A higher fuel economy rating does not necessarily mean a good emissions rating, as shown in this work (Figure 9). However, emissions ratings are the result not solely of vehicle engine design, but also of the implemented testing procedures and the test fuels used in testing, as shown in this work (Figure 11, Tables 4 and 5). Therefore, it is necessary for regulatory agencies to make sure the test procedures and test fuels are appropriate, especially for LDVs, whose sales average around 70-90 million vehicles per year [2]. Despite efforts toward the mass adoption of electric vehicles, petrol-based vehicles still make up the largest fraction of transportation worldwide, especially in the LDV category [28]. Gasoline and diesel are still the major energy sources, but new technologies are diffusing into the LDV sector in response to fuel efficiency and emissions standards [29].
If emissions test procedures and test fuel types are found not to be the issue, then the findings in this work (Figure 9) imply that certain LDVs are emitting higher CO levels at higher Fuel Economies. The fact that the positive correlation of CO and Fuel Economy was detected in the test datasets of LDVs that meet emissions limits raises the questions: "What is the trend of CO and Fuel Economy in the LDVs that did not meet emissions limits?" and "Are Test Procedure and Fuel Type still probable significant factors for any emissions anomaly in the LDVs that did not meet emissions limits?" These are questions that this work may not be able to answer due to dataset limitations. Nonetheless, the data analytics workflow demonstrated in this work (Figure 1) would still be appropriate for answering such questions.

Conclusions
This study demonstrated that the unsupervised machine learning algorithms PCA, K-Means, and SOM can elucidate trends in a large collection of testing datasets on vehicle fuel economy and emissions of LDVs collected by US-EPA. The combined application of these techniques shows that variable trends for the whole dataset can be different from the variable trends within certain K-Means and SOM clusters. Among the bivariate trends that change significantly, the trend between Fuel Economy and CO emission levels differs markedly between the whole dataset and individual clusters. CO vs. Fuel Economy has a negative correlation in the whole dataset, but it has positive correlations in certain sample clusters. Upon performing a comprehensive analysis of datasets for LDVs model years 2015 to 2023, it was found that Test Procedure and Fuel Type could be the significant factors behind the positive correlation of CO and Fuel Economy. Specifically, the increasing use of Test Fuel 26 in Test Procedure 11 was found to be the probable cause. This is an impending anomaly, as the use of Fuel 26 in emissions testing with Test Procedure 11 of US-EPA has been increasing through the years. With the finding that the clustered data samples with positive CO-vs.-Fuel Economy correlation all came from vehicle manufacturers that independently conduct the standard testing procedures, it is suggested that the chemistry of using Fuel 26 in performing Test Procedure 11 be re-evaluated by US-EPA.

Data Availability Statement: The datasets used and generated in this work are provided open to the public via a GitHub repository for the paper. This repo also includes the MATLAB code for the SOM, and the SAS-JMP files for the statistical analyses. GitHub repo: https://github.com/dhanfort/Cars2 2-FEandEmissions.git (accessed 9 April 2022).