In this study, we used three indicators to evaluate the classification systems: accuracy, execution time, and CPU utilization. Accuracy refers to the ratio of correctly classified samples to the total number of samples. Execution time refers to the total time taken by the model from training to evaluation through 10-fold cross-validation. CPU utilization refers to the proportion of CPU resources consumed during the model’s execution. This study carried out the experiments on a computer equipped with an Intel(R) Core (TM) i5-12400F processor, 32.0 GB of memory, an NVIDIA GeForce RTX 4060 GPU, and 1 TB of SSD storage, running the Windows 10 operating system.
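As an illustration of how these three indicators can be gathered, the following is a minimal sketch using synthetic stand-in data; the actual dataset, models, and instrumentation are not shown in this paper, and CPU utilization would additionally require process-level sampling (e.g., with a system monitor), which is omitted here.

```python
# Sketch: measuring accuracy and execution time for one model via 10-fold CV.
# The dataset here is a synthetic placeholder, not the collision data.
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=90, n_informative=20,
                           n_classes=4, random_state=0)
model = KNeighborsClassifier()

start = time.perf_counter()
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
elapsed = time.perf_counter() - start   # execution time in seconds

accuracy = scores.mean()                # ratio of correctly classified samples
print(f"accuracy={accuracy:.3f}, time={elapsed:.3f}s")
```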
4.1. Results for a Single Multi-Class Classification System
Based on the principle of the FS method shown in
Figure 6, this study configured each FS method as follows to select the optimal feature subset. The filtering threshold for PCC and MI is set to 0.2, based on the analysis in
Appendix A.1, to avoid excessive degradation of model performance and empty feature subsets. For SFS and SBS, we used the Gaussian Naive Bayes (GaussianNB) model [
47] as the wrapped classifier, with accuracy as the metric for evaluating candidate subsets. We chose GaussianNB for its low computational resource requirements, in line with our goal of minimizing computational cost. ET was configured with the default initialization settings of the sklearn library to avoid unnecessary complexity. Before performing FS and training the ML models, the dataset was standardized to eliminate the impact of differing feature scales. Standardization makes the subsequent FS methods and ML model computations more effective and can also accelerate model convergence during training.
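The five FS configurations described above can be sketched as follows. This is an illustrative reconstruction on synthetic data: the SFS subset size of 9 is fixed here only to mirror the reported result (in practice the search determines it), and passing `direction="backward"` would give SBS analogously.

```python
# Sketch of the five FS setups: thresholds and defaults follow the text;
# the data below is a synthetic stand-in for the acceleration features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import (SelectFromModel,
                                       SequentialFeatureSelector,
                                       mutual_info_classif)
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=90, n_informative=20,
                           n_classes=4, random_state=0)
X = StandardScaler().fit_transform(X)   # standardize before FS

# PCC: keep features whose |Pearson correlation with the label| >= 0.2
pcc = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
pcc_idx = np.where(pcc >= 0.2)[0]

# MI: keep features whose mutual information with the label >= 0.2
mi = mutual_info_classif(X, y, random_state=0)
mi_idx = np.where(mi >= 0.2)[0]

# SFS: wrapper selection with GaussianNB, scored by accuracy
# (direction="backward" would implement SBS in the same way)
sfs = SequentialFeatureSelector(GaussianNB(), direction="forward",
                                scoring="accuracy", n_features_to_select=9)
sfs.fit(X, y)

# ET: embedded selection from default ExtraTrees feature importances
et = SelectFromModel(ExtraTreesClassifier(random_state=0)).fit(X, y)

print(len(pcc_idx), len(mi_idx), sfs.get_support().sum(), et.get_support().sum())
```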
Figure 7 shows the number of features selected by each FS method. All methods substantially reduced the feature dimension relative to the 90 original features derived from the XYZ-axis acceleration samples. Even MI, which retains the most features, selected only 23 of the 90. SFS selected the fewest, only 9, followed by ET with 11 and PCC with 15. SBS selected 19 features, only 4 fewer than MI.
This study evaluated the accuracies of the five ML models using the features selected by the five FS methods, respectively.
Table 4 shows the results, including those obtained from using the original data. From the table, we can see that all FS methods show a similar level of accuracy as the original data. The accuracy of some FS methods is even better than that of the original data. For example, the accuracy of DT using 90 features from the original data is 93%, while it is 94.4% for the SFS method with only 9 features. The accuracy of KNN using 90 features from the original data is 92.9%, while this is improved to 97.1% for the ET method with 11 features. The accuracy of SVM using 90 features from the original data is 94.9%, while this is improved to 97.1% for the SBS method with 19 features.
Comparing all the accuracies of the ML model and FS method combinations, the best-performing combination is XGBoost and SBS, which achieved an accuracy of 98.5%. Although this matches the result obtained with the original data, the number of features was reduced from 90 to 19, a significant reduction in dimensionality. This shows that, from the perspective of saving computing resources, combining XGBoost with SBS is preferable to applying XGBoost alone to the original 90 acceleration features.
Table A2 and
Table A3 show the execution time and CPU utilization of each ML model for different FS methods. We can see that FS has a significant effect on reducing time and CPU utilization. For example, the time taken by RF for the original data is up to 2.392 s, while the time taken by RF using the features selected by ET is reduced to 1.099 s. The CPU utilization of XGBoost for the original data is up to 25.40%, while it is reduced to 14.75% when using SFS. Among all the combinations, the combination that takes the least time is KNN and SFS, which takes only 0.029 s. In terms of CPU utilization, the combination of KNN and ET is relatively good, consuming only 0.10% of CPU.
Looking at the results, we found that the combinations with the best performance in terms of accuracy, execution time, and CPU utilization are different. By observing the values of the three indicators, we determined the following two criteria for evaluating the best model:
(1) The minimum accuracy must be maintained above 95%.
(2) Models with high accuracy, low execution time, and low CPU utilization should be chosen as much as possible.
According to these two criteria, we can select a combination of ML model and FS method that balances accuracy and efficiency. By these criteria, the best combination is KNN and ET. Its accuracy of 97.1% exceeds the 95% criterion and, although not the highest, this combination has the lowest execution time and CPU utilization, at 0.300 s and 0.10%, respectively. Under the same ET feature selection, RF and XGBoost achieved higher accuracies of 98.0% and 98.3%, respectively, but at higher cost: RF, for example, required 1.099 s of execution time and 1.65% CPU utilization, both higher than the selected combination of KNN and ET.
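The two selection criteria can be expressed as a simple filter-then-rank procedure. The sketch below uses illustrative placeholder numbers, not the full result tables.

```python
# Sketch: applying the two criteria to pick the best (model, FS) combination.
# Tuples: (model, fs, accuracy, time_s, cpu_pct) -- placeholder values only.
results = [
    ("KNN",     "ET", 0.971, 0.300, 0.10),
    ("RF",      "ET", 0.980, 1.099, 1.65),
    ("XGBoost", "ET", 0.983, 1.500, 14.75),
    ("DT",      "MI", 0.930, 0.050, 0.30),
]

# Criterion 1: keep only combinations whose accuracy exceeds 95%
candidates = [r for r in results if r[2] > 0.95]

# Criterion 2: among those, prefer low execution time, then low CPU utilization
best = min(candidates, key=lambda r: (r[3], r[4]))
print(best[:2])   # -> ('KNN', 'ET')
```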
4.2. Results for Separate Binary Classification Systems
To explore the best way to identify container collision positions, in this experiment, we also evaluated the four binary classification systems where each selected class, such as top, bottom, left, or right, is distinguished from all other classes. The selected class is labeled as positive (1), and all other classes are combined as negative (0).
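The relabeling described above amounts to deriving four one-vs-rest label vectors from the multi-class labels; a minimal sketch (with a short example label array) is:

```python
# Sketch: deriving the four binary label vectors from multi-class labels.
# Class indices 0-3 correspond to top, bottom, left, and right, respectively.
import numpy as np

y = np.array([0, 1, 2, 3, 0, 2, 1, 3])   # example multi-class labels

# For each class c: samples of class c -> 1 (positive), all others -> 0
binary_labels = {c: (y == c).astype(int) for c in range(4)}

print(binary_labels[0])   # top vs. rest -> [1 0 0 0 1 0 0 0]
```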
Figure 8 shows the number of features selected by the five FS methods for each classification system. In this study, we call the classification systems Class 0, Class 1, Class 2, and Class 3 according to the top, bottom, left, and right positions, respectively. The minimum number of features for Class 0 is given by MI, which selects only 1, while the minima for Class 1, Class 2, and Class 3 are given by SFS, which selects 3, 4, and 2 features, respectively. In contrast to
Figure 7, where the minimum number of features was 9 (for SFS), the binary classification systems reduce the feature dimensions more significantly, obtaining smaller feature subsets.
Table 5,
Table A4 and
Table A5 show the accuracy, execution time, and CPU utilization for Class 0. Based on the criteria set above, the best combination is KNN and PCC. First, its accuracy of 95.8% exceeds the 95% criterion. Second, its execution time of 0.031 s is the shortest among all combinations with an accuracy above 95%. Finally, its CPU utilization of 0.20% is the minimum among all combinations.
Table 6,
Table A6 and
Table A7 show the accuracy, execution time, and CPU utilization for Class 1. Overall, KNN and SBS form the best combination, with an accuracy of 97.9%, an execution time of 0.03 s, and a CPU utilization of 0.15%. The accuracy is higher than the criterion of 95%. Although the execution time is greater than that of the KNN and SFS combination, the accuracy is higher. Furthermore, the CPU utilization is the least, only 0.15%.
Table 7,
Table A8 and
Table A9 show the accuracy, execution time, and CPU utilization for Class 2. The best combination is DT and SFS. Its accuracy is 99.1%, which is higher than the criterion of 95%. In addition, it has the lowest execution time and CPU utilization, only 0.013 s and 0.25%, respectively.
Table 8,
Table A10 and
Table A11 show the accuracy, execution time, and CPU utilization for Class 3. The best combination is DT and ET, with an accuracy of 98.0%, above the 95% criterion. Although DT+MI, DT+SFS, and DT+SBS have shorter execution times (0.013 s, 0.011 s, and 0.015 s, respectively) than DT+ET's 0.018 s, their accuracies of 97.7%, 95.6%, and 97.0% are all lower than DT+ET's 98.0%. Furthermore, the combination of DT and ET has the lowest CPU utilization, only 0.10%.
Table 9 summarizes the best combination for each classification system and its performance. The performances differ considerably. Class 2 performs best, with the highest accuracy (99.1%), the lowest execution time (0.013 s), and the lowest CPU utilization (0.05%). Class 0 performs worst, with an accuracy of 95.8%, an execution time of 0.031 s, and a CPU utilization of 0.20%. Comparing these with the best performance of the single multi-class classification system (accuracy 97.1%, execution time 0.300 s, and CPU utilization 0.10%), we see that the single multi-class system produces more stable results than using a separate binary classification system for each target position.
Figure 9 shows the accuracy heat map for the four classes. The vertical axis represents the ML model, and the horizontal axis represents the FS method. Each number is the accuracy of the corresponding combination; the closer the color is to yellow, the higher the accuracy, and the closer the color is to blue, the lower the accuracy. We see that the overall color for Class 2 is closest to yellow, followed by Class 3 and Class 1. In contrast, Class 0 shows that the overall color is closest to dark green. This indicates that all combinations for Class 0 exhibit relatively poor performance. According to
Table 5, the combination with the worst classification performance for Class 0 is DT and MI, with an accuracy of 80.2%, while the worst performances for Class 1, Class 2, and Class 3 do not fall below an accuracy of 94.0%.
Figure 10 shows the distribution of the data for the four class systems after applying two-dimensional Principal Component Analysis (PCA) [
48]. We found that the data distribution for Class 0 is less concentrated than those of the other classes, which explains why the classification performance for Class 0 is poor. Collisions at different positions produce differently distributed data. During data collection, collisions were applied at different positions on the container. When a collision occurred, the container usually underwent a slight offset along the direction of the collision, so the generated data are distributed more tightly along that direction, making them easier for the ML model to identify and classify. For example, in
Figure 10, Class 2 and Class 3 represent the data distributions for the left and right positions, respectively, and both are relatively concentrated. For Class 2, even the worst combination, DT and MI, achieved an accuracy of 98.6%, while the best combination, RF and PCC, achieved an accuracy of 99.9%.
In contrast, when the collision occurs at the top of the container, a displacement caused by the collision is less likely to occur since the container usually rests on the ground. The container can only vibrate in place, producing data that are more widely distributed than for the other positions and correspondingly more difficult to classify. As we can see from the results in
Table 7 and
Figure 9, the performance for Class 2 is high, meaning Class 2 is easy to classify, while the performance for Class 0 is low, meaning Class 0 is the most difficult to classify. This shows that the collision data collected by the sensor have different properties depending on the collision position.
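The two-dimensional PCA projection used for this inspection can be sketched as follows, on synthetic stand-in data (the actual collision dataset is not reproduced here):

```python
# Sketch: projecting standardized features to two principal components
# to inspect class separability, as in the PCA analysis described above.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=90, n_informative=20,
                           n_classes=4, random_state=0)
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
print(X2.shape)   # (300, 2) -- each sample reduced to two components
```

The two components can then be scatter-plotted per class to visualize how concentrated each class's distribution is.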
In conclusion, the results show that it is more reasonable to use the single multi-class classification system with the combination of KNN and ET than separate binary classification systems. Unlike the single multi-class system, the performances of the separate binary classification systems vary greatly among the different categories. Therefore, the single multi-class classification system is the most stable option and produces the best overall performance.
Appendix A.3 shows the feature changes after FS. The results show that FS can effectively preserve the feature properties and facilitate the efficient training of model classification.
Appendix A.4 shows the precision, recall, and F1 score of the single multi-class classification system. The results show that our selected best combination, KNN and ET, performs stably. Furthermore, a single system is operationally simpler, since maintaining separate systems demands considerable additional work and a more complicated workflow.
4.3. Statistical Analysis of Results for Best Classification System
To verify the credibility of the classification system results constructed by our optimal combination of KNN + ET, we employed the Friedman test from three perspectives—accuracy, execution time, and CPU utilization—to statistically analyze the performance of different models (DT, KNN, SVM, RF, XGBoost) and different FS methods (PCC, MI, SFS, SBS, ET). The Friedman test is a non-parametric test that is suitable when the data is repeatedly measured [
49], that is, the same group of subjects is measured multiple times at different points in time or under different conditions. It is often used to compare differences between three or more matched groups. The reason for choosing the Friedman test is that, in this study, each ML model or FS method is evaluated under multiple conditions of the other factor. In addition, our sample size is relatively small, and multiple treatment groups (≥3 groups) need to be compared simultaneously. The Friedman test is a non-parametric alternative to repeated-measures ANOVA that meets the requirements of our research.
In this study, the statistical analysis focused on exploring the following performance results:
(1) Whether there is a statistically significant difference in the performance of the five models under different FS methods.
(2) Whether there is a statistically significant difference in the performance of the five FS methods under different models.
Table 4,
Table A2 and
Table A3 contain the accuracy, execution time, and CPU utilization results of the five models (rows) under the five FS methods plus the original dataset features (columns), respectively, forming a 5 × 6 data matrix that also includes the performance of our best combination, KNN + ET. These performance results are all derived from the same dataset, ensuring that each model–FS combination was tested under consistent conditions. When the ML model is treated as the factor, each model is evaluated under the five FS methods plus the original dataset features (paired measurements); when the FS method is treated as the factor, each of the five FS methods plus the original dataset features is evaluated under the five models (paired measurements). This study used Python with the pandas library (version 2.2.2) for data processing and the scipy.stats module (version 1.15.3) for statistical testing to implement the Friedman test. The data were initially stored in a wide format (Model × FS Method) and then converted to a long format for grouping analysis. We conducted two separate Friedman tests to evaluate differences at the model level and the FS method level, computing a χ2 statistic for each test. We used a significance level (α) of 0.05 to determine statistical significance, where results with
p < 0.05 were considered statistically significant. In this study,
H0 and
H1 are used to represent the null hypothesis and alternative hypothesis, respectively, in the statistical analysis process.
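A minimal sketch of the two tests using `scipy.stats.friedmanchisquare` follows; the matrix values are random placeholders rather than the measured results, and the pandas wide-to-long conversion step is skipped since the function accepts the groups directly.

```python
# Sketch: the two Friedman tests on a 5x6 accuracy matrix
# (rows = 5 models, columns = original features + five FS methods).
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
acc = rng.uniform(0.90, 0.99, size=(5, 6))   # placeholder 5x6 matrix

# Model-level test: each row (model) is one treatment,
# measured across the six paired feature conditions.
stat_m, p_m = friedmanchisquare(*acc)

# FS-level test: transpose so each column (feature condition)
# becomes a treatment, measured across the five models.
stat_f, p_f = friedmanchisquare(*acc.T)

print(f"models: chi2={stat_m:.3f}, p={p_m:.4f}")
print(f"FS:     chi2={stat_f:.3f}, p={p_f:.4f}")
```

Each p-value is then compared against α = 0.05 to decide whether to reject the corresponding H0.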
Based on the accuracy shown in
Table 4, we performed two separate Friedman tests. Taking the models as the treatment groups, H0 and H1 are as follows:
H0. Under the original dataset features plus the five FS methods, there is no significant difference in accuracy among the five models (DT, KNN, SVM, RF, XGBoost).
H1. There is a significant difference in accuracy among the five models.
First, for the model-level Friedman test (grouped by model), the test statistic is χ2 = 21.2773 with p = 0.000279 (rounded). Because p < 0.05, we reject H0, indicating a statistically significant difference in the accuracy of the five models across the six feature conditions (the five FS methods plus the original features). In other words, at least one model performs differently from the others by more than random error can explain. From the observed results, more complex models (such as RF and XGBoost) consistently show higher accuracy than simpler models (such as DT and KNN).
Taking the FS methods as the treatment groups, H0 and H1 are as follows:
H0. Under the five models, there is no significant difference in accuracy between the original dataset features and the five FS methods (PCC, MI, SFS, SBS, ET).
H1. There is a significant difference in accuracy between the original dataset features and the five FS methods.
Next, for the FS method-level Friedman test (grouped by FS method), the test statistic χ2 = 6.5294 and p = 0.2581. Because p > 0.05, we fail to reject H0, indicating that there is no statistically significant difference in the accuracy between the original dataset features and the five FS methods under the five models. Although numerical differences exist (e.g., MI vs. original), these differences did not reach statistical significance with the current dataset and sample size. From a practical perspective, this demonstrates that using FS methods can indeed reduce the original dataset features without affecting the model’s accuracy.
In summary, combined with the contents of parts
Appendix B.1 and
Appendix B.2, although complex models (e.g., XGBoost) may improve accuracy, they also substantially increase execution time and CPU utilization. Thus, selecting an appropriate model requires consideration of computing resource constraints. Meanwhile, FS methods (e.g., PCC, ET) can significantly lower CPU consumption by reducing feature dimensionality. This aligns with our previously identified optimal combination, indicating that KNN combined with ET also meets the criteria for the best system from a statistical standpoint.