1. Introduction
In recent years, the rapid growth of the global biopharmaceutical industry has accelerated the cross-border movement of special biological resources. Among them, infectious substances pose a particularly significant challenge to precise and efficient customs port supervision, owing to the complexity and diversity of their associated risks. Ensuring efficient risk assessment and accurate risk identification during port supervision has therefore become a pressing technological problem in the regulation of infectious substances at ports of entry. Addressing it is essential for customs authorities to safeguard national border security and to support the sustainable development of the biotechnology industry [1].
Machine learning methods have been widely employed in security risk assessment and risk grading research because of their strong capability to process complex data structures and to uncover potential risk patterns. These advantages make them an important technical approach for building scientific and rational risk assessment systems. Previous studies [2,3,4,5,6,7,8] have integrated Fault Tree Analysis (FTA) and frequency statistics with methods such as Event Sequence Diagrams (ESD), Bayesian Networks (BN), risk matrix approaches, and fuzzy theory. These combined methodologies have been applied in safety risk assessments across diverse domains, including lightning strike incidents, urban gas systems, desulfurization processes, and hemodialysis, effectively strengthening the systematic nature of risk identification and improving the accuracy of assessment outcomes. Studies [9,10,11,12,13,14] have extensively applied decision tree algorithms to risk assessment in domains such as agricultural supply chains, diabetes prediction, underground engineering permeability incidents, tree collapse, and cognitive impairment among the elderly. Their findings suggest that pruning optimization and integration with algorithms such as random forest or Bayesian networks can significantly enhance the generalization ability, accuracy, and sensitivity of decision tree models, thereby improving their practicality and reliability in various risk prediction scenarios. Studies [15,16,17,18,19] highlight the widespread application and strong performance of the Random Forest (RF) algorithm in risk assessment across multiple domains, including urban water pipeline bursts, earthquake information release safety, power cable repair decisions, landslide hazard evaluation, and wildfire risk prediction. By integrating diverse factors such as weather conditions, vegetation types, and infrastructure, RF models have demonstrated both high accuracy and robust generalization ability in various case studies; in some applications, the model achieved a maximum fitting rate of 99.70%, offering effective decision support for infrastructure maintenance, disaster management, and ecosystem risk identification. Collectively, these studies underscore the advantages of RF in handling nonlinear, high-dimensional, and imbalanced data, making it particularly suitable for multi-source risk modeling and prediction in complex scenarios.
Recent studies have demonstrated that Support Vector Machine (SVM) algorithms perform remarkably well in infectious disease prediction and clinical risk modeling. For instance, Lilhore et al. developed a hybrid random forest–SVM model for hepatitis C virus (HCV) classification, achieving more than 96% accuracy by integrating feature ranking and oversampling techniques to address data imbalance [20]. Similarly, Ghazal et al. applied a fine Gaussian SVM to predict hepatitis C staging, obtaining 97.9% accuracy and effectively assisting early diagnosis [21]. Wu et al. constructed multiple machine learning models, including SVM, for the preoperative identification of urinary infection stones, demonstrating robust discrimination performance with AUC values up to 0.77 [22]. In addition, He et al. employed SVM to predict HIV infection risk among men who have sex with men (MSM), providing a data-driven approach to enhance HIV surveillance and prevention strategies [23]. Sun et al. further used SVM and other machine learning methods to predict post-chemotherapy lung infection in cancer patients, illustrating the effectiveness of SVM-based algorithms in clinical infection risk assessment [24]. Collectively, these studies highlight the robust generalization ability and strong classification performance of SVM in complex infectious disease prediction tasks, providing a solid foundation for its application in port-of-entry infectious substance risk assessment. These advances provide effective technical support for risk prevention and control in meteorological disasters, urban safety management, healthcare, and industrial production. However, despite the maturity of SVM-based risk assessment models in multiple domains, research addressing the complex risk characteristics associated with imbalanced and multidimensional datasets of infectious substances at ports of entry remains limited.
Meanwhile, recent studies have demonstrated that hybrid models integrating Support Vector Machine (SVM) and Random Forest (RF) algorithms can further enhance prediction accuracy and robustness across various complex domains. For instance, Xu et al. improved rural land-cover classification performance by combining SVM and RF using the C5.0 algorithm, achieving higher accuracy than conventional classifiers such as XGBoost [25]. Similarly, Sivaraja et al. developed an RF–Least Squares SVM model to predict the anti-carbonation performance of concrete with an R² of 0.999 [26], while Gelmez et al. applied an iterative SVM-RF approach to identify autism-related gene signatures with 92% diagnostic accuracy [27]. These findings collectively indicate that integrating SVM and RF can effectively exploit the strengths of both algorithms, leveraging RF’s feature selection and ensemble learning capabilities alongside SVM’s high-dimensional nonlinear classification power, to improve model generalization, stability, and interpretability. Building upon these insights, the present study introduces an improved SVM-RF model tailored to the complex, imbalanced, and multidimensional datasets characteristic of infectious substances at ports of entry.
Unlike previous research in fields such as meteorological disasters, industrial production, and infrastructure safety, existing machine learning methods exhibit clear limitations in the risk assessment of infectious substances at ports of entry. The primary reason is that such risk subjects are both highly complex and multidimensional. At the same time, certain high-risk or special-category infectious substances occur with extremely low frequency, resulting in significantly imbalanced sample data. This paper therefore applies frequency statistics of risk factors drawn from accident cases to systematically organize infectious substance risk factors, presenting safety risk failure pathways logically and intuitively to clarify the potential risks at each stage. By comparing simulation results from decision tree, random forest, support vector machine, and hybrid RF-SVM models, an improved SVM-RF model was constructed. The model’s generalization performance was analyzed using four metrics (precision, recall, specificity, and F1-score), followed by a comparative evaluation of the prediction models.
Therefore, this study aims to develop a data-driven and generalizable risk assessment model for infectious substances intercepted at ports of entry by integrating the strengths of Support Vector Machine (SVM) and Random Forest (RF) algorithms. By combining the ensemble learning and feature extraction capability of RF with the nonlinear classification advantage of SVM, the proposed approach seeks to overcome the limitations of traditional machine learning models that struggle with high-dimensional, imbalanced, and heterogeneous data typically observed in port supervision scenarios. The ultimate goal is to enhance both the accuracy and interpretability of infectious substance risk classification, thereby supporting more scientific and objective decision-making in customs biosecurity management.
To achieve this objective, the study focuses on addressing the following research questions.
First, what are the key risk factors that significantly influence the classification and assessment of infectious substances at ports of entry? Understanding these factors is essential for constructing a representative feature set and improving the transparency of the model’s decision process.
Second, how can a hybrid SVM–RF algorithm be optimized to enhance classification accuracy and generalization performance compared with conventional single-algorithm approaches? This question aims to explore algorithmic optimization and model integration strategies to improve predictive robustness.
Third, to what extent can the proposed model align with expert evaluations and enhance the objectivity and efficiency of customs risk management for infectious substances? This inquiry evaluates the model’s practical applicability and its contribution to intelligent risk-based supervision.
By systematically addressing these research questions, this study establishes a scientific and interpretable framework for risk identification and decision-making in customs biosecurity supervision, contributing to the advancement of data-driven approaches for port-based infectious substance management.
2. Machine Learning Model Theory and Optimization
Decision tree algorithms are characterized by an intuitive tree structure and clear rule interpretability, enabling effective handling of mixed numerical and categorical data while maintaining high training efficiency. The random forest (RF) algorithm utilizes a bagging framework that integrates multiple decision trees, substantially reducing variance through dual randomization of samples and feature subsets, thereby achieving both high accuracy and strong resistance to overfitting. Support Vector Machine (SVM) algorithms construct optimal classification hyperplanes based on the principle of structural risk minimization, demonstrating strong generalization performance in small-sample, high-dimensional datasets and showing particular suitability for complex pattern recognition with well-defined decision boundaries. The fusion of SVM and Random Forest (SVM-RF) combines the capability of SVM to construct optimal separating hyperplanes in high-dimensional spaces with the ensemble advantages of RF in reducing variance and enhancing generalization, aiming to further improve overall model performance. To systematically evaluate the advantages and applicability of these algorithms, this study constructs classification models using four methods—decision trees, random forests, SVMs, and the improved SVM-RF—for comparative analysis.
2.1. Fundamental Theory of Machine Learning Algorithms
To facilitate a clear understanding of the theoretical basis underlying the model development process, this section first introduces the fundamental concepts of several representative machine learning algorithms used in this study. These include the decision tree, random forest, and support vector machine (SVM) models, each of which provides distinct advantages in handling classification and prediction tasks involving complex, multidimensional data. A schematic overview of their theoretical relationships and structural characteristics is presented in Figure 1, serving as the foundation for subsequent model construction and performance comparison.
- (1) A decision tree is a supervised machine learning algorithm that represents decision rules and classification outcomes using a hierarchical tree structure. Structurally, it consists of nodes and edges arranged in a top-down, tree-like format, where internal nodes correspond to feature-based decision criteria, branches denote decision paths, and leaf nodes indicate classification results.
- (2) Support Vector Machine (SVM) is a supervised learning algorithm grounded in the principle of structural risk minimization. Its central concept is to construct a maximum-margin hyperplane that separates classes in the feature space, thereby achieving strong generalization performance for classification or regression tasks. The optimal hyperplane maximizes the margin between sample classes, enhancing the model’s robustness. The classifier is defined by support vectors, which are the data points closest to the separating hyperplane that determine the decision boundary.
- (3) Random Forest (RF) is an ensemble learning algorithm based on the Bagging framework, employing decision trees as base learners. The algorithm generates multiple training subsets through bootstrap sampling from the original dataset and selects random feature subsets as candidates for splitting during the construction of each tree. By aggregating the predictions of all trees, RF effectively reduces overfitting and improves generalization, predictive accuracy, and model stability (see the code sketch after this list).
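For concreteness, the following minimal sketch instantiates the three base learners described above using the scikit-learn library listed in Section 2.3; the hyperparameter values shown are illustrative placeholders rather than the study’s tuned settings.

```python
# Sketch: the three base learners discussed in (1)-(3), via scikit-learn.
# Hyperparameter values are placeholders, not the tuned configuration reported later.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

decision_tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)
random_forest = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
svm_rbf = SVC(kernel="rbf", C=10.0, gamma=0.1, probability=True, random_state=42)
```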
2.2. Improved SVM-RF Algorithm
Support Vector Machine-Random Forest (SVM-RF) is an innovative ensemble learning framework that constructs robust predictive models by combining the high-dimensional mapping capability of Support Vector Machines (SVM) with the ensemble decision-making advantages of Random Forests (RF). In the context of border inspection risk assessment for infectious substances, the data are characterized by high dimensionality, multi-source origins, and significant class imbalance. Direct application of conventional machine learning models often results in overfitting or insufficient recognition of minority-class risks, thereby undermining predictive performance. To address these challenges, this study proposes an improved SVM-RF algorithm that integrates RF with SVM to enable precise modeling and efficient identification of infectious substance risks at border ports.
To ensure the reliability and generalization of the improved SVM-RF model, a systematic hyperparameter optimization process was implemented. Specifically, a grid search combined with 10-fold cross-validation was employed to identify the optimal parameter settings for both the RF and SVM components. For the Random Forest, the number of trees and the maximum depth of trees were tuned within the ranges of 50–500 and 3–15, respectively. For the SVM with an RBF kernel, the penalty coefficient (C) and kernel width (γ) were explored within the ranges of 0.1–1000 and 10⁻⁴–10, respectively. The combination yielding the highest average F1-score on the validation set was selected as the final model configuration. This approach ensures that the optimized parameters are not overfitted to the training data and that the model maintains robust predictive performance across different risk categories of infectious substances.
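A minimal sketch of this tuning procedure is given below, assuming scikit-learn’s GridSearchCV; the parameter grids mirror the ranges stated above, the macro-averaged F1-score is used as the selection criterion, and the variables X_train and y_train are illustrative names for the training data.

```python
# Sketch: grid search with 10-fold cross-validation for the RF and SVM components.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Random Forest: number of trees (50-500) and maximum depth (3-15).
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100, 200, 300, 500],
                "max_depth": [3, 5, 7, 10, 15]},
    scoring="f1_macro", cv=10)

# SVM with RBF kernel: penalty coefficient C (0.1-1000) and kernel width gamma (1e-4 to 10).
svm_grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100, 1000],
                "gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]},
    scoring="f1_macro", cv=10)

# rf_grid.fit(X_train, y_train); svm_grid.fit(X_train, y_train)
# The best settings are then read from rf_grid.best_params_ and svm_grid.best_params_.
```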
The random forest component shown in Figure 2 is primarily used for feature selection and importance evaluation. For classification, the prediction function of the random forest can be expressed as

$$H(x) = \arg\max_{c} \sum_{t=1}^{T} I\left(h_t(x) = c\right) \tag{1}$$

Here, $H(x)$ represents the final prediction result, $T$ denotes the number of decision trees, $h_t(x)$ is the output of the $t$-th decision tree, and $I(\cdot)$ is the indicator function. During training, the random forest measures feature importance by calculating each feature’s contribution to reducing Gini impurity across all decision trees. Leveraging the infectious substance case repository, this study employs this mechanism for recursive feature selection of risk factors. Variables with minimal contribution to model classification are progressively eliminated, yielding an optimal feature subset that effectively reflects port risk characteristics.
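The following sketch illustrates this Gini-based ranking and elimination step, assuming the training features are held in a pandas DataFrame X_train with a label vector y_train; the 1% importance threshold is an illustrative cut-off, not the study’s reported criterion.

```python
# Sketch: Gini-based importance ranking and removal of low-contribution risk factors.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
importance = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
selected = importance[importance > 0.01].index      # drop features with near-zero contribution
X_train_sel = X_train[selected]
```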
After completing feature selection, a support vector machine is employed to construct the final classification model. Its optimization objective is

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{n}\xi_i, \quad \text{s.t.}\; y_i\left(w^{\mathrm{T}}\phi(x_i) + b\right) \geq 1 - \xi_i,\; \xi_i \geq 0 \tag{2}$$

Here, $w$ denotes the weight vector, $b$ represents the bias term, and $(x_i, y_i)$ signifies the input sample and its corresponding risk label; $C$ is the penalty coefficient, $\xi_i$ is the slack variable, and $\phi(\cdot)$ is the kernel-induced feature mapping. Through kernel function mapping, the SVM projects the multidimensional risk factors of infectious substances into a high-dimensional feature space. Within this space, it locates the optimal classification hyperplane, enabling effective differentiation between high-risk and low-risk samples.
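For reference, the kernel mapping referred to above is typically realized with the radial basis function kernel, and the resulting decision function takes the standard dual form shown below; these are textbook formulations rather than equations reproduced from the original manuscript.

```latex
K(x_i, x_j) = \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right),
\qquad
f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{n} \alpha_i\, y_i\, K(x_i, x) + b\right)
```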
2.3. Algorithm
Platform: Python 3.11; Libraries: Scikit-learn, Pandas, NumPy
Input: Complete dataset; feature matrix X; label vector y
Output: Optimized model parameters (C, γ), Confusion Matrix, Accuracy (A), Precision (P), Recall (R), and F1-score (F1)
- (1) Data Partitioning: Split X and y into training and testing subsets: (X_train, y_train) and (X_test, y_test).
- (2) Initialize Random Forest (RF): Set a fixed random seed and train the Random Forest Classifier on (X_train, y_train).
- (3) Feature Importance Evaluation: Compute Gini-based feature importance; eliminate redundant or low-contribution variables.
- (4) Feature Selection: Retain the top-ranked features to form the optimized feature subset X′.
- (5) Generate RF Predictions: Obtain preliminary prediction probabilities and feature importance weights from the trained RF model.
- (6) Construct SVM Input Set: Combine the selected feature subset X′ and the RF-derived prediction probabilities as the new input data Z.
- (7) Parameter Optimization: Perform grid search with cross-validation to tune the SVM parameters C and γ under an RBF kernel.
- (8) Model Training: Train the optimized SVM model using (Z_train, y_train).
- (9) Model Evaluation: Apply the trained model to Z_test; compute A, P, R, and F1.
- (10) Result Output: Return the confusion matrix, accuracy, precision, recall, and F1-score (see the implementation sketch after this list).
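A minimal end-to-end sketch of steps (1)–(10) is given below, assuming scikit-learn and NumPy as listed above; the feature matrix X and label vector y are assumed to be numeric NumPy arrays, and the feature cut-off, grids, and variable names are illustrative rather than the study’s exact configuration.

```python
# Sketch: improved SVM-RF pipeline following algorithm steps (1)-(10).
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# (1) Data partitioning (stratified 8:2 split, as used in Section 5).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# (2)-(4) Train the RF, rank Gini importance, keep the top-ranked features.
rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_tr, y_tr)
top = np.argsort(rf.feature_importances_)[::-1][:6]            # illustrative cut-off

# (5)-(6) Augment the selected features with the RF class probabilities.
Z_tr = np.hstack([X_tr[:, top], rf.predict_proba(X_tr)])
Z_te = np.hstack([X_te[:, top], rf.predict_proba(X_te)])

# (7)-(8) Grid-search the RBF-SVM on the augmented inputs and train it.
svm = GridSearchCV(SVC(kernel="rbf"),
                   {"C": [0.1, 1, 10, 100, 1000],
                    "gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]},
                   scoring="f1_macro", cv=10).fit(Z_tr, y_tr)

# (9)-(10) Evaluate on the held-out set and report the metrics.
y_pred = svm.predict(Z_te)
print(confusion_matrix(y_te, y_pred))
print(accuracy_score(y_te, y_pred),
      precision_score(y_te, y_pred, average="macro"),
      recall_score(y_te, y_pred, average="macro"),
      f1_score(y_te, y_pred, average="macro"))
```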
As illustrated in Figure 3, the improved SVM-RF model adopts a two-stage fusion framework for assessing infectious substance risks at border ports; the figure presents the overall workflow, in which the Random Forest (RF) component performs feature selection and importance evaluation and the Support Vector Machine (SVM) component conducts the final classification.
In the first stage, a Random Forest is trained on multidimensional risk factors. By aggregating predictions from multiple decision trees through majority voting, it generates preliminary prediction probabilities. Feature importance for each risk factor is calculated based on Gini impurity reduction, effectively removing redundant variables and mitigating noise caused by sample imbalance. In the second stage, the RF-generated predictions and importance weights are incorporated as new features into a Support Vector Machine. A Radial Basis Function (RBF) kernel provides the nonlinear mapping needed to construct an optimal separating hyperplane in high-dimensional space that maximizes classification margins among different categories of infectious substances. This staged fusion mechanism preserves the strengths of RF in feature selection and generalization while leveraging the precision of SVM for nonlinear classification in high-dimensional spaces. Compared with alternative models, it achieves more efficient identification and more accurate prediction of infectious substance risk levels at ports.
3. Sample Data Sources and Data Processing
The import and export of infectious substances involve multiple dimensions of operational information, including biological characteristics, inspection approvals, packaging features, external environmental conditions, and emergency management protocols. The sources of risk in this process are highly interconnected and multidimensional. Risk factors encompass not only technical elements, such as sensitivity deviations in detection equipment and algorithmic limitations in customs data analysis systems, but also management and societal components. These include the professional competence of frontline quarantine personnel, interdepartmental coordination mechanisms, adherence to international standards for infectious substances, and public awareness regarding biohazardous materials.
3.1. Case Statistics and Risk Factor Selection
This study employs a frequency-based statistical approach to analyze risk factors in accident cases. For risk tracing of special biological resources, it is essential to systematically integrate equipment operation logs, quarantine anomaly records, historical violation cases, and routine monitoring data. The workflow of this assessment method is illustrated in Figure 4.
By analyzing the case database of risk incidents at Beijing Customs over the past five years, this study identifies eight critical factors that should be prioritized in constructing a risk analysis model: the intrinsic characteristics of infectious substances, transportation capacity requirements, risk profiles of their origins, laboratory safety levels at user facilities, intended import purposes, completeness of customs declaration materials, emergency response speed, and equipment readiness at user facilities. A multi-parameter coupled dynamic assessment system is then established to enable comprehensive risk analysis and tiered early warning for infectious substances. The classification of primary influencing factors is illustrated in Figure 5.
3.2. Dataset Quantification and Standardization
The infectious substance risk assessment dataset constructed in this study primarily draws on risk cases involving infectious substances intercepted at Beijing Customs over the past five years. Due to incomplete records in certain risk factor dimensions, a portion of the raw inspection data contains missing values. Direct use of such incomplete data for model training may compromise the accuracy and stability of predictive models due to insufficient valid samples.
To address this issue, a detailed analysis of the missing-value distribution was first conducted. Missing values were primarily concentrated in three feature dimensions: emergency response speed (8%), equipment readiness (6%), and completeness of customs declaration materials (5%); the remaining variables exhibited less than 2% missingness. The missingness pattern followed a Missing at Random (MAR) distribution, indicating no systematic bias in data collection.
To ensure data integrity, three imputation strategies—mean imputation, k-nearest neighbors (KNN) imputation, and correlation-based interpolation—were comparatively evaluated using ten-fold cross-validation. The results showed that KNN imputation produced slightly better performance, improving model accuracy by approximately 0.8–1.2% compared with the other two methods, while maintaining consistent classification trends across all algorithms. Based on these results, KNN imputation was adopted as the final preprocessing approach, as it effectively preserved the local data structure and minimized bias.
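A minimal sketch of the adopted KNN imputation step is shown below, assuming the quantified (numeric) raw records are held in a pandas DataFrame df_raw; the neighbour count k = 5 is a common default and not necessarily the value used in the study.

```python
# Sketch: KNN imputation of the partially missing risk-factor columns.
import pandas as pd
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5, weights="distance")
df_imputed = pd.DataFrame(imputer.fit_transform(df_raw), columns=df_raw.columns)
```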
Furthermore, key indicators for high-risk sample types were manually supplemented and corrected through field research and expert consultation. This comprehensive data refinement process ultimately produced 500 complete and standardized data samples suitable for machine learning modeling. In accordance with the nationally prescribed risk classification table for special biological resources, the risks of infectious substances at ports were categorized using the grading criteria presented in Table 1.
Through the collection and analysis of 500 infectious substance risk assessment cases, eight key risk factors were identified. Based on the specific characteristics of infectious substance risks, the corresponding risk factor indicators were quantified, as summarized in Table 2.
Prior to model construction, the dataset underwent a comprehensive data quality inspection. Missing values were identified and addressed through appropriate imputation methods, and outliers were examined using interquartile range (IQR) and z-score analysis to ensure data consistency. The dataset was further analyzed for class distribution to assess balance among different risk levels of infectious substances. The results indicated that the dataset was moderately imbalanced. To mitigate this imbalance and improve model generalization, a combination of stratified sampling and class-weight adjustment strategies was employed during model training. These preprocessing steps ensured the robustness and reliability of the subsequent machine learning analysis.
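The sketch below illustrates how these two strategies can be combined in scikit-learn: a stratified split preserves the risk-level proportions in both subsets, and class_weight="balanced" re-weights under-represented risk levels during training. Variable names are illustrative.

```python
# Sketch: stratified sampling plus class-weight adjustment for the imbalanced risk levels.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
rf  = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)
svm = SVC(kernel="rbf", class_weight="balanced")
```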
In the machine learning workflow, dataset normalization is a crucial preprocessing step. It constrains data within a defined range through mathematical transformations, eliminating differences in scale and magnitude among feature variables and producing a dimensionless dataset. This ensures that all features are on a comparable scale, facilitating analysis and the development of reliable predictive models. In practical applications, dataset features often exhibit varying magnitudes and distribution patterns; normalization effectively mitigates the impact of these differences on model performance. In this study, Z-score normalization (standard deviation normalization) is employed, as defined in Equations (3) and (4):

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i,\qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\mu\right)^{2}} \tag{3}$$

$$z_i = \frac{x_i-\mu}{\sigma} \tag{4}$$

Here, $x_i$ represents the $i$-th data point in the dataset, $z_i$ denotes its normalized value, $\mu$ is the population mean of the dataset calculated over all $n$ data points, and $\sigma$ is the standard deviation of the dataset. This approach transforms the data into a standard normal distribution with a mean of 0 and a standard deviation of 1.
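In practice, this standardization can be applied with scikit-learn’s StandardScaler, as in the sketch below; the scaler is fitted on the training split only and then applied to the test split to avoid information leakage. Variable names are illustrative.

```python
# Sketch: Z-score normalization (Equations (3) and (4)) via StandardScaler.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)   # fit mean and standard deviation on training data
X_te_std = scaler.transform(X_te)       # reuse the same parameters for the test data
```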
5. Comparative Analysis of Machine Learning Results
In machine learning, model complexity alone does not always determine classification performance; the feature attributes of the dataset also substantially influence model outcomes. To comprehensively assess the applicability of different models for predicting infectious substance risks at ports, this study constructs each model using an 8:2 training-to-test split based on identical dataset partitioning criteria. Four classification models are compared: decision tree, support vector machine, random forest, and SVM-RF. All models were trained and tested under a unified data preprocessing workflow to ensure comparability of evaluation results. Notably, all classification models in this study were developed and implemented using the Java programming language, providing scalability and portability aligned with practical engineering requirements for subsequent system integration and deployment.
Figure 7 presents the evaluation results of the decision tree model. The test set comprises 100 cases, of which 90 were correctly predicted, resulting in an overall prediction accuracy of 90%. Among the true categories, there are 32 samples at Level 1, 28 at Level 2, 23 at Level 3, and 17 at Level 4. In the predicted categories, 28 samples are classified as Level 1, 24 as Level 2, 21 as Level 3, and 17 as Level 4.
Figure 8 presents the evaluation results of the support vector machine (SVM) model. The test set comprises 100 cases, of which 90 were correctly predicted, resulting in an overall prediction accuracy of 90%. Among the true categories, there are 32 samples at Level 1, 28 at Level 2, 23 at Level 3, and 17 at Level 4. In the predicted categories, 28 samples are classified as Level 1, 26 as Level 2, 21 as Level 3, and 17 as Level 4.
Figure 9 presents the evaluation results of the random forest (RF) model. The test set comprises 100 cases, of which 90 were correctly predicted, resulting in an overall prediction accuracy of 90%. Among the true categories, there are 32 samples at Level 1, 28 at Level 2, 23 at Level 3, and 17 at Level 4. In the predicted categories, 28 samples are classified as Level 1, 25 as Level 2, 21 as Level 3, and 17 as Level 4.
Figure 10 presents the evaluation results of the SVM-RF model. The test set comprises 100 cases, of which 90 were correctly predicted, resulting in an overall prediction accuracy of 90%. Among the true categories, there are 32 samples at Level 1, 28 at Level 2, 23 at Level 3, and 17 at Level 4. In the predicted categories, 27 samples are classified as Level 1, 27 as Level 2, 21 as Level 3, and 17 as Level 4. The detailed number of correctly predicted samples in the confusion matrix is provided in Table 4.
The accuracy rates of the four models (decision tree, SVM, RF, and SVM-RF) are 0.91, 0.92, 0.91, and 0.93, respectively. Based solely on prediction accuracy, no significant difference is observed between the support vector machine (SVM) and the optimized SVM-RF model. Therefore, further comparison using additional evaluation metrics is conducted.
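The additional metrics can be derived directly from each model’s confusion matrix, as sketched below; y_te and y_pred denote the true and predicted risk levels of the test set, and specificity is computed per class in the usual one-vs-rest fashion (an illustrative computation, not the study’s reporting script).

```python
# Sketch: per-class precision, recall, F1, and specificity from the confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

cm = confusion_matrix(y_te, y_pred)                 # rows: true levels, columns: predicted levels
prec, rec, f1, _ = precision_recall_fscore_support(y_te, y_pred, average=None)
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
tn = cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + tp
specificity = tn / (tn + fp)                        # one value per risk level
```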
Figure 11 presents the specific values of these performance metrics, clearly demonstrating that the improved SVM-RF model outperforms all other models across all indicators. To further validate the classification performance, Receiver Operating Characteristic (ROC) analysis was conducted. The ROC curves presented in Figure 12 provide a clear visual comparison of the classification capability of the four models. Each ROC curve represents the trade-off between the true positive rate and the false positive rate under varying thresholds, with the area under the curve (AUC) serving as a quantitative indicator of model performance.
As shown in Figure 12, the AUC values for DT, RF, SVM, and SVM-RF are 0.9041, 0.9592, 0.9670, and 0.9828, respectively. The improved SVM-RF model exhibits the largest AUC, indicating its superior ability to discriminate between high-risk and low-risk samples. Its ROC curve lies closest to the upper-left corner, signifying an optimal balance between sensitivity and specificity. These results provide strong visual and quantitative evidence of the enhanced predictive accuracy of the improved model.
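For the four-level classification task, the ROC curves and AUC values can be obtained in a one-vs-rest manner, as in the sketch below; model stands for any fitted classifier that exposes predicted class probabilities (for the SVM this requires probability estimates to be enabled), and variable names are illustrative.

```python
# Sketch: one-vs-rest ROC/AUC for the four risk levels from predicted class probabilities.
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.preprocessing import label_binarize

classes = np.unique(y)                                  # matches the column order of predict_proba
y_score = model.predict_proba(X_te)
macro_auc = roc_auc_score(y_te, y_score, multi_class="ovr", average="macro")

y_te_bin = label_binarize(y_te, classes=classes)        # binarized labels for a micro-averaged curve
fpr, tpr, _ = roc_curve(y_te_bin.ravel(), y_score.ravel())
print("macro AUC:", macro_auc, "micro AUC:", auc(fpr, tpr))
```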
Overall, the improved SVM-RF algorithm demonstrates greater stability, precision, and adaptability compared with single models. By effectively leveraging the nonlinear feature extraction of SVM and the anti-overfitting capability of RF, the model offers a reliable framework for data-driven risk identification and classification of infectious substances at ports of entry. This contributes to improving the efficiency and scientific basis of port biosecurity risk assessment and management.
To further ensure the reliability of the comparative results, a paired t-test was performed on the accuracy and AUC values obtained from ten-fold cross-validation. The statistical results showed that the performance improvement of the SVM-RF model over the DT, RF, and SVM models was significant at the 0.05 level (p < 0.05). This finding confirms that the observed enhancement in predictive accuracy is statistically meaningful rather than a result of random variation, thereby reinforcing the robustness of the proposed model.
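A minimal sketch of this significance test is shown below, assuming SciPy is available in addition to the libraries listed in Section 2.3; model_a and model_b are placeholders for any two classifiers being compared on identical cross-validation folds.

```python
# Sketch: paired t-test on ten-fold cross-validation accuracies of two models.
from scipy.stats import ttest_rel
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
acc_a = cross_val_score(model_a, X, y, cv=cv, scoring="accuracy")
acc_b = cross_val_score(model_b, X, y, cv=cv, scoring="accuracy")
t_stat, p_value = ttest_rel(acc_a, acc_b)   # p < 0.05 indicates a significant difference
```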
6. Discussion and Conclusions
This study employs machine learning algorithms to construct a risk assessment model for infectious substances at ports of entry, aiming to enhance the efficiency of risk identification and evaluation for such materials through a data-driven approach. Among the compared methods (decision trees, support vector machines, random forests, and the improved SVM-RF algorithm), the enhanced SVM-RF model demonstrated superior accuracy and generalization capability relative to the individual algorithms. It combines the high-dimensional classification ability of support vector machines with the anti-overfitting advantages of random forests, making the model more robust in practical applications and capable of addressing complex infectious substance risk assessment challenges. Understanding how the identified risk factors influence risk levels aids in further optimizing risk assessment models and strengthening control measures.
Although the improvement in overall accuracy of the SVM-RF model over the baseline algorithms appears modest (from 0.91 to 0.93), a paired t-test on ten-fold cross-validation results confirmed that this gain is statistically significant (p < 0.05). Importantly, this 2% increase in accuracy corresponds to an approximately 18% reduction in the misclassification rate of high-risk infectious substance samples, which is practically meaningful in the context of customs risk management. Even small improvements in predictive precision can substantially enhance early warning systems, inspection prioritization, and the efficient allocation of limited safety resources. From a sustainability perspective, optimizing resource use while maintaining or improving safety aligns with broader goals of sustainable public health management and risk mitigation.
Nevertheless, this study relies on historical case data from Beijing Customs, and the dataset remains relatively small and geographically constrained. These factors may limit the generalizability of the proposed model to other ports or regions with distinct operational patterns, data structures, or inspection protocols. Furthermore, rare or emerging infectious substances may be underrepresented in the current dataset, potentially affecting the predictive robustness for such cases. Future research should expand the dataset to include samples from multiple ports and diverse temporal contexts, thereby enhancing the external validity, adaptability, and sustainability of the proposed risk assessment framework.
Beyond generalizability, another important aspect concerns the interpretability of the proposed model in practical applications. While feature importance is derived from the Random Forest component, the interpretability of the final SVM-RF framework also deserves attention. Although the SVM-RF hybrid model primarily focuses on improving predictive performance, its decisions can still be explained to end-users through several means. Specifically, the Random Forest sub-model provides feature importance rankings that reveal which variables contribute most to risk classification, offering an interpretable basis for understanding model behavior. In addition, representative case analyses can be used to visualize decision boundaries between risk levels, enabling customs officers to understand why specific samples are classified as high or low risk. In future research, integrating explainable AI (XAI) tools such as SHAP or LIME could further enhance interpretability, helping domain experts trace model outputs back to specific features and strengthening trust in model-driven risk assessments.
In future research, the integration of singular pooling [29] could further enhance the feature representation capability of the SVM-RF framework. Singular pooling, a spectral-domain pooling paradigm based on singular value decomposition (SVD), enables efficient compression of redundant features while preserving key discriminative information. Incorporating this mechanism into the SVM-RF fusion process may facilitate more compact and information-rich feature representations before classification, effectively mitigating the challenges posed by high-dimensional and multi-source risk data. Therefore, future work will focus on designing an extended “SVM-RF + Singular Pooling” framework or applying singular pooling directly to the infectious substance risk dataset to explore its potential in improving model robustness, generalization, and computational efficiency.
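As a rough illustration of the intended direction, the sketch below applies a truncated SVD to the fused feature block before classification; this is only a simplified stand-in for the singular-pooling operator of [29], and the number of retained components is arbitrary.

```python
# Sketch: SVD-based compression of the fused feature block prior to SVM classification.
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=5, random_state=42)
Z_tr_compact = svd.fit_transform(Z_tr)      # Z_tr/Z_te: fused features from the SVM-RF pipeline
Z_te_compact = svd.transform(Z_te)
print("retained variance:", svd.explained_variance_ratio_.sum())
```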