Article

Identifying Key Drivers of Foodborne Diseases in Zhejiang, China: A Machine Learning Approach

1
Department of Nutrition and Food Safety, Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou 310051, China
2
School of Management, Zhejiang University, Hangzhou 310058, China
*
Authors to whom correspondence should be addressed.
Foods 2025, 14(16), 2857; https://doi.org/10.3390/foods14162857
Submission received: 21 June 2025 / Revised: 25 July 2025 / Accepted: 30 July 2025 / Published: 18 August 2025
(This article belongs to the Special Issue Emerging Challenges in the Management of Food Safety and Authenticity)

Abstract

Foodborne diseases represent a significant public health challenge worldwide. This study systematically analyzed the temporal dynamics, key predictors, and seasonal patterns of pathogen-specific foodborne diseases using a dataset of 56,970 cases from Zhejiang Province, China, spanning 2014 to 2023. A comprehensive set of 91 candidate variables was constructed by integrating epidemiological, environmental, socioeconomic, and agricultural data. Lasso regression was employed to identify 26 important predictors. Based on these variables, supervised machine learning models (Random Forest and XGBoost) were trained and evaluated, achieving training set classification accuracies of 86% and 87%, respectively, demonstrating robust performance. Feature importance analysis revealed that patient age, food type, climate policy, and processing methods were the most influential determinants, highlighting the combined impact of host, exposure, and environmental factors on disease risk. The results demonstrated significant shifts in the pathogen spectrum over the past decade, including a steady decline in Vibrio parahaemolyticus, an increase in Salmonella after 2016, and persistent seasonal peaks in Norovirus and Vibrio parahaemolyticus during warmer months. Seasonal ARIMA modeling and time-series decomposition further confirmed the critical role of seasonal and trend components in bacterial incidence. Overall, this study demonstrates the value of integrating machine learning and time-series analysis for pathogen-specific surveillance, risk prediction, and targeted public health interventions.

1. Introduction

Foodborne diseases represent a growing global health concern, with millions of cases reported annually, impacting public health systems and economies across various regions [1,2,3]. In China, where rapid economic development has led to changes in food production, distribution, and consumption patterns, foodborne illnesses remain a significant issue [4,5]. In recent years, with the intensification of uncertainties such as climate change, region-specific risks of foodborne diseases have become increasingly prominent [6]. Located on the eastern coast of China, Zhejiang Province is characterized by a dense population, a vibrant food industry, and diverse dietary practices, making it a hotspot for foodborne illness outbreaks. Its unique socio-economic landscape presents a range of challenges for food safety governance. The rapid pace of urbanization, high volumes of tourism, and growing demand for diversified food products have collectively heightened the risk of foodborne disease transmission. Particularly during the summer months, the province’s hot and humid climate creates favorable conditions for the proliferation and spread of foodborne pathogens. These environmental factors, combined with inadequate food storage and improper handling practices, further exacerbate the risk of contamination. Such a complex interplay of environmental and behavioral drivers positions Zhejiang Province as a valuable case for investigating regionally specific patterns and determinants of foodborne disease risk [7,8].
Between 2014 and 2023, Zhejiang Province recorded 57,474 cases of foodborne diseases. These cases provide a valuable basis for analyzing the specific factors influencing foodborne disease outbreaks in the region. Existing research has already identified key drivers of foodborne illnesses globally, such as microbial contamination, poor sanitation, and unsafe food practices; however, localized factors specific to Zhejiang remain underexplored. Understanding the regional dynamics, such as the influence of food supply chain complexities, seasonal fluctuations, and regional food preferences, is critical for developing targeted interventions.
Zhejiang Province, located on the southeast coast of China, is home to over 65 million residents, with a well-developed urban infrastructure. The province boasts one of the highest GDP per capita levels in China, driven by dynamic sectors such as manufacturing, trade, agriculture, and a burgeoning food processing industry. Zhejiang’s dietary culture features a mix of traditional fresh foods, such as seafood, rice, and vegetables, and an increasing consumption of processed and ready-to-eat products, reflecting broader national trends in food system transformation. Annual average temperatures in Zhejiang range from 16 °C to 19 °C, with humid subtropical conditions and abundant rainfall (averaging over 1500 mm per year), creating favorable environments for foodborne pathogen growth, particularly in summer months. The province’s food supply chain is characterized by both local production and substantial imports from other regions, further increasing complexity and exposure risk.
This study aims to bridge that gap by analyzing the factors that contributed to foodborne diseases in Zhejiang over the ten-year period. The research will focus on identifying patterns and trends within the dataset, examining how environmental conditions, food production practices, and consumer behaviors intersect to influence food safety outcomes. Furthermore, this analysis will explore the effectiveness of existing food safety regulations and public health interventions, with the objective of proposing data-driven recommendations for mitigating future outbreaks. By identifying the most significant risk factors in Zhejiang, this research seeks to inform more localized food safety policies and contribute to the broader understanding of foodborne disease prevention in China and beyond.
Foodborne diseases are influenced by a variety of factors, including microbial contamination, environmental conditions, and human behavior. Global literature has extensively explored the impact of foodborne pathogens, with bacteria such as Salmonella, Escherichia coli, and Listeria identified as major causative agents. These pathogens are often associated with improper hygiene practices during food processing, cross-contamination, and inadequate food storage [6]. Environmental factors, particularly temperature and humidity, are also widely recognized as critical elements that promote pathogen growth, especially under conditions where food is poorly stored or handled [9,10].
Studies have shown that foodborne diseases often exhibit seasonal patterns. Specifically, higher rates of illness are observed during summer months, when elevated temperatures accelerate microbial reproduction, increasing the likelihood of disease outbreaks [4,11]. For example, in many regions of China, including Zhejiang Province, the incidence of foodborne diseases is significantly higher in the summer. Additionally, improper food handling practices in households and the food service industry have been found to significantly increase the risk of foodborne illnesses [12,13].
Socioeconomic factors also play a significant role in the spread of foodborne diseases. Rapid urbanization, the growing demand for ready-to-eat foods, and the increasing complexity of the food supply chain have heightened the risk of contamination [9,10,14]. In Zhejiang Province, rapid urban development, alongside the expansion of the food industry and changes in consumer habits, has created more opportunities for foodborne disease outbreaks. Furthermore, small- and medium-sized enterprises (SMEs) in the food sector, due to regulatory shortcomings and inadequate enforcement of food safety standards, are often cited as potential sources of foodborne disease risk [9,15,16].
In terms of food safety regulations, while China has implemented various safety standards for food production and distribution, challenges remain in the enforcement of these standards, particularly in smaller enterprises. These challenges contribute to the persistence of food safety hazards [17,18]. Additionally, a lack of consumer awareness about food safety further exacerbates the risk of foodborne disease outbreaks [19,20].
In the context of international research, numerous studies have confirmed that climate, food supply chain complexity, socioeconomic development, and population dietary structure are key risk factors shaping the epidemiology of foodborne diseases in diverse regions [11,12,13]; however, localized factors specific to Zhejiang Province remain underexplored. For example, the role of Zhejiang’s climate, dietary habits, and the complexity of the regional food supply chain in influencing foodborne disease outbreaks requires further research. Moreover, the effectiveness of public health interventions and government regulations at the local level has not been sufficiently evaluated. This gap in the literature highlights the need for more localized studies to analyze these complex factors and inform targeted foodborne disease prevention policies in Zhejiang Province.
While the existing literature has made significant progress in understanding the common causes and contributing factors of foodborne diseases, a clear research gap exists in the localized analysis of these factors in Zhejiang Province. Most studies focus on global or national trends, often overlooking the specific conditions in Zhejiang, such as its climate, dietary habits, and the complexity of its regional food supply chain. Furthermore, there is limited research evaluating the effectiveness of existing food safety regulations and public health interventions at the local level. Therefore, this study aims to fill this gap by analyzing the factors influencing foodborne diseases in Zhejiang Province from 2014 to 2023, providing data-driven insights for the development of more effective and targeted prevention strategies.

2. Data

2.1. Data Source

We utilized foodborne disease surveillance data from Zhejiang Province, China, spanning the years 2014 to 2023. The dataset was obtained from the Zhejiang Provincial Center for Disease Control and Prevention (CDC), specifically from its Foodborne Disease Surveillance System, which aggregates case reports from 101 designated hospitals across the province.
A total of 57,474 individual foodborne disease case records were initially collected (see Appendix A, Table A1). After excluding cases in which patients received medical treatment outside Zhejiang Province, 56,970 valid cases were retained for analysis. The dataset also includes 91 risk-related indicators, encompassing demographic characteristics, clinical symptoms, dietary exposure, and environmental conditions.
The data cover 11 municipal-level cities in Zhejiang Province, representing a total population of approximately 66 million. This comprehensive coverage ensures the representativeness and reliability of the dataset for identifying temporal trends and regional variations in foodborne disease etiology.
In addition, we incorporated provincial-level socioeconomic and environmental indicators extracted from the Zhejiang Statistical Yearbooks (2014–2024). These indicators were matched with the foodborne disease case records based on corresponding years, enabling integrated analyses of epidemiological trends in relation to regional development factors such as gross domestic product (GDP), agricultural output, population density, and climate-related variables. A detailed list of variables and their descriptive statistics is provided in Appendix A, Table A2.

2.2. Data Integration

To enable the application of machine learning models (Random Forest, XGBoost), all collected data were linked and integrated into a unified database. Specifically, the foodborne disease records from Zhejiang Province (a total of 56,970 cases) were expanded by incorporating 91 indicator variables as previously described (see Appendix A, Table A2). This process resulted in a fully integrated dataset comprising 56,970 records. In matching the foodborne disease cases with corresponding regional, climatic, and socioeconomic data, two key variables from the original records, the county (or district) of the patient’s residence and the date of medical consultation, were used as the basis for data alignment.
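The alignment described above can be pictured as a keyed lookup from each case to the indicators for its county and year. The following is a minimal sketch under stated assumptions: the field names (`county`, `visit_date`), the indicator names, and all values are hypothetical placeholders, not the study's actual schema or data.

```python
from datetime import date

# Hypothetical case records; field names and values are illustrative only.
cases = [
    {"case_id": 1, "county": "county_a", "visit_date": date(2016, 7, 3)},
    {"case_id": 2, "county": "county_b", "visit_date": date(2021, 1, 15)},
]

# Hypothetical yearly indicators keyed by (county, year).
indicators = {
    ("county_a", 2016): {"gdp_index": 1.00, "mean_temp_c": 17.5},
    ("county_b", 2021): {"gdp_index": 1.35, "mean_temp_c": 18.1},
}


def integrate(case_records, yearly_indicators):
    """Attach the matching (county, year) indicators to every case record."""
    merged = []
    for case in case_records:
        key = (case["county"], case["visit_date"].year)
        # Left join: a case without matching indicators keeps only its own fields,
        # so the number of records never changes.
        merged.append({**case, **yearly_indicators.get(key, {})})
    return merged


integrated = integrate(cases, indicators)
```

Because the join is keyed on the case side, the integrated table has exactly one row per case, which matches the statement that the record count is unchanged after integration.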

2.3. Data Description

To identify key predictors of foodborne disease types, we compiled 91 candidate features based on case records and matched external data sources, including climate, agriculture, demographics, and food-related variables (see Appendix A, Table A2 and Table A3). Lasso regression was applied to perform feature selection using L1 regularization (α = 0.01), resulting in 26 retained features with non-zero coefficients (see Figure 1). These features included ‘Age’, ‘Prefecture’, ‘Food name’, ‘Processing and Packaging Methods’, ‘Occupation’, ‘Purchase’, ‘Eat place’, ‘Climate policy’, ‘Mortality rate’, ‘Fruit yield’, ‘Orange yield’, ‘Cattle’, ‘Hospitals’, and ‘Insurance’, among others. These 26 features were ultimately used as the input variables for subsequent model training.
To better understand the epidemiological characteristics of foodborne disease cases in Zhejiang Province from 2014 to 2023, we conducted descriptive statistical analysis on the selected variables (see Appendix A, Table A3), focusing on key categorical variables, including food category, patient occupation, food purchase source, and type of eating venue (see Figure 2). Among all reported cases, households were the most common eating venues, accounting for 56.6%, followed by other types of locations (27.8%), which mainly include informal or unclassified settings. Catering services accounted for 11.5% of cases, canteens or schools for 3.2%, and retail stores for only 0.9%. These results suggest that home-cooked meals remain a major setting for foodborne disease outbreaks, underscoring the need to strengthen public education on food safety and promote safe food handling practices at the household level. In terms of food procurement sources, the largest proportion of cases (46.9%) were associated with “other” sources, likely referring to unrecorded or informal channels. This was followed by household purchases (26.5%), shops (13.3%), and retail markets (10.2%). Purchases from catering services and canteens together accounted for 3.1%. The high percentage of “other” sources reflects potential limitations in the current data classification system and highlights the need for improved precision in food source reporting. Regarding patient occupation, the majority of cases (58.6%) occurred among migrant workers and related labor groups, followed by individuals from the education and health sectors, including students, teachers, and healthcare workers (26.1%). Patients with unknown or other occupations made up 9.7%, and those employed in the catering industry accounted for 5.7%. This occupational distribution suggests that labor-intensive populations with relatively low health literacy may be at elevated risk of foodborne illness.
As for the implicated food categories, aquatic products accounted for the highest proportion (21.2%), followed by nuts/seeds and legumes (16.3%), meat and meat products (16.1%), and grains and grain-based products (9.5%). Vegetables, eggs/dairy, and mixed foods were also significant contributors. Notably, “unknown” or unclassified foods (category 10) accounted for 6.3% of cases, indicating persistent challenges in food traceability and classification accuracy.
In summary, the descriptive analysis reveals that home consumption, informal food sourcing, and labor-intensive occupational groups are the primary correlates of reported foodborne disease cases in Zhejiang Province. These findings provide important insights for risk assessment and the design of targeted public health interventions.

3. Methods

To identify key determinants of foodborne disease incidence and improve prediction accuracy, we developed a comprehensive machine learning framework integrating two classifiers, Random Forest and XGBoost, in combination with SMOTE oversampling and Lasso-based feature selection. This approach ensures robust model training, effective handling of imbalanced data, and interpretable identification of significant risk factors.

3.1. Machine Learning Algorithms

Random Forest (RF). Random Forest is an ensemble learning algorithm introduced by Breiman [21], designed to improve prediction accuracy and control overfitting by combining the results of multiple decision trees built from bootstrap samples of the data. In RF, each tree is trained on a random subset of the observations and a random subset of the predictors at each split, resulting in diverse models whose aggregated predictions provide enhanced generalization. This approach makes Random Forest particularly robust to noise and outliers, and suitable for high-dimensional data, mixed variable types, and situations with multicollinearity. Additionally, the algorithm naturally provides variable importance measures, enabling interpretation and feature selection.
RF is non-parametric and does not assume any specific distribution of the predictors or outcome variable. It is especially useful in biomedical and epidemiological applications where the relationships between predictors and outcomes are complex, possibly non-linear, and include higher-order interactions. In our study, RF was employed to classify foodborne disease types based on a wide range of demographic, clinical, environmental, and agricultural features [22].
Extreme Gradient Boosting (XGBoost). XGBoost is an advanced implementation of gradient boosting decision trees, developed for speed and performance. Unlike bagging-based methods such as RF, XGBoost is based on boosting, where trees are built sequentially and each new tree corrects the residual errors of the combined previous ensemble. XGBoost introduces several enhancements including L1 and L2 regularization, shrinkage (learning rate), column and row subsampling, and efficient handling of missing data, which together improve both accuracy and generalization. XGBoost has demonstrated state-of-the-art results in many data mining competitions and is particularly effective for tabular datasets with complex feature interactions, class imbalance, or missing values [23].
In the context of multi-class classification problems with potentially imbalanced class distributions and high-dimensional predictors, XGBoost offers improved control over overfitting and often outperforms single-tree or linear models. The algorithm also provides detailed feature importance metrics such as gain, cover, and frequency, facilitating model interpretation and providing insight into key determinants.
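The sequential, residual-correcting idea behind boosting can be illustrated with a deliberately small example. The sketch below is plain gradient boosting with squared-error loss and one-split trees on a single feature; it is not XGBoost itself (it has no regularization, subsampling, or second-order terms), and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold on a 1-D feature that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # drop the max so the right leaf is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=20, learning_rate=0.1):
    """Add shrunken stumps one at a time, each fitted to the current residuals."""
    prediction = [sum(y) / len(y)] * len(y)  # start from the mean
    mse_history = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        prediction = [pi + learning_rate * stump(xi) for xi, pi in zip(x, prediction)]
        mse_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y))
    return prediction, mse_history


# Hypothetical toy data with a step-like relationship.
x_toy = [1, 2, 3, 4, 5, 6, 7, 8]
y_toy = [1.0, 1.0, 2.0, 2.0, 5.0, 5.0, 6.0, 6.0]
fitted, mse_history = boost(x_toy, y_toy)
```

The learning rate plays the role of the shrinkage parameter mentioned above: smaller values make each tree's correction more cautious, so more rounds are needed to reach the same training error.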
Rationale for Algorithm Selection. The selection of Random Forest and XGBoost was motivated by several considerations. First, both algorithms are highly effective for multi-class classification tasks involving structured tabular data, and they are capable of handling both categorical and numerical features without extensive preprocessing. Both perform well even in the presence of missing values and outliers—conditions commonly encountered in real-world public health datasets. Second, their ensemble nature allows them to model complex, non-linear relationships and variable interactions that may not be captured by traditional parametric approaches. Third, both RF and XGBoost offer intrinsic methods for evaluating variable importance, which is critical for interpreting model results and identifying key risk factors of foodborne disease types.
Moreover, the combined use of RF (a bagging-based algorithm that reduces variance) and XGBoost (a boosting-based algorithm that addresses bias and incorporates flexible regularization) enables a comprehensive assessment of model performance, generalizability, and feature relevance. Comparing the results from both models helps identify consistent predictors and provides confidence in the robustness of the findings.

3.2. Model Implementation

All data preprocessing, machine learning modeling, and model evaluation were performed using Python 3.8 on the Anaconda 3 Jupyter Notebook 6.4.12 platform. Major libraries included scikit-learn (sklearn), xgboost, pandas, numpy, matplotlib, and seaborn. All analyses were conducted on a personal computer equipped with an Intel i7 processor and 16 GB RAM, running Windows 10.
In this study, model development and evaluation followed these steps: (1) Data preprocessing included automatic encoding detection, removal of rows with missing values, and replacement of infinite values with zero. (2) Feature matrix included all predictors except the target variable (Bacteria), which was encoded as integer classes. (3) The dataset was split into training (80%) and validation (20%) sets using stratified random sampling to maintain class proportions. (4) Both RF and XGBoost models underwent hyperparameter optimization using grid search with five-fold cross-validation. (5) The best models were trained on the entire training set and evaluated on the validation set. (6) Model performance was assessed using accuracy, precision, recall, F1-score, and confusion matrices for each class. (7) Feature importance was extracted and visualized to support epidemiological interpretation.
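Step (3) relies on stratification, meaning the 80/20 split is made within each class so that rare pathogens appear in both sets in roughly their original proportion. The sketch below illustrates the idea in plain Python; the class counts are hypothetical, and the function is a stand-in for whatever library routine was actually used.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split sample indices so each class keeps about the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Hypothetical class counts: two common classes and one rare class.
labels = [1] * 50 + [3] * 40 + [4] * 10
train_idx, valid_idx = stratified_split(labels)
```

With a purely random split, a class with only ten cases could easily end up with none in the validation set; splitting per class avoids that.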
The dataset exhibited a notable class imbalance, with certain foodborne pathogens represented far less frequently than others. To address this issue, we employed the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic samples of the minority class by interpolating between existing observations in feature space. This method has been shown to enhance classifier performance on imbalanced datasets by improving the representation of minority classes during model training. SMOTE was applied exclusively to the training data to avoid information leakage and ensure realistic evaluation on the validation set.
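The interpolation step at the heart of SMOTE can be shown in a few lines. This is a simplified sketch, not the imbalanced-learn implementation: it handles a single minority class, uses Euclidean distance, and the sample points and the choice of `k` are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create synthetic minority samples by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        lam = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + lam * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Hypothetical minority-class feature vectors (two numeric features).
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.0)]
synthetic_points = smote_sketch(minority_points, n_new=6)
```

Each synthetic point lies on a line segment between two real minority samples, which is why the method must only see training data: interpolating with validation samples would leak their feature values into the model.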
A total of 91 candidate predictors were initially considered, encompassing environmental, behavioral, and socio-economic variables. To reduce dimensionality and prevent overfitting, we applied Lasso regression, a linear model with L1 regularization that shrinks the coefficients of less relevant variables to zero, thereby performing implicit feature selection. The regularization strength (alpha) was set at 0.01 based on the prior literature and tuning experiments. Only features with non-zero coefficients were retained for subsequent modeling, ensuring parsimony and interpretability.
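To make the shrink-to-zero behaviour concrete, the sketch below solves a small Lasso problem by coordinate descent with soft-thresholding. It is an illustration only, assuming centred data with no intercept and the objective (1/(2n))·||y − Xw||² + α·||w||₁; the tiny design matrix and response are hypothetical, and this is not the solver used in the study.

```python
def soft_threshold(value, alpha):
    """Shrink a value toward zero by alpha; values within [-alpha, alpha] become 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha=0.01, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 without an intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that leaves feature j out.
            rho = 0.0
            for i in range(n):
                partial = y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * partial
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / scale
    return w


# Hypothetical orthogonal design: y depends strongly on feature 0,
# very weakly on feature 1, and not at all on feature 2.
X_toy = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y_toy = [2.005, -1.995, 1.995, -2.005]

weights = lasso_coordinate_descent(X_toy, y_toy, alpha=0.01)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
```

In this toy case only feature 0 survives: the weak effect of feature 1 falls below the penalty threshold and its coefficient is set exactly to zero, which is the mechanism by which features are discarded.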
After initial data cleaning and exclusion of out-of-province cases, 56,970 valid records remained, representing all 11 municipal-level cities in Zhejiang Province. The dataset was randomly divided into a training set (80%) and a validation set (20%) to facilitate robust model evaluation. Preprocessing steps included automatic file encoding detection and correction, imputation of missing values using mean or mode strategies, standardization of continuous features and label encoding of categorical variables, and target assignment based on confirmed pathogen categories (e.g., Vibrio parahaemolyticus, Salmonella, Norovirus, E. coli, etc.). These procedures ensured that the input data were clean, consistent, and suitable for training all supervised learning models.
To ensure optimal predictive performance and minimize the risk of overfitting, systematic hyperparameter tuning was conducted for both the Random Forest and XGBoost models in this study. A grid search with five-fold cross-validation was employed to comprehensively evaluate key hyperparameters for each algorithm. For the Random Forest model, the optimal configuration was determined as follows: max_depth = None, max_features = ‘sqrt’, min_samples_leaf = 1, min_samples_split = 10, and n_estimators = 500. This parameter set enables the construction of fully expanded trees, random feature selection at each split, and a substantial number of trees, enhancing the model’s robustness and ability to capture complex patterns. For the XGBoost model, the best hyperparameters were identified as follows: colsample_bytree = 0.8, learning_rate = 0.1, max_depth = 10, n_estimators = 100, and subsample = 0.8. This configuration balances model complexity with generalization capacity and facilitates effective learning from heterogeneous data. These final parameter settings underlie the model performance reported in Section 4.2.
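The cost of a grid search grows with the product of the candidate values and the number of folds. The sketch below enumerates a small Random Forest grid and five folds in plain Python. Only the values reported above as optimal come from the text; the alternative values in the grid, the sample size, and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical search grid. The reported optimum is included; the other
# candidate values are assumptions made for illustration.
rf_grid = {
    "n_estimators": [100, 500],
    "max_depth": [None, 10],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1],
    "max_features": ["sqrt"],
}


def expand_grid(grid):
    """Return every combination of hyperparameter values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if not start <= i < start + size]
        yield train, valid
        start += size


candidates = expand_grid(rf_grid)
folds = list(k_fold_indices(n_samples=23, k=5))
n_fits = len(candidates) * len(folds)  # one model fit per candidate per fold
```

Each candidate is scored by its average validation performance across the five folds, and the best-scoring combination is then refitted on the full training set.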
The two classifiers, Random Forest and XGBoost, were trained on the selected features using the SMOTE-balanced training set. After training, feature importance was evaluated using both model-specific and model-agnostic approaches. Random Forest provided impurity-based importance scores that reflect each feature’s contribution to reducing node impurity across the trees, while XGBoost generated gain-based metrics indicating how much the splits on each feature improved the model. To unify the interpretation across models, SHAP (SHapley Additive exPlanations) was employed to quantify each feature’s marginal contribution to individual predictions, offering a consistent and interpretable assessment of variable influence. Bar plots of SHAP values and model-derived importance scores were created to visually compare the top contributing features across models, enabling a robust evaluation of the most influential risk factors for foodborne disease occurrence.
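The notion of a feature's marginal contribution can be made concrete by computing exact Shapley values for a toy model. The sketch below enumerates all feature subsets and replaces absent features with a single baseline value; this brute-force approach is an assumption made for illustration and is not the algorithm the SHAP library uses for tree models. The model and inputs are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical model with one additive term and one interaction term."""
    return 2 * features[0] + features[1] * features[2]


x_example = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x_example, baseline)
```

The contributions sum to the difference between the prediction for `x_example` and the prediction for the baseline, and the interaction term is shared equally between the two features that produce it.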
To comprehensively evaluate model performance, we applied multiple standard classification metrics across both the training and validation sets. These included accuracy, precision, recall, and F1-score, which were derived from classification reports to quantify the models’ ability to correctly identify different foodborne disease categories. Confusion matrices were also generated and visualized using heatmaps, allowing for a clear comparison between predicted and actual labels. Furthermore, we conducted cross-model evaluations to assess the consistency, robustness, and interpretability of predictions between the two classifiers. This comparative analysis not only highlighted the relative strengths and weaknesses of each model but also provided insight into the trade-offs between model complexity, predictive accuracy, and explanatory capacity.
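Macro and weighted averages of the F1-score can diverge sharply on imbalanced data, because the macro average gives every class equal weight while the weighted average follows class size. The sketch below computes both from a confusion matrix; the matrix is hypothetical and chosen so that a small class is never predicted.

```python
def per_class_f1(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(confusion)
    scores, supports = [], []
    for c in range(n_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
        supports.append(tp + fn)
    return scores, supports


def macro_and_weighted_f1(confusion):
    """Return (macro F1, support-weighted F1)."""
    scores, supports = per_class_f1(confusion)
    macro = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
confusion = [
    [80, 20, 0],
    [20, 80, 0],
    [5, 5, 0],  # minority class is never predicted, so its recall is 0
]
macro_f1, weighted_f1 = macro_and_weighted_f1(confusion)
```

A class with zero recall pulls the macro average down by a full share, while barely moving the weighted average, so reporting both gives a fairer picture of minority-class performance.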
To examine the temporal dynamics of foodborne pathogens, we first conducted exploratory time-series visualizations using monthly and quarterly groupings. Seasonal patterns were then assessed using seasonal differencing and the Augmented Dickey–Fuller test to ensure stationarity. A Seasonal Autoregressive Integrated Moving Average (SARIMA) model was applied to account for trends, seasonality, and randomness in the data. Additionally, classical time-series decomposition was conducted to isolate trend, seasonal, and residual components. Model diagnostics, including the Ljung–Box and heteroscedasticity tests, confirmed the validity and stability of the SARIMA model.
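Two of the simpler steps above, seasonal differencing and a month-by-month seasonal profile, can be sketched without any time-series library. The example below covers only those steps, not the SARIMA fit or the Augmented Dickey–Fuller test, and the monthly counts are hypothetical.

```python
def seasonal_difference(series, period=12):
    """Subtract the value one full season earlier: y[t] - y[t - period]."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


def seasonal_means(series, period=12):
    """Average value at each position in the cycle (for example, each calendar month)."""
    return [sum(series[m::period]) / len(series[m::period]) for m in range(period)]


# Hypothetical monthly case counts over three years:
# a linear upward trend plus a bump in June, July, and August.
summer_bump = [0, 0, 0, 0, 0, 30, 40, 30, 0, 0, 0, 0]
monthly_counts = [100 + 2 * t + summer_bump[t % 12] for t in range(36)]

differenced = seasonal_difference(monthly_counts)
profile = seasonal_means(monthly_counts)
peak_month = profile.index(max(profile)) + 1  # 1 = January
```

In this constructed series, seasonal differencing removes the repeating summer pattern and leaves only the constant yearly increase, which is the sense in which differencing helps bring a seasonal series toward stationarity.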

4. Results

4.1. Trends in Pathogen Composition of Foodborne Diseases in Zhejiang Province

Figure 3 illustrates the annual composition of major foodborne disease pathogens in Zhejiang Province from 2014 to 2023, with proportions standardized to 100% each year. The data reveal substantial temporal variation in the dominant causative agents. Vibrio parahaemolyticus and Salmonella were the leading pathogens throughout most of the study period. Vibrio parahaemolyticus showed a marked decline in relative prevalence after 2016, dropping from approximately 45% to below 20% by 2023. In contrast, Salmonella infections increased notably after 2016 and remained the most prominent pathogen from 2019 to 2022, peaking at over 45% in 2020. Norovirus consistently ranked among the top three pathogens, with fluctuations between 20% and 45%, and demonstrated a resurgence in 2023. E. coli infections remained relatively stable and less prevalent, typically accounting for around 5–10% of annual cases. The “Others” category, encompassing less common pathogens, contributed a consistently minor proportion across all years. These temporal trends suggest a shifting landscape of foodborne disease etiology in Zhejiang Province, potentially reflecting changes in environmental conditions, food handling practices, surveillance sensitivity, or public health interventions.

4.2. Risk Drivers of Foodborne Disease Types

4.2.1. Model Performance Comparison

Both the Random Forest and XGBoost models were applied to classify foodborne disease types using the selected feature set. Table A4 summarizes the precision, recall, F1-score, and support for each class in both the training and validation sets.
The optimal Random Forest model, determined by grid search with five-fold cross-validation, achieved an overall accuracy of 0.86 on the training set and 0.70 on the validation set. The weighted average F1-score was 0.87 for training and 0.71 for validation (Appendix A, Table A4). The model demonstrated robust performance for the majority classes (e.g., class 1 and class 3), with validation precision and recall of 0.72/0.78 and 0.69/0.66, respectively. However, performance on minority classes remained limited: class 0 and class 4 in the validation set exhibited very low recall (0.00 and 0.21, respectively) and correspondingly low F1-scores. The macro-averaged F1-scores (0.60 for training, 0.38 for validation) highlighted the difficulty of achieving balanced classification across all categories, especially for rare types.
The XGBoost model, following hyperparameter optimization, yielded similar overall results: accuracy of 0.87 on the training set and 0.71 on the validation set. Weighted F1-scores were 0.87 (training) and 0.68 (validation). For the dominant classes, XGBoost achieved high recall (class 1: 0.86, class 3: 0.64) and F1-scores, closely matching the performance of the Random Forest. Minority classes (especially class 0 and class 4) remained challenging to predict, with validation recall of 0.00 and 0.29, respectively. The macro F1-score was 0.40 in the validation set, reflecting the continued imbalance in predictive performance across classes.
Both models exhibited strong performance on the majority foodborne disease types, with validation set accuracy of 0.70 to 0.71 and broadly similar weighted F1-scores. However, minority classes remained difficult to classify, as evidenced by substantially lower recall and F1-score for these groups. These results underscore the intrinsic challenges of multi-class, imbalanced classification in epidemiological datasets, where the majority of cases cluster in a few categories.
Confusion matrix analysis (see Figure 4) revealed that most misclassifications occurred among the minority classes, with substantial confusion between certain underrepresented disease types. Feature importance analysis indicated that demographic, environmental, and food-related variables contributed most to model predictions.

4.2.2. Feature Importance Analysis

To further elucidate the factors contributing to the classification of foodborne disease types, feature importance was examined for both the Random Forest and XGBoost models (Figure 5).
In the Random Forest model, ‘Age’ was identified as the most influential predictor, followed by ‘Food name’ and ‘Climate policy’. Other important features included ‘Processing and Packaging Methods’, ‘Occupation’, ‘Eat place’, and several agricultural or environmental variables such as ‘Orange yield’, ‘Fruit yield’, and ‘Mortality rate’. The prominence of age- and food-related characteristics highlights the combined effect of host susceptibility and exposure routes in foodborne disease risk, while the significance of climate and agricultural variables suggests the importance of external and environmental factors.
In the XGBoost model, ‘Food name’ and ‘Age’ ranked as the top two most important features, followed by ‘Climate policy’, ‘Processing and Packaging Methods’, and ‘Purchase’. Notably, both models consistently identified food attributes, age, and policy/environmental factors as primary determinants of disease type. The overlap in high-ranking variables between the two models underscores the robustness of these predictors across different machine learning frameworks.
These results indicate that individual-level factors (e.g., age, occupation) and environmental or systemic factors (e.g., climate policy, agricultural production) jointly shape the distribution of foodborne disease types. The findings support the multifactorial nature of foodborne disease risk and provide a basis for targeted interventions and further etiological investigation.
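One simple way to quantify the agreement between the two rankings is the overlap of their top-k features. The sketch below uses the feature names reported above; the exact order of the lower-ranked Random Forest features is an assumption based on the order in which they are listed in the text, and the helper `top_k_overlap` is hypothetical.

```python
# Top features as named in the text. The rank order beyond the first three
# Random Forest features is assumed from the order of mention.
rf_rank = [
    "Age",
    "Food name",
    "Climate policy",
    "Processing and Packaging Methods",
    "Occupation",
]
xgb_rank = [
    "Food name",
    "Age",
    "Climate policy",
    "Processing and Packaging Methods",
    "Purchase",
]


def top_k_overlap(rank_a, rank_b, k):
    """Return the shared top-k features and their Jaccard similarity."""
    top_a, top_b = set(rank_a[:k]), set(rank_b[:k])
    shared = top_a & top_b
    return sorted(shared), len(shared) / len(top_a | top_b)


shared, jaccard = top_k_overlap(rf_rank, xgb_rank, k=5)
print(shared, round(jaccard, 2))
```

Under these assumptions, four of the top five features are shared by both models, which corresponds to a Jaccard similarity of about 0.67.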

4.3. Time Pattern Analysis of Foodborne Disease Types

Analysis of the temporal patterns of pathogen types revealed pronounced seasonal differences in transmission. To investigate this seasonal variation, the data were first grouped by month, and an area plot was used to illustrate the incidence of each type across months. The data were also grouped by quarter, and a line graph was used to display the incidence frequency of each type across the four seasons (Figure 6a). Norovirus (Category 2) showed higher incidence in spring and winter, suggesting that climatic conditions during these seasons may facilitate its transmission. Vibrio parahaemolyticus (Category 3) showed strong seasonality, with markedly elevated incidence during the warm summer months. This pattern is consistent with its known ecological preference for high-temperature and high-humidity conditions, which facilitate bacterial growth and increase the risk of food contamination. In contrast, Salmonella (Category 1) and other types (Category 0) did not exhibit significant seasonal variation, indicating that their transmission may be less influenced by seasonal changes. Escherichia coli (Category 4) showed a marked increase in incidence during the summer months, which may be attributed to warmer temperatures promoting its growth and survival in food sources and thereby increasing the risk of contamination. Whereas the seasonality of Vibrio parahaemolyticus is linked to both high temperature and humidity, the summer peak of Escherichia coli appears to be driven primarily by temperature.
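The monthly and quarterly grouping described above can be sketched as follows. This is a minimal illustration on invented records, assuming onset times in the "YYYY-MM-DD HH" format shown in Table A1 and the pathogen codes from Table A2; the calendar-quarter definition of a season and the function name `count_by_period` are assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Pathogen codes as defined in Table A2 (variable "Bacteria").
CATEGORY_NAMES = {
    0: "Others",
    1: "Salmonella",
    2: "Norovirus",
    3: "Vibrio parahaemolyticus",
    4: "Escherichia coli",
}

# Invented (onset time, category) records for illustration only.
records = [
    ("2023-02-16 02", 2),
    ("2023-07-03 11", 3),
    ("2023-08-21 19", 3),
    ("2023-08-05 08", 4),
    ("2023-12-02 22", 2),
    ("2023-05-14 13", 1),
]


def count_by_period(records, period):
    """Count cases per category, keyed by month (1-12) or quarter (1-4)."""
    counts = defaultdict(Counter)
    for onset, category in records:
        when = datetime.strptime(onset, "%Y-%m-%d %H")
        if period == "month":
            key = when.month
        else:
            key = (when.month - 1) // 3 + 1  # calendar quarter (assumption)
        counts[key][category] += 1
    return counts


monthly = count_by_period(records, "month")
quarterly = count_by_period(records, "quarter")

for quarter in sorted(quarterly):
    summary = {CATEGORY_NAMES[c]: n for c, n in quarterly[quarter].items()}
    print(f"Q{quarter}: {summary}")
```

The resulting counts per period and category are the quantities that an area plot (monthly) or line graph (quarterly) would display.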
In addition, to further elucidate the temporal regularity and short-term trends of foodborne disease incidence, a Seasonal Autoregressive Integrated Moving Average (SARIMA) model was constructed using historical surveillance data to forecast pathogen occurrence over the upcoming period (see Figure 6b and Appendix A, Table A5). The SARIMA model effectively captured the seasonal fluctuations observed in previous years and provided reasonable interval predictions for the projected bacterial counts in 2023–2024. The relatively wide prediction intervals indicate that future disease incidence may be influenced by a variety of unpredictable factors, such as extreme weather events or changes in human behavior. Therefore, dynamic surveillance and real-time early warning based on time series models such as SARIMA are essential for strengthening risk management and preparedness for foodborne diseases.
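For reference, the general multiplicative SARIMA$(p,d,q)(P,D,Q)_s$ model can be written as below. This is the standard textbook form, not the fitted model: the orders used in the study are not stated in this section, and the seasonal period $s = 12$ is an assumption based on the monthly data.

```latex
\phi_p(B)\,\Phi_P(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}\,y_t
  = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t
```

Here $y_t$ is the monthly case count, $B$ is the backshift operator ($B y_t = y_{t-1}$), $\phi_p$ and $\theta_q$ are the non-seasonal autoregressive and moving average polynomials, $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts, and $\varepsilon_t$ is white noise.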
We further investigated the temporal patterns of different pathogen types. After seasonal differencing, the augmented Dickey–Fuller (ADF) test (ADF statistic: −4.718; p < 0.001) confirmed that the series was stationary. The SARIMA model was then used to capture the trend, seasonal fluctuations, and random variation in the foodborne disease time series. The seasonal moving average term (MA.S.L12) was statistically significant, indicating a strong seasonal effect and suggesting that seasonally varying conditions, such as temperature and humidity, may play a crucial role in transmission. The model fit was good; although the residuals deviated somewhat from normality, this did not materially affect the model’s predictive capability. The Ljung–Box and heteroscedasticity tests supported the model’s validity and stability.
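Seasonal differencing, the step that precedes the stationarity test, subtracts from each observation the value one seasonal period earlier. The sketch below applies it to a synthetic monthly series with a linear trend and a fixed annual cycle; the period of 12 and the synthetic data are assumptions for illustration. A stationarity test such as `statsmodels.tsa.stattools.adfuller` would then be run on the differenced series; that step is omitted here to keep the example dependency-free.

```python
import math


def seasonal_difference(series, period=12):
    """Return y[t] - y[t - period] for every t >= period."""
    if len(series) <= period:
        raise ValueError("series must be longer than one seasonal period")
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Synthetic monthly counts: linear trend plus a fixed 12-month cycle.
series = [100 + 0.5 * t + 20 * math.sin(2 * math.pi * t / 12) for t in range(48)]

diffed = seasonal_difference(series, period=12)

# The annual cycle cancels; only the 12-month trend increment (0.5 * 12) remains.
print(len(diffed), round(min(diffed), 6), round(max(diffed), 6))
```

Because the seasonal cycle repeats exactly every 12 months in this synthetic series, the differenced values are constant, which illustrates how the operation removes a stable seasonal pattern.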
Time series decomposition revealed the temporal patterns of pathogen types, with the seasonal component being particularly prominent. Figure 6c presents the decomposition in four subplots: the original observed values, the trend component, the seasonal component, and the residual component. The observed series exhibits clear periodic fluctuations, with oscillation patterns that repeat across years. The trend component shows that the average level remained relatively stable between 2014 and 2023, with no obvious upward or downward trend. The seasonal component shows peaks and troughs at fixed times each year, indicating that the variation is strongly influenced by seasonal factors. The residual component suggests that, although most of the variation is explained by the trend and seasonal components, some fluctuations are not captured by the model; these may reflect random variation or unmeasured factors.
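The four panels described above correspond to a decomposition of the observed series into trend, seasonal, and residual parts. The sketch below implements a classical additive decomposition with a centered moving average; it assumes an even seasonal period (12 months) and an additive structure, neither of which is confirmed by the text, and it runs on synthetic data rather than the surveillance series.

```python
def centered_moving_average(series, period=12):
    """2 x period centered moving average (even period); None at the edges."""
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        # End points get half weight so the window spans exactly one period.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend


def decompose_additive(series, period=12):
    """Split a series into trend, seasonal, and residual components."""
    trend = centered_moving_average(series, period)
    # Average the detrended values for each position within the cycle.
    buckets = [[] for _ in range(period)]
    for t, (y, m) in enumerate(zip(series, trend)):
        if m is not None:
            buckets[t % period].append(y - m)
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period  # center the seasonal pattern on zero
    pattern = [r - offset for r in raw]
    seasonal = [pattern[t % period] for t in range(len(series))]
    residual = [
        None if m is None else y - m - s
        for y, m, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Synthetic example: flat level of 50 with a summer-peaking monthly pattern.
SEASON = [-8, -6, -4, -1, 3, 7, 10, 9, 4, -2, -5, -7]
series = [50 + SEASON[t % 12] for t in range(48)]

trend, seasonal, residual = decompose_additive(series, period=12)
print([round(s, 2) for s in seasonal[:12]])
```

On this synthetic series the trend is flat, the seasonal component reproduces the monthly pattern, and the residuals are zero; with real data the residual panel would hold whatever the trend and seasonal components leave unexplained.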

5. Conclusions

This study provides a comprehensive assessment of the temporal evolution, risk determinants, and seasonal dynamics of foodborne disease pathogens in Zhejiang Province from 2014 to 2023. The analysis revealed pronounced shifts in the dominant etiological agents over the past decade: the relative prevalence of Vibrio parahaemolyticus declined markedly after 2016, while Salmonella emerged as the leading pathogen from 2019 onwards, peaking in 2020. Norovirus consistently ranked among the top three pathogens, exhibiting notable fluctuations and a resurgence in 2023. Escherichia coli infections remained relatively stable and less prevalent, and other bacterial types contributed a consistently minor proportion each year. These findings suggest that changes in environmental conditions, food handling practices, surveillance sensitivity, and public health interventions have collectively shaped the landscape of foodborne disease etiology in the region.
In terms of risk prediction, both Random Forest and XGBoost models performed well in classifying the major foodborne disease types, with validation set accuracies of 0.70 and 0.71 and weighted F1-scores of 0.71 and 0.68, respectively. Feature importance analyses highlighted patient age, food name, climate policy, and processing methods as key determinants, underscoring the joint influence of host susceptibility, exposure routes, and environmental or systemic factors on disease type distribution. However, the models’ ability to identify minority classes remained limited, reflecting the inherent challenges of multi-class, imbalanced classification in epidemiological datasets and signaling the need for further methodological refinement.
Temporal analyses demonstrated significant seasonal fluctuations in the incidence of specific bacterial types, particularly Vibrio parahaemolyticus and Norovirus, which showed peaks during the warmer months. The SARIMA model effectively captured these seasonal variations and provided dynamic forecasts of future pathogen occurrence, while time series decomposition confirmed the predominance of periodic and trend components in the temporal patterns of foodborne diseases. Nevertheless, some residual variation remained unexplained, likely due to stochastic or unmeasured influences.
Overall, this study confirms the multifactorial and spatiotemporally heterogeneous nature of foodborne disease risk, offering theoretical and practical insights for dynamic early warning, targeted interventions, and evidence-based policy development. Future work should prioritize the integration of high-resolution individual-level data, multi-source information fusion, and the advancement of methods for rare class detection, with the aim of improving model generalizability and strengthening public health preparedness.

6. Discussion

This study provides a comprehensive investigation into the temporal and structural dynamics of foodborne diseases in Zhejiang Province from 2014 to 2023, yielding insights into changing pathogen prevalence, key predictors, and seasonal transmission patterns. These findings contribute to the broader public health effort to achieve Sustainable Development Goal 3 (SDG 3: Good Health and Well-being) by enhancing disease surveillance and prevention systems.
First, the longitudinal analysis of pathogen composition revealed a shifting landscape in foodborne disease etiology. The declining dominance of Vibrio parahaemolyticus and the concurrent rise of Salmonella suggest evolving environmental or behavioral risk contexts. These changes may be attributed to improvements in seafood hygiene practices, modifications in dietary habits, or shifts in public health reporting accuracy. Notably, Norovirus maintained a consistently high prevalence and resurged in 2023, underscoring the need for persistent vigilance in viral gastroenteritis control. These findings align with the WHO’s emphasis on food safety as a fundamental component of health security under SDG 3.9, which targets reductions in deaths and illnesses from hazardous food and water.
Second, machine learning models demonstrated that a relatively small subset of well-selected variables can predict the causative agent of foodborne diseases. The training set accuracies of the Random Forest (86%) and XGBoost (87%) models, together with validation set accuracies of 70% and 71%, indicate the predictive value of both individual-level features (e.g., age, occupation) and environmental or systemic factors (e.g., climate policy, agricultural production). These predictors are not only statistically robust but also actionable from a policy perspective. For instance, food names and processing methods can guide targeted inspections and traceability interventions, while demographic indicators like age highlight vulnerable population groups requiring enhanced protection.
Third, the seasonal analysis emphasized the climatic sensitivity of specific pathogens. Vibrio parahaemolyticus and Norovirus showed strong seasonal patterns, peaking in warmer seasons, consistent with the prior microbiological and epidemiological literature. The SARIMA modeling and time series decomposition further validated the presence of significant seasonal components, implying that disease prevention strategies must integrate climate-based early warning systems. These insights are particularly relevant to SDG 13 (Climate Action) and SDG 3.d, which calls for improved capacity for early warning, risk reduction, and management of national health risks.
However, some limitations should be acknowledged. First, while the dataset is large and comprehensive, it reflects only reported cases from sentinel hospitals, potentially omitting subclinical or underreported infections. Second, the use of administrative-level aggregated features (e.g., climate, agriculture) may mask individual-level variation. Future research could benefit from higher-resolution data, including household-level exposures and microbiological subtyping, to enhance causal inference and generalizability.
Furthermore, although LASSO regression was used for preliminary feature selection, we recognize its inherent limitations as a linear method, particularly in datasets where complex non-linear relationships may exist. To mitigate potential information loss, we incorporated Random Forest and XGBoost in the subsequent modeling stages, both of which are capable of capturing high-order, non-linear effects and interactions. Additionally, we conducted model-based feature importance ranking using these algorithms to re-evaluate the contribution of all retained variables. Our results demonstrate that certain features, such as food type and climate policy, were consistently assigned high importance by the non-linear models after the initial LASSO selection, highlighting the models’ capacity to identify complex interaction and non-linear signals. The stability of feature importance rankings across both Random Forest and XGBoost further supports the robustness of the key predictors identified in this study. This multi-stage process of feature selection and model interpretation balances dimensionality reduction with information retention and is intended to improve model generalizability across different data structures. Future work could further explore the integration of multiple feature selection strategies, feature engineering, and interaction term expansion to better address highly non-linear and imbalanced epidemiological datasets.
This study highlights the value of integrating machine learning, environmental data, and time-series modeling in understanding foodborne disease dynamics. It underscores the need for targeted, seasonally adaptive, and geographically sensitive public health strategies to reduce the burden of foodborne illnesses and advance sustainable health system resilience.
Future research should integrate higher-resolution data sources, such as individual and household-level exposure records, dietary behavior, and pathogen subtyping, to address the limitations of aggregated datasets. This approach would enable more precise identification of high-risk populations and critical transmission routes, thereby enhancing the targeting and scientific basis of disease prevention efforts. Additionally, the application of spatial epidemiology and advanced artificial intelligence modeling is encouraged to systematically assess the roles of geographic heterogeneity, spatial clustering, and climate change in foodborne disease dynamics, providing a more forward-looking foundation for early warning and adaptive intervention.
Moreover, future work should emphasize the integration of multi-source big data—including electronic health records, food traceability systems, and remote sensing meteorological data—with state-of-the-art algorithms such as deep learning and graph neural networks. It is recommended that future research employ causal inference and policy simulation methods to evaluate the real-world effectiveness and cost-effectiveness of different public health interventions. Additionally, strengthening international comparisons and cooperation is essential to advance knowledge sharing and global governance in foodborne disease prevention.

Author Contributions

C.J. conceived and designed the study, performed the data analysis, and interpreted the results. X.Q., J.W., L.C., and J.C. conceived and designed the study. H.Y. assisted with language editing and formatting of the initial manuscript draft. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC), grant 72303209 (C.J.); the China Postdoctoral Science Foundation, grant GZB20230644 (C.J.); and the Medical and Health Science and Technology Project of Zhejiang Province, grants 2025KY766 (J.W.) and 2025KY767 (X.Q.).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article. For access to the anonymized dataset and the Python code used in this study, please contact the corresponding author.

Acknowledgments

The authors gratefully acknowledge all individuals whose support, feedback, or assistance contributed to the successful completion of this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Example case records of foodborne diseases in Zhejiang Province.
Code | Item | Content
1 | Case ID | ***
2 | Outpatient Number | ***
3 | Follow-up Visit (Yes/No) | No
4 | Hospitalized (Yes/No) | No
5 | Inpatient Number | ***
6 | Gender | Female
7 | Date of Birth | 1999-01-01
8 | Age | 26
9 | Occupation | Healthcare workers
10 | Workplace | ***
11 | Patient Category | Other districts in this city
12 | Address - Province | Zhejiang
13 | Address - City | Ningbo
14 | Address - District/County | Haishu
15 | Detailed Address | ***
16 | Onset Date | 2023-02-16 02
17 | Consultation Date | 2023-02-16 16
18 | Time of Death | -
19 | Main Symptoms and Signs | [Digestive system] Nausea, vomiting once/day
20 | Medical History | None
21 | Preliminary Diagnosis | [Viral] Norovirus
22 | Antibiotic Use Before Consultation (Yes/No) | No
23 | Suspected Foodborne Case (Yes/No) | Yes
24 | Biological Sample (Yes/No) | Yes
25 | Food Name | beef noodles
26 | Food Category | Meat and meat products
27 | Processing and Packaging Method | Catering service industry
28 | Purchase Province | Zhejiang
29 | Purchase City | Ningbo
30 | Purchase District/County | Jiangbei
31 | Detailed Purchase Address | ***
32 | Purchase Location Type | Shop
33 | Eating Location - Province | Zhejiang
34 | Eating Location - City | Ningbo
35 | Eating Location - District/County | Jiangbei
36 | Detailed Eating Location Address | ***
37 | Eating Location Type | Catering
38 | Number of People Dining Together | 1
39 | Eating Time | 2023-02-10 19
40 | Food Sample Collected (Yes/No) | No
41 | Other People Affected (Yes/No) | No
42 | Sample ID | ***
43 | Sample Type | Fecal sample
44 | Sample Value | 5
45 | Sample Unit | Pcs
46 | Sample Collection Date | 2023-02-13
47 | Test Item | Norovirus
48 | Testing Institution | ***
49 | Testing Date | 2023-02-13
50 | Qualitative Result | +
51 | Testing Unit |
52 | Strain Isolated (Yes/No) | Yes
53 | Strain ID | ***
54 | Identification Method | Normal PCR
55 | Target Gene Detection | -
56 | Serotyping | -
57 | Identification Conclusion | Norovirus
91 | Strain Depository Organization | ***
Note: For patient privacy, this table contains fictitious data modeled after real information. *** indicates redacted personal information. +: Positive. -: Not available.
Table A2. Descriptions of the identified factors, their units, and data sources.
Code | Variable | Description | Data Cleaning Rules | Data Source
1 | Prefecture | The patient’s municipal address | Encoded with the corresponding administrative division codes, for example: Hangzhou → 3301; Nanjing → 3201; Wuhan → 4201. | Zhejiang Provincial Center for Disease Control and Prevention, 2014–2023.
2 | Food | Food category | The encoding for food categories is as follows: Grains and their products (including starch sugars, baked goods, and various staple foods) = 1; Meat and meat products = 2; Oils and fats = 2; Aquatic animals and their products = 3; Eggs and egg products = 4; Dairy products = 4; Fruits and their products (including dried and preserved fruits) = 5; Vegetables and their products = 6; Fungi and their products = 6; Algae and their products = 6; Legumes and their products = 7; Nuts and seeds and their products = 7; Beverages and frozen drinks = 8; Alcoholic beverages and their products = 8; Packaged drinking water (including bottled water) = 8; Mixed foods = 9; Various foods = 9; Blank = 10; Packaged bulk products = 10; Unknown foods = 10; Condiments = 10; Other foods = 10; Candies, chocolates, honey, and their products = 10; Infant foods = 10 |
3 | Age | The patient’s age, which includes different units such as “years,” “months,” and “days.” | The age values are standardized and converted into numerical values expressed in years. For example: “40 years old” → 40; “1 year and 4 months” → 1 + 4/12 ≈ 1.33; “8 months” → 8/12 ≈ 0.67; “9 months and 30 days” → 9/12 + 30/365 ≈ 0.83 |
4 | Sex | Record the patient’s gender. | Convert to numerical values: Male = 1; Female = 0. |
5 | Occupation | Indicating the occupation of the patient | The coding for numerical values is as follows: Catering industry = 1; Migrant workers = 2; Unknown = 4; Teachers = 3; Students = 3; Others = 4; Farmers = 2; Dispersed children = 2; Commercial services = 1; Pastoralists = 2; Fishermen = 2; Administrative staff = 3; Preschool children = 3; Healthcare workers = 3; Retired personnel = 3; Homemakers and unemployed = 2; Workers = 2. |
6 | Eat place | The type of location where food is consumed. | The categorical variables were encoded as numerical values: Catering industry = 1; Canteen = 2; Household = 3; Rural banquets = 1; Retail market = 4; Other = 5; School = 2; Type of eating venue = 5. |
7 | Purchase | Represents the type of location where food is purchased. | The categorical variables were encoded as follows: Catering Industry = 1, Canteen = 2, Household = 3, Street Vendor = 1, Retail Market = 4, Others = 6, and Shop = 5. |
8 | Bacteria | Indicating the bacterial classification or category. | Classification codes: Salmonella = 1; Norovirus = 2; Vibrio parahaemolyticus = 3; Escherichia coli = 4; Others = 0 |
9 | Diagnosis | The duration of “time of consultation” and “time of onset,” measured in hours. | The “time of visit” minus the “time of onset” yields the duration of medical consultation in hours. |
10 | GDP | Gross Domestic Product | Gross Domestic Product in prefectures of Zhejiang province during 2014–2023. Unit: billion yuan |
11 | GDP1 | Gross Domestic Product of the Primary Industry | Gross Domestic Product of the Primary Industry in prefectures of Zhejiang province during 2014–2023. Unit: billion yuan |
12 | GDP2 | Gross Domestic Product of the Secondary Industry | Gross Domestic Product of the Secondary Industry in prefectures of Zhejiang province during 2014–2023. Unit: billion yuan |
13 | GDP3 | Gross Domestic Product of the Tertiary Industry | Gross Domestic Product of the Tertiary Industry in prefectures of Zhejiang province during 2014–2023. Unit: billion yuan |
14 | Average GDP | Gross Domestic Product per capita | Gross Domestic Product per capita in prefectures of Zhejiang province during 2014–2023. Unit: yuan |
15 | Household | Total Number of Households | Total Number of Households in prefectures of Zhejiang province during 2014–2023. Unit: household |
16 | Population | Total Population | Total Population in prefectures of Zhejiang province during 2014–2023. Unit: ten thousand people |
17 | Mortality | Mortality rate | Mortality rate in prefectures of Zhejiang province during 2014–2023. Unit: ‰ |
18 | Employment | Number of Employed Persons | Number of Employed Persons in the Entire Society at the End of the Year in prefectures of Zhejiang province during 2014–2023. Unit: Ten Thousand Persons |
19 | Income disposable | The per capita disposable income of urban and rural residents | The per capita disposable income of urban and rural residents in prefectures of Zhejiang province during 2014–2023. Unit: yuan |
20 | Consumption expenditure | The per capita living consumption expenditure | The per capita living consumption expenditure of urban and rural residents in prefectures of Zhejiang province during 2014–2023. Unit: yuan |
21 | Total agriculture | The total output value of agriculture | The total output value of agriculture, forestry, animal husbandry, and fishery in prefectures of Zhejiang province during 2014–2023. Unit: billion yuan |
22 | Sown area | The area of crop planting | The area of crop planting in prefectures of Zhejiang province during 2014–2023. Unit: thousand hectares |
23 | Total grain yield | Total Grain yield | Total Grain yield in prefectures of Zhejiang province during 2014–2023. Unit: million tons |
24 | Cereal yield | Cereal yield | Cereal yield in prefectures of Zhejiang province during 2014–2023. Unit: tons | Zhejiang Statistical Yearbook
25 | Rapeseed yield | Yield of rapeseed | Yield of rapeseed in prefectures of Zhejiang province during 2014–2023. Unit: ten thousand tons |
26 | Cotton yield | Cotton yield | Cotton yield in prefectures of Zhejiang province during 2014–2023. Unit: tons |
27 | Fruit yield | Fruit yield | Fruit yield in prefectures of Zhejiang province during 2014–2023. Unit: tons |
28 | Meat yield | Meat yield | Meat yield in prefectures of Zhejiang province during 2014–2023. Unit: tons |
29 | Pork yield | Pork yield | Pork yield in prefectures of Zhejiang province during 2014–2023. Unit: tons |
30 | Egg yield | Egg yield | Egg yield in prefectures of Zhejiang province during 2014–2023. Unit: tons |
31 | Milk yield | Milk yield | Milk yield in prefectures of Zhejiang province during 2014–2023. Unit: tons |
32 | Fish yield | Aquatic product yield | Aquatic product yield in prefectures of Zhejiang province during 2014–2023. Unit: ten thousand tons |
33 | Marine fish yield | Marine fish yield | Marine fish yield in prefectures of Zhejiang province during 2014–2023. Unit: ten thousand tons |
34 | Freshwater fish yield | Freshwater fish yield | Freshwater fish yield in prefectures of Zhejiang province during 2014–2023. Unit: ten thousand tons |
35 | Agricultural plastic | The usage of agricultural plastic film | The usage of agricultural plastic film in prefectures of Zhejiang province during 2014–2023. Unit: tons |
36 | Fertilizer | Usage of Agricultural Fertilizers | Usage of Agricultural Fertilizers in prefectures of Zhejiang province during 2014–2023. Unit: tons |
37 | Nitrogen | Nitrogen fertilizer usage | Nitrogen fertilizer usage in prefectures of Zhejiang province during 2014–2023. Unit: tons |
38 | Phosphate | Phosphate fertilizer usage | Phosphate fertilizer usage in prefectures of Zhejiang province during 2014–2023. Unit: tons |
39 | Potassium | Potassium fertilizer application | Potassium fertilizer application in prefectures of Zhejiang province during 2014–2023. Unit: tons |
40 | Compound | Compound fertilizer usage | Compound fertilizer usage in prefectures of Zhejiang province during 2014–2023. Unit: tons |
41 | Pesticide | Pesticide Usage | Pesticide Usage in prefectures of Zhejiang province during 2014–2023. Unit: tons |
42 | Wholesale retail | Total Sales of Goods in Wholesale and Retail Trade Above Designated Size | Total Sales of Goods in Wholesale and Retail Trade Above Designated Size in prefectures of Zhejiang province during 2014–2023. Unit: hundred million yuan |
43 | Income | Total Fiscal Revenue | Total Fiscal Revenue in prefectures of Zhejiang province during 2014–2023. Unit: hundred million yuan | Zhejiang Statistical Yearbook
44 | Expenditure | Local Government Fiscal Expenditure | Local Government Fiscal Expenditure in Zhejiang province during 2014–2023. Unit: hundred million yuan |
45 | Public expenditure | General Public Service Expenditure | General Public Service Expenditure in prefectures of Zhejiang province during 2014–2023. Unit: hundred million yuan |
46 | Temperature | Annual Average Temperature | Annual Average Temperature in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 °C |
47 | Temperature1 | Average Temperature in January | Average Temperature in January in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
48 | Temperature2 | Average Temperature in February | Average Temperature in February in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
49 | Temperature3 | Average Temperature in March | Average Temperature in March in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
50 | Temperature4 | Average Temperature in April | Average Temperature in April in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
51 | Temperature5 | Average Temperature in May | Average Temperature in May in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
52 | Temperature6 | Average Temperature in June | Average Temperature in June in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
53 | Temperature7 | Average Temperature in July | Average Temperature in July in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
54 | Temperature8 | Average Temperature in August | Average Temperature in August in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
55 | Temperature9 | Average Temperature in September | Average Temperature in September in Zhejiang province during 2014–2023. Unit: Celsius |
56 | Temperature10 | Average Temperature in October | Average Temperature in October in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
57 | Temperature11 | Average Temperature in November | Average Temperature in November in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
58 | Temperature12 | Average Temperature in December | Average Temperature in December in prefectures of Zhejiang province during 2014–2023. Unit: Celsius |
59 | Precipitation | Annual Precipitation | Annual Precipitation in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
60 | Precipitation1 | Precipitation in January | Precipitation in January in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
61 | Precipitation2 | Precipitation in February | Precipitation in February in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
62 | Precipitation3 | Precipitation in March | Precipitation in March in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
63 | Precipitation4 | Precipitation in April | Precipitation in April in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
64 | Precipitation5 | Precipitation in May | Precipitation in May in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
65 | Precipitation6 | Precipitation in June | Precipitation in June in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
66 | Precipitation7 | Precipitation in July | Precipitation in July in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
67 | Precipitation8 | Precipitation in August | Precipitation in August in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
68 | Precipitation9 | Precipitation in September | Precipitation in September in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
69 | Precipitation10 | Precipitation in October | Precipitation in October in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
70 | Precipitation11 | Precipitation in November | Precipitation in November in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
71 | Precipitation12 | Precipitation in December | Precipitation in December in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 mm |
72 | Sunshine | Total Annual Sunshine Hours | Total Annual Sunshine Hours in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
73 | Sunshine1 | Sunshine Hours in January | Sunshine Hours in January in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h | Zhejiang Statistical Yearbook
74 | Sunshine2 | Sunshine Hours in February | Sunshine Hours in February in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
75 | Sunshine3 | Sunshine Hours in March | Sunshine Hours in March in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
76 | Sunshine4 | Sunshine Hours in April | Sunshine Hours in April in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
77 | Sunshine5 | Sunshine Hours in May | Sunshine Hours in May in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
78 | Sunshine6 | Sunshine Hours in June | Sunshine Hours in June in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
79 | Sunshine7 | Sunshine Hours in July | Sunshine Hours in July in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
80 | Sunshine8 | Sunshine Hours in August | Sunshine Hours in August in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
81 | Sunshine9 | Sunshine Hours in September | Sunshine Hours in September in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
82 | Sunshine10 | Sunshine Hours in October | Sunshine Hours in October in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
83 | Sunshine11 | Sunshine Hours in November | Sunshine Hours in November in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
84 | Sunshine12 | Sunshine Hours in December | Sunshine Hours in December in prefectures of Zhejiang province during 2014–2023. Unit: 0.1 h |
85 | Water resources | Total Water Resources | Total Water Resources in prefectures of Zhejiang province during 2014–2023. Unit: hundred million cubic meters |
86 | Water supply | Total Water Supply | Total Water Supply in prefectures of Zhejiang province during 2014–2023. Unit: hundred million cubic meters |
87 | Hospitals | Number of Hospitals and Health Centers | Number of Hospitals and Health Centers in prefectures of Zhejiang province during 2014–2023 (Units). |
88 | Hospital beds | Number of Hospital Beds | Number of Hospital Beds in prefectures of Zhejiang province during 2014–2023 (Units). |
89 | Doctors | Number of Doctors | Number of doctors in prefectures of Zhejiang province during 2014–2023 (Units). |
90 | Insurance | Number of Basic Medical Insurance Enrollees | Number of Basic Medical Insurance Enrollees in prefectures of Zhejiang province during 2014–2023. Unit: ten thousand yuan | Zhejiang Statistical Yearbook
91 | Climate policy | Climate Policy Index | Using manual auditing and the deep learning algorithm MacBERT model, we constructed the CCPU index for China at the national, provincial, and major city levels from January 2000 to December 2022. This index is based on 1,755,826 articles from six mainstream newspapers in China: People’s Daily, Guangming Daily, Economic Daily, Global Times, Science and Technology Daily, and China News Service. The research framework consists of six parts: data collection, data cleaning, manual auditing, model construction, index calculation and normalization, and technical validation. | [24]
Table A3. Data description.
Code | Variable | Count | Mean | Std | Min | Max
1 | age | 56,970 | 32.11 | 22.00 | 0 | 99
2 | sex | 56,970 | 0.55 | 0.50 | 0 | 1
3 | occupation | 56,970 | 2.41 | 0.74 | 1 | 4
4 | food | 51,607 | 4.82 | 2.91 | 1 | 10
5 | purchase | 49,570 | 4.44 | 1.77 | 1 | 6
6 | eat place | 51,468 | 3.33 | 1.24 | 1 | 5
7 | bacteria | 56,970 | 3.25 | 2.00 | 1 | 6
8 | diagnosis | 56,970 | 27.45 | 41.32 | 0 | 2221
9 | City code | 56,970 | 3305.84 | 3.31 | 3301 | 3311
10 | year | 56,970 | 2019.34 | 2.67 | 2014 | 2023
11 | GDP | 42,749 | 6181.53 | 4608.54 | 971.47 | 18,753
12 | GDP1 | 42,749 | 193.87 | 92.02 | 21.05 | 382
13 | GDP2 | 42,749 | 2608.88 | 1731.65 | 369.93 | 7413
14 | GDP3 | 42,749 | 3112.68 | 2561.17 | 457.63 | 12,287.31
15 | average GDP | 42,749 | 98,051.57 | 31,005.40 | 39,721 | 167,134
16 | household | 48,937 | 1,515,721.00 | 783,651.17 | 339,224 | 2,664,533
17 | population | 48,444 | 452.29 | 253.31 | 76.66 | 846.75
18 | mortality | 41,061 | 5.92 | 0.79 | 4.5 | 8
19 | employment | 42,749 | 385.57 | 201.08 | 72.14 | 759.68
20 | Income disposable | 41,332 | 48,961.18 | 11,069.14 | 22,426 | 70,281
21 | consumption expenditure | 41,332 | 31,268.95 | 7043.39 | 13,875 | 46,440
22 | total agriculture | 42,749 | 310.34 | 145.54 | 30.29 | 589.82
23 | sown area | 42,749 | 201.09 | 67.76 | 13.45 | 320.18
24 | total grain yield | 44,594 | 56.25 | 24.41 | 2.48 | 122.34
25 | cereal yield | 42,749 | 514,214.28 | 213,044.53 | 16,832 | 1,158,805
26 | rapeseed yield | 43,252 | 2.33 | 1.77 | 0.13 | 6.62
27 | cotton yield | 40,022 | 880.62 | 1323.54 | 1 | 9488
28 | fruit yield | 42,749 | 699,250.71 | 397,609.49 | 63,002 | 1,499,253
29 | meat yield | 42,749 | 107,953.67 | 56,592.71 | 4686 | 322,988
30 | pork yield | 42,749 | 74,989.71 | 50,859.05 | 1524 | 250,047
31 | egg yield | 42,749 | 36,035.15 | 24,953.02 | 189 | 130,870
32 | milk yield | 40,379 | 17,794.60 | 16,598.84 | 2 | 65,073
33 | freshwater fish yield | 43,252 | 12.15 | 13.27 | 0.06 | 60.46
34 | agricultural plastic | 43,252 | 6505.72 | 3589.66 | 325 | 13,567
35 | fertilizer | 42,749 | 70,619.17 | 25,618.09 | 3907 | 112,811
36 | nitrogen | 43,252 | 31,332.97 | 16,029.53 | 1500 | 78,100
37 | phosphate | 43,854 | 6618.99 | 3886.90 | 0 | 14,700
38 | potassium | 43,252 | 5428.94 | 2510.99 | 100 | 12,500
39 | compound | 43,252 | 26,427.04 | 14,814.84 | 1300 | 56,000
40 | pesticide | 43,252 | 3841.39 | 1705.31 | 324 | 7510
41 | wholesale retail | 42,464 | 8950.83 | 11,985.53 | 312.34 | 46,799.786
42 | income | 42,464 | 1189.07 | 1186.74 | 119.02 | 4590.08
43 | expenditure | 42,464 | 863.83 | 595.26 | 72.23 | 2542.09
44 | public expenditure | 42,464 | 86.45 | 49.71 | 11.12 | 210.01
45 | temperature1 | 41,332 | 73.22 | 17.02 | 35 | 115
46 | temperature2 | 41,332 | 85.26 | 22.42 | 49 | 137
47 | temperature3 | 41,332 | 131.29 | 12.83 | 105 | 160
48 | temperature4 | 41,332 | 177.92 | 10.35 | 148 | 204
49 | temperature5 | 41,332 | 222.94 | 13.27 | 190 | 254
50 | temperature6 | 41,332 | 254.28 | 10.57 | 226 | 280
51 | temperature7 | 41,332 | 292.79 | 15.19 | 256 | 329
52 | temperature8 | 41,332 | 294.70 | 13.20 | 259 | 334
53 | temperature9 | 41,332 | 255.47 | 12.77 | 232 | 286
54 | temperature10 | 41,332 | 202.73 | 12.30 | 180 | 234
55 | temperature11 | 41,332 | 151.80 | 14.48 | 126 | 184
56 | temperature12 | 41,332 | 87.19 | 18.41 | 48 | 134
57 | temperature | 41,332 | 186.07 | 7.73 | 168 | 200
58 | precipitation1 | 41,332 | 795.06 | 500.62 | 55 | 2153
59 | precipitation2 | 41,332 | 897.30 | 538.90 | 169 | 2764
60 | precipitation3 | 41,332 | 1376.98 | 522.21 | 467 | 2879
61 | precipitation4 | 41,332 | 1187.64 | 602.80 | 369 | 3576
62 | precipitation5 | 41,332 | 1746.45 | 820.46 | 523 | 4241
63 | precipitation6 | 41,332 | 2764.66 | 995.91 | 877 | 5882
64 | precipitation7 | 41,332 | 1894.35 | 1338.43 | 208 | 6353
65 | precipitation8 | 41,332 | 2018.23 | 1231.23 | 223 | 4990
66 | precipitation9 | 41,332 | 1745.05 | 1166.36 | 62 | 6069
67 | precipitation10 | 41,332 | 823.14 | 780.02 | 30 | 3577
68 | precipitation11 | 41,332 | 845.91 | 531.14 | 57 | 2804
69 | precipitation12 | 41,332 | 576.98 | 445.65 | 45 | 2048
70 | precipitation | 41,332 | 16,680.64 | 2846.76 | 11,892 | 25,596
71 | sunshine1 | 41,332 | 973.60 | 387.02 | 440 | 1957
72 | sunshine2 | 41,332 | 954.48 | 395.54 | 139 | 1748
73 | sunshine3 | 41,332 | 1179.75 | 250.30 | 601 | 1810
74 | sunshine4 | 41,332 | 1405.16 | 333.94 | 639 | 2097
75 | sunshine5 | 41,332 | 1356.47 | 314.98 | 707 | 2102
76 | sunshine6 | 41,332 | 1057.39 | 302.46 | 334 | 1832
77 | sunshine7 | 41,332 | 1866.88 | 609.40 | 626 | 2918
78 | sunshine8 | 41,332 | 2061.95 | 511.96 | 827 | 3140
79 | sunshine9 | 41,332 | 1499.85 | 387.90 | 690 | 2451
80 | sunshine10 | 41,332 | 1328.47 | 385.16 | 318 | 2392
81 | sunshine11 | 41,332 | 993.67 | 311.68 | 300 | 1698
82 | sunshine12 | 41,332 | 1138.37 | 430.89 | 324 | 1917
83 | sunshine | 41,332 | 15,811.29 | 1932.72 | 10,986 | 19,961
84 | water resource | 41,332 | 110.21 | 58.08 | 6.93 | 257.23
85 | water supply | 41,332 | 17.36 | 7.77 | 1.45 | 57.38
86 | hospitals | 42,749 | 160.27 | 100.40 | 28 | 403
87 | hospital beds | 42,749 | 30,391.50 | 20,020.09 | 3851 | 87,950
88 | doctors | 42,749 | 20,681.80 | 13,479.55 | 2609 | 57,455
89 | insurance | 42,749 | 324.46 | 193.54 | 5.21 | 954.43
90 | climate policy | 56,970 | 0.86 | 0.70 | 0 | 10.78
Table A4. Performance metrics of Random Forest and XGBoost classifiers on training and validation sets.
Class/Metric | Random Forest (Train) | XGBoost (Train) | Random Forest (Valid) | XGBoost (Valid)
Class 0 Precision | 0.8 | 0.94 | 0 | 0
Class 0 Recall | 0.05 | 0.42 | 0 | 0
Class 0 F1-score | 0.1 | 0.58 | 0 | 0
Class 0 Support | 77 | 77 | 11 | 11
Class 1 Precision | 0.84 | 0.85 | 0.72 | 0.73
Class 1 Recall | 0.96 | 0.97 | 0.86 | 0.86
Class 1 F1-score | 0.89 | 0.9 | 0.78 | 0.79
Class 1 Support | 15,264 | 15,264 | 3864 | 3864
Class 2 Precision | 0.9 | 0.9 | 0.43 | 0.41
Class 2 Recall | 0.49 | 0.55 | 0.17 | 0.18
Class 2 F1-score | 0.64 | 0.69 | 0.25 | 0.25
Class 2 Support | 2484 | 2484 | 620 | 620
Class 3 Precision | 0.89 | 0.91 | 0.69 | 0.7
Class 3 Recall | 0.83 | 0.84 | 0.64 | 0.64
Class 3 F1-score | 0.86 | 0.87 | 0.66 | 0.67
Class 3 Support | 9142 | 9142 | 2245 | 2245
Class 4 Precision | 0.92 | 0.95 | 0.45 | 0.5
Class 4 Recall | 0.36 | 0.58 | 0.14 | 0.29
Class 4 F1-score | 0.52 | 0.72 | 0.21 | 0.29
Class 4 Support | 849 | 849 | 215 | 215
Accuracy | 0.86 | 0.87 | 0.7 | 0.71
Macro Avg Precision | 0.87 | 0.91 | 0.46 | 0.47
Macro Avg Recall | 0.54 | 0.67 | 0.36 | 0.38
Macro Avg F1-score | 0.6 | 0.75 | 0.38 | 0.4
Macro Avg Support | 27,816 | 27,816 | 6955 | 6955
Weighted Avg Precision | 0.86 | 0.88 | 0.68 | 0.68
Weighted Avg Recall | 0.86 | 0.87 | 0.7 | 0.71
Weighted Avg F1-score | 0.85 | 0.87 | 0.68 | 0.68
Weighted Avg Support | 27,816 | 27,816 | 6955 | 6955
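The macro and weighted averages in Table A4 can be reproduced, up to rounding, from the per-class rows. The minimal Python sketch below does so for the Random Forest validation recall column. The helper names `macro_average` and `weighted_average` are illustrative and are not taken from the authors' code.

```python
# Per-class recall and support for Random Forest (validation), copied from Table A4.
recall = [0.00, 0.86, 0.17, 0.64, 0.14]
support = [11, 3864, 620, 2245, 215]


def macro_average(values):
    """Unweighted mean over classes: every class counts equally."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Support-weighted mean: classes with more cases count more."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


macro = macro_average(recall)
weighted = weighted_average(recall, support)

print(f"macro recall    = {macro:.2f}")
print(f"weighted recall = {weighted:.2f}")
```

The two figures differ because the classes are imbalanced: Class 1 accounts for 3864 of the 6955 validation cases and dominates the weighted average, whereas the low recall on Classes 0, 2, and 4 pulls the macro average down. Support-weighted recall is also, by definition, the overall accuracy.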
Table A5. SARIMA Results.
            Coef      Std Err   z         P > |z|   [0.025    0.975]
ar.L1       0.2408    0.351     0.687     0.492     −0.446    0.928
ma.L1       0.1107    0.383     0.289     0.773     −0.640    0.861
ar.S.L12    −0.1659   0.141     −1.178    0.239     −0.442    0.110
ma.S.L12    −0.5302   0.141     −3.769    0.000     −0.806    −0.254
sigma2      0.3394    0.039     8.746     0.000     0.263     0.415

Ljung–Box (L1) (Q): 0.07        Jarque–Bera (JB): 18.23
Prob(Q): 0.80                   Prob(JB): 0.00
Heteroskedasticity (H): 1.19    Skew: −0.51
Prob(H) (two-sided): 0.63       Kurtosis: 4.90
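The z, P > |z|, and interval columns of Table A5 can be checked from the coefficient and standard error alone. The sketch below assumes that the table follows the usual Wald convention (z = coefficient / standard error, with a normal-based 95% interval); the source does not state this explicitly. The function name `wald_summary` is hypothetical. Because the inputs are the rounded values printed in the table, the results agree with it only to about two decimal places.

```python
from statistics import NormalDist

# (coefficient, standard error) pairs copied from Table A5.
params = {
    "ar.L1": (0.2408, 0.351),
    "ma.L1": (0.1107, 0.383),
    "ar.S.L12": (-0.1659, 0.141),
    "ma.S.L12": (-0.5302, 0.141),
}


def wald_summary(coef, std_err, level=0.95):
    """Return the z statistic, two-sided p-value, and confidence bounds."""
    normal = NormalDist()  # standard normal, assumed reference distribution
    z = coef / std_err
    p_value = 2 * (1 - normal.cdf(abs(z)))
    half_width = normal.inv_cdf(0.5 + level / 2) * std_err
    return z, p_value, coef - half_width, coef + half_width


for name, (coef, se) in params.items():
    z, p, lower, upper = wald_summary(coef, se)
    print(f"{name:9s} z={z:7.3f} p={p:.3f} CI=[{lower:.3f}, {upper:.3f}]")
```

Among the four autoregressive and moving-average terms, only the seasonal moving-average term ma.S.L12 has an interval that excludes zero, in line with the P > |z| column.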

References

  1. Holst, M.M. Contributing Factors of Foodborne Illness Outbreaks—National Outbreak Reporting System, United States, 2014–2022. MMWR Surveill. Summ. 2025, 74, 1–12.
  2. Mixão, V.; Pinto, M.; Brendebach, H.; Sobral, D.; Santos, J.D.; Radomski, N.; Uldall, A.S.M.; Bomba, A.; Pietsch, M.; Bucciacchio, A.; et al. Multi-country and intersectoral assessment of cluster congruence between pipelines for genomics surveillance of foodborne pathogens. Nat. Commun. 2025, 16, 3961.
  3. Sadilek, A.; Caty, S.; DiPrete, L.; Mansour, R.; Schenk, T., Jr.; Bergtholdt, M.; Jha, A.; Ramaswami, P.; Gabrilovich, E. Machine-learned epidemiology: Real-time detection of foodborne illness at scale. NPJ Digit. Med. 2018, 1, 36.
  4. Chen, Y.; Wan, G.; Song, J.; Dai, J.; Shi, W.; Wang, L. Food Safety Practices of Food Handlers in China and their Correlation with Self-reported Foodborne Illness. J. Food Prot. 2024, 87, 100202.
  5. Xue, J.; Zhang, W. Understanding China’s food safety problem: An analysis of 2387 incidents of acute foodborne illness. Food Control 2013, 30, 311–317.
  6. Thaivalappil, A.; Young, I.; Paco, C.; Jeyapalan, A.; Papadopoulos, A. Food safety and the older consumer: A systematic review and meta-regression of their knowledge and practices at home. Food Control 2020, 107, 106782.
  7. He, Y.; Wang, J.; Zhang, R.; Chen, L.; Zhang, H.; Qi, X.; Chen, J. Epidemiology of foodborne diseases caused by Salmonella in Zhejiang Province, China, between 2010 and 2021. Front. Public Health 2023, 11, 1127925.
  8. Qi, X.; Alifu, X.; Chen, J.; Luo, W.; Wang, J.; Yu, Y.; Zhang, R. Descriptive study of foodborne disease using disease monitoring data in Zhejiang Province, China, 2016–2020. BMC Public Health 2022, 22, 1831.
  9. Duchenne-Moutien, R.A.; Neetoo, H. Climate Change and Emerging Food Safety Issues: A Review. J. Food Prot. 2021, 84, 1884–1897.
  10. Li, W.; Huang, T.; Liu, C.; Wushouer, H.; Yang, X.; Wang, R.; Xia, H.; Li, X.; Qiu, S.; Chen, S.; et al. Changing climate and socioeconomic factors contribute to global antimicrobial resistance. Nat. Med. 2025, 31, 1798–1808.
  11. Wang, Z.; Huang, C.; Liu, Y.; Chen, J.; Yin, R.; Jia, C.; Kang, X.; Zhou, X.; Liao, S.; Jin, X.; et al. Salmonellosis outbreak archive in China: Data collection and assembly. Sci. Data 2024, 11, 244.
  12. Archer, E.J.; Baker-Austin, C.; Osborn, T.J.; Jones, N.R.; Martínez-Urtaza, J.; Trinanes, J.; Oliver, J.D.; González, F.J.C.; Lake, I.R. Climate warming and increasing Vibrio vulnificus infections in North America. Sci. Rep. 2023, 13, 3893.
  13. Simpson, R.B.; Zhou, B.; Naumova, E.N. Seasonal synchronization of foodborne outbreaks in the United States, 1996–2017. Sci. Rep. 2020, 10, 17500.
  14. Liu, Y.; Liu, F.; Zhang, J.; Gao, J. Insights into the nature of food safety issues in Beijing through content analysis of an Internet database of food safety incidents in China. Food Control 2015, 51, 206–211.
  15. Pires, S.M.; Desta, B.N.; Mughini-Gras, L.; Mmbaga, B.T.; Fayemi, O.E.; Salvador, E.M.; Gobena, T.; Majowicz, S.E.; Hald, T.; Hoejskov, P.S.; et al. Burden of foodborne diseases: Think global, act local. Curr. Opin. Food Sci. 2021, 39, 152–159.
  16. Lake, I.R. Food-borne disease and climate change in the United Kingdom. Environ. Health 2017, 16, 117.
  17. Jin, C.; Levi, R.; Liang, Q.; Renegar, N.; Springs, S.; Zhou, J.; Zhou, W. Testing at the Source: Analytics-Enabled Risk-Based Sampling of Food Supply Chains in China. Manag. Sci. 2021, 67, 2985–2996.
  18. Gao, Q.; Levi, R.; Renegar, N. The Link between Food Safety and Zoonotic Disease Risks at Wholesale and Wet Markets in China. SSRN Electron. J. 2020.
  19. Jin, C.; Levi, R.; Liang, Q.; Renegar, N.; Zhou, J. Food safety inspection and the adoption of traceability in aquatic wholesale markets: A game-theoretic model and empirical evidence. J. Integr. Agric. 2021, 20, 2807–2819.
  20. Levi, R.; Singhvi, S.; Zheng, Y. Economically Motivated Adulteration in Farming Supply Chains. Manag. Sci. 2020, 66, 209–226.
  21. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  22. Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2, 18–22.
  23. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Available online: https://dl.acm.org/doi/10.1145/2939672.2939785 (accessed on 21 June 2025).
  24. Ma, Y.; Liu, Z.; Ma, D.; Zhai, P.; Guo, K.; Zhang, D.; Ji, Q. A news-based climate policy uncertainty index for China. Sci. Data 2023, 10, 881.
Figure 1. Lasso feature selection coefficients for predicting foodborne disease categories. The chart displays the magnitude and direction of influence of each variable based on Lasso regression coefficients, where negative values indicate inverse relationships with foodborne disease types.
Figure 2. Distribution of foodborne disease cases by key characteristics in Zhejiang Province, 2014–2023. This figure presents the distribution of reported foodborne disease cases in Zhejiang Province from 2014 to 2023, classified by four key categorical variables: eating venue (Eat Place), food purchase source (Purchase), patient occupation (Occupation), and implicated food type (Food). Coding schemes for each variable are provided in Appendix A, Table A2.
Figure 3. Annual composition of foodborne disease categories in Zhejiang Province (2014–2023). Category 1: Salmonella; Category 2: Norovirus; Category 3: Vibrio parahaemolyticus; Category 4: Escherichia coli; Category 0: Other types of bacteria.
Figure 4. Performance comparison of XGBoost and Random Forest classifiers via confusion matrices on training and validation sets. The Y-axis represents the true class labels (i.e., the actual categories) for the confusion matrix.
Figure 5. Factors influencing the types of foodborne diseases in Zhejiang Province, China. Feature importance rankings from Random Forest and XGBoost models.
Figure 6. Time pattern of foodborne disease types. (a) Distribution of different bacterial types in foodborne diseases across months. Category 1: Salmonella; Category 2: Norovirus; Category 3: Vibrio parahaemolyticus; Category 4: Escherichia coli; Category 0: Other types of bacteria. (b) Temporal patterns of different bacterial types. (c) Decomposition of the time series of foodborne disease counts. The Y-axis indicates “Standardized Bacteria Count”. The four subplots represent the observed series (Observed), long-term trend (Trend), seasonal component (Seasonal), and random residual (Residual), respectively. Both trend and seasonal components have been standardized for comparative purposes.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Jin, C.; Qi, X.; Wang, J.; Chen, L.; Chen, J.; Yin, H. Identifying Key Drivers of Foodborne Diseases in Zhejiang, China: A Machine Learning Approach. Foods 2025, 14, 2857. https://doi.org/10.3390/foods14162857
