Article
Peer-Review Record

AI-Assisted Differentiation of Dengue and Chikungunya Using Big, Imbalanced Epidemiological Data

Trop. Med. Infect. Dis. 2026, 11(2), 40; https://doi.org/10.3390/tropicalmed11020040
by Thanh Huy Nguyen 1 and Nguyen Quoc Khanh Le 2,3,4,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 23 October 2025 / Revised: 16 January 2026 / Accepted: 29 January 2026 / Published: 30 January 2026

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript addresses an important public-health problem and leverages one of the largest openly available arboviral datasets. The study is timely, relevant, and potentially impactful. However, several major issues require clarification and revision before the manuscript can be considered for publication.

 

  1. Major Comments

1.1. Data Quality and Missing Clinical Information

Several clinical feature counts for chikungunya (e.g., fever, headache, myalgia) are implausibly low given the known symptomatology of the disease. This strongly suggests either misclassification or preprocessing errors resulting in missing values being treated as “absence.”

Please re-examine the raw dataset and provide a detailed clarification of how missing/ignored values were encoded and processed. This is crucial to ensure that the model is not learning artefacts from incorrect symptom coding.

In addition, important diagnostic features (e.g., days since symptom onset, severity markers) are absent. This should be discussed more prominently in the Methods and not only in the Discussion.

 

1.2. Handling of Class Imbalance

The use of SMOTE on a very large, predominantly categorical dataset requires stronger justification. SMOTE can generate unrealistic synthetic records when applied to categorical variables and may distort epidemiological patterns.

Please clarify why SMOTE was chosen over more appropriate alternatives (e.g., SMOTE-NC, ADASYN, class-weighting, undersampling).

A performance comparison (with and without SMOTE) is recommended.

 

1.3. Potential Data Leakage Through Geographic Features

Feature importance analysis shows that several location-based variables (e.g., COMUNINF, ID_MN_RESI, ID_REGIONA) are among the top predictors. These variables may allow the model to identify regional epidemiological patterns rather than true patient-level clinical differences, thereby inflating performance.

Please evaluate model performance after removing all geographical identifiers and discuss the implications for real-world generalizability.

 

1.4. External Validation Dataset

The description of the external test set is insufficient. Key information is missing:

- Years covered

- Geographic regions

- Class distribution

- Whether municipalities overlap with the training data

Please provide a full description of the external dataset to support claims of generalizability.

 

1.5. Model Interpretability

For clinical adoption, it is essential to understand which features drive predictions.

Please include interpretability analyses, such as SHAP values or feature contribution plots, and provide a clinical explanation of the top predictors for each class.

 

1.6. Deep Learning Model Justification

The ANN architecture is relatively shallow and lacks justification. No details are provided on hyperparameter optimization or alternative architectures.

Please expand the rationale behind the chosen model, and indicate whether other architectures (e.g., 1D-CNN, TabNet, Transformers) were considered.

 

  2. Minor Comments

2.1. English Language and Grammar

The manuscript would benefit from professional English language editing to correct grammatical errors, improve readability, and reduce repetition.

2.2. Figures and Image Quality

Figures (particularly Figures 2 and 3) require higher resolution and clearer axis labels to meet MDPI publication standards.

2.3. Table Consistency

In Table 3, several percentages and frequencies—especially for chikungunya symptoms—appear inconsistent with known epidemiology. Please verify and correct as needed.

2.4. Citations

Some statements would benefit from clearer linkage to the cited literature. Please ensure that claims, especially regarding previous multiclass work, are directly supported by the referenced studies.

I recommend that the authors cite and discuss the recent Lancet Regional Health Southeast Asia article (https://www.thelancet.com/journals/lansea/article/PIIS2772-3682(25)00101-5/fulltext), which provides updated insights into arboviral epidemiology and diagnostic challenges. Incorporating this reference would strengthen the contextual foundation of the Introduction and better situate the current study within the most recent literature.

  3. Recommendation

Major Revision

The manuscript has significant potential, but the issues outlined above—especially those related to data quality, feature leakage, imbalance handling, and interpretability—must be addressed before the work can be considered for publication.

Comments on the Quality of English Language

na

Author Response

Comment 1: Data Quality and Missing Clinical Information.

Several clinical feature counts for chikungunya (e.g., fever, headache, myalgia) are implausibly low given the known symptomatology of the disease. This strongly suggests either misclassification or preprocessing errors resulting in missing values being treated as “absence.” Please re-examine the raw dataset and provide a detailed clarification of how missing/ignored values were encoded and processed. This is crucial to ensure that the model is not learning artefacts from incorrect symptom coding.

Response: We thank the reviewer for this valuable suggestion. We re-examined the original article and found that its authors emphasized on page 7 that symptoms are absent from records of confirmed chikungunya cases, and that this directly affects the reported percentages of these symptoms. In the codebook, the authors state that they encoded the presence of a symptom as 1 and its absence as 2. When checking the raw dataset, however, we found that ignored/missing values were also encoded as 9. We processed them as one class for each clinical feature since this is a common phenomenon in the daily practice of epidemiologists in the field.
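The effect of the 1/2/9 coding on symptom frequencies can be sketched as follows. This is a minimal illustration, not the authors' preprocessing: the function names, the toy `fever_codes` list, and the choice to map any unexpected code to "missing" are all assumptions made for the example.

```python
from collections import Counter

# Coding scheme described in the response: 1 = present, 2 = absent, 9 = ignored/missing.
CODE_LABELS = {1: "present", 2: "absent", 9: "missing"}


def recode(value):
    """Map a raw symptom code to a label; unexpected codes are treated as missing (assumption)."""
    return CODE_LABELS.get(value, "missing")


def prevalence(codes, missing_as_absent):
    """Share of records with the symptom present.

    If missing_as_absent is True, missing records stay in the denominator
    (i.e. they silently count as 'absent'); otherwise only records with a
    known status are counted.
    """
    counts = Counter(recode(c) for c in codes)
    present = counts["present"]
    if missing_as_absent:
        denominator = len(codes)
    else:
        denominator = counts["present"] + counts["absent"]
    return present / denominator if denominator else float("nan")


# Hypothetical fever codes for ten records (not taken from the dataset).
fever_codes = [1, 1, 9, 9, 9, 2, 1, 9, 9, 9]

naive = prevalence(fever_codes, missing_as_absent=True)
known_only = prevalence(fever_codes, missing_as_absent=False)
print(f"missing counted as absent: {naive:.2f}")
print(f"known status only:         {known_only:.2f}")
```

Keeping "missing" as its own category, or reporting both figures side by side, is one way to make the treatment of code 9 explicit in a frequency table.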

Comment 2: In addition, important diagnostic features (e.g., days since symptom onset, severity markers) are absent. This should be discussed more prominently in the Methods and not only in the Discussion.

Response: We appreciate the reviewer for this important suggestion. We re-examined the original article and found that the authors did not collect these features in their dataset. Therefore, we emphasized the absence of important diagnostic features in the Methods section (page 4, line 185-187) as follows:

“One important characteristic of this dataset is that, some important diagnostic features such as days from symptom onset, and severity markers were not collected by the researchers”

Comment 3: Handling of Class Imbalance.

The use of SMOTE on a very large, predominantly categorical dataset requires stronger justification. SMOTE can generate unrealistic synthetic records when applied to categorical variables and may distort epidemiological patterns. Please clarify why SMOTE was chosen over more appropriate alternatives (e.g., SMOTE-NC, ADASYN, class-weighting, undersampling).

Response: We thank the reviewer for this important suggestion. We also tried other techniques besides SMOTE, such as random oversampling and random undersampling, but performance was very low (recall, precision, and F1-score were all below 0.50). We therefore chose SMOTE as the main approach to handling class imbalance, despite its potential limitations, because the minority class in this dataset, chikungunya, is the class of interest that we want to predict, as emphasized in the revised manuscript (page 5, line 226-229) as follows:

“the imbalanced data was handled using various techniques, including random undersampling, random oversampling, and the Synthetic Minority Oversampling Technique (SMOTE) [3], since the dataset used in this study has the minority class of interest (chikungunya cases)”.
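The reviewer's concern about categorical variables can be illustrated with the interpolation step at the core of SMOTE, which places a synthetic point on the line segment between a minority record and one of its nearest neighbours. The sketch below is a simplified, hypothetical illustration (toy records, no neighbour search), not the authors' pipeline or a library implementation.

```python
import random


def smote_interpolate(x, neighbor, rng):
    """Return one synthetic point on the segment between x and neighbor."""
    lam = rng.random()  # uniform in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


# Hypothetical minority-class records: [fever, headache, myalgia],
# coded 1 = present, 2 = absent, 9 = ignored/missing.
record = [1, 2, 9]
neighbor = [2, 2, 1]
valid_codes = {1, 2, 9}

rng = random.Random(0)
synthetic = smote_interpolate(record, neighbor, rng)

# Values that fall between category codes do not correspond to any real category.
invalid = [v for v in synthetic if v not in valid_codes]
print("synthetic record:", synthetic)
print("values outside the codebook:", invalid)
```

Whenever the two records disagree on a feature, the interpolated value generally lands between two codes. Variants designed for mixed data, such as SMOTE-NC, avoid this by assigning categorical features a category taken from the neighbours rather than an interpolated number.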

Comment 4: A performance comparison (with and without SMOTE) is recommended.

Response: We totally agree with the reviewer's assessment. We added the performance comparison of our models with and without SMOTE in Table 4.

Comment 5: Potential Data Leakage Through Geographic Features

Feature importance analysis shows that several location-based variables (e.g., COMUNINF, ID_MN_RESI, ID_REGIONA) are among the top predictors. These variables may allow the model to identify regional epidemiological patterns rather than true patient-level clinical differences, thereby inflating performance. Please evaluate model performance after removing all geographical identifiers and discuss the implications for real-world generalizability.

Response: We agree with the reviewer's assessment. We ran the machine learning models again using only 23 features (age, gender, race, fever, myalgia, headache, rash, vomit, nausea, back pain, conjunctivitis, arthritis, arthralgia, petechiae, tourniquet test, retro-orbital pain, diabetes, liver disease, kidney disease, hypertension, peptic acid disease, and autoimmune disease), and achieved extremely low performance, with macro-averaged recall, precision, and F1-score all below 0.50, as presented in the table below:

ML models | Recall | Precision | F1-score | AUC
RF        | 0.36   | 0.45      | 0.34     | 0.69
DT        | 0.36   | 0.43      | 0.34     | 0.68
AD        | 0.34   | 0.39      | 0.28     | 0.65
GB        | 0.36   | 0.41      | 0.32     | 0.70
XG        | 0.35   | 0.39      | 0.32     | 0.47
KNN       | 0.34   | 0.39      | 0.29     | 0.67

Therefore, we decided to add epidemiological features in the models to see whether they could improve the prediction capability of machine learning and deep learning algorithms.

Comment 6: External Validation Dataset

The description of the external test set is insufficient. Key information is missing:

- Years covered

- Geographic regions

- Class distribution

- Whether municipalities overlap with the training data

Please provide a full description of the external dataset to support claims of generalizability.

Response: We agree with the reviewer's assessment. The external validation set should be called the “internal test set”: it was split off once from the original dataset and kept untouched during training. The internal test set therefore has similar features to the training set, including years, geographic regions, municipalities and class distribution. We added the description of the internal test set in the Methods section (page 5, line 221-224) as follows:

 “The testing data is also called the internal test set, which has similar features to the training set, including years, geographic regions, municipalities and class distribution. This test set is kept separately during model training”
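For contrast with an internal test set that shares municipalities with the training data, a group-wise hold-out keeps every record from a given municipality on one side of the split. The sketch below is a hypothetical illustration: `ID_MN_RESI` is the municipality variable named in the review, but the records, the helper `split_by_group`, and the 40% hold-out fraction are assumptions, and this is not what the authors report doing.

```python
import random


def split_by_group(records, group_key, test_fraction, seed=0):
    """Hold out whole groups so that no group appears in both train and test."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    held_out = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in held_out]
    test = [r for r in records if r[group_key] in held_out]
    return train, test


# Toy records with made-up municipality codes.
records = [
    {"ID_MN_RESI": "A", "label": "dengue"},
    {"ID_MN_RESI": "A", "label": "chikungunya"},
    {"ID_MN_RESI": "B", "label": "dengue"},
    {"ID_MN_RESI": "C", "label": "discarded"},
    {"ID_MN_RESI": "C", "label": "dengue"},
    {"ID_MN_RESI": "D", "label": "chikungunya"},
    {"ID_MN_RESI": "E", "label": "dengue"},
]

train, test = split_by_group(records, "ID_MN_RESI", test_fraction=0.4)
train_groups = {r["ID_MN_RESI"] for r in train}
test_groups = {r["ID_MN_RESI"] for r in test}
print("train municipalities:", sorted(train_groups))
print("test municipalities: ", sorted(test_groups))
print("overlap:", sorted(train_groups & test_groups))
```

A split of this kind is closer to the geographic generalization the reviewer asks about, although it is still not a substitute for data from a different source or period.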

 

Comment 7: Model Interpretability

For clinical adoption, it is essential to understand which features drive predictions. Please include interpretability analyses, such as SHAP values or feature contribution plots, and provide a clinical explanation of the top predictors for each class.

Response: We thank the reviewer for this valuable suggestion. We agreed with the reviewer and tried to run the SHAP analysis as suggested. However, due to package incompatibility issues, we could not run SHAP analysis for our proposed models to investigate how much each feature contributed to the model’s performance and prediction for each class. We see the lack of interpretability analysis of top predictors as one of our major limitations and will apply this analysis in future work, as stated in the revised manuscript (page 12-13, line 426-429) as follows:

“In addition, lack of interpretability analysis of top prediction features for clinical adoption is another limitation of this study, and we hope to apply SHAP analysis in future work to understand more which features contribute most to disease prediction.”   
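Where SHAP cannot be run, a simpler model-agnostic check is permutation importance: shuffle one feature at a time and measure how much a score drops. The sketch below shows that technique, not SHAP, and it uses a hypothetical rule-based stand-in (`toy_predict`) and toy rows rather than the authors' models or data.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    hits = sum(predict(r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(predict, rows, labels, features, seed=0):
    """Drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = {}
    for name in features:
        column = [r[name] for r in rows]
        rng.shuffle(column)
        # Copy each row with only the chosen feature replaced by a shuffled value.
        permuted = [dict(r, **{name: v}) for r, v in zip(rows, column)]
        importances[name] = baseline - accuracy(predict, permuted, labels)
    return importances


def toy_predict(row):
    """Stand-in model (assumption): looks only at arthralgia, ignores rash."""
    return "chikungunya" if row["arthralgia"] == 1 else "dengue"


# Toy rows coded 1 = present, 2 = absent; labels follow the toy rule exactly.
pairs = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 2), (2, 1)]
rows = [{"arthralgia": a, "rash": r} for a, r in pairs]
labels = [toy_predict(r) for r in rows]

importance = permutation_importance(toy_predict, rows, labels, ["arthralgia", "rash"])
print(importance)
```

A feature the model never uses, such as `rash` here, shows zero importance by construction. Unlike SHAP, this gives one global score per feature rather than per-prediction contributions.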

Comment 8: Deep Learning Model Justification

The ANN architecture is relatively shallow and lacks justification. No details are provided on hyperparameter optimization or alternative architectures. Please expand the rationale behind the chosen model, and indicate whether other architectures (e.g., 1D-CNN, TabNet, Transformers) were considered.

Response: We really appreciate the reviewer's valuable suggestion. We also applied the 1D-CNN and TabNet architectures as suggested, but not a Transformer because of limited computational resources. Unfortunately, the suggested architectures did not perform as well as expected. For example, the TabNet model reached a validation accuracy of only 0.65297 after 3 hours of training for 30 epochs, as follows:

epoch 0  | loss: 0.73888 | val_0_accuracy: 0.63981 |  0:05:04s

epoch 1  | loss: 0.73685 | val_0_accuracy: 0.64009 |  0:09:54s

epoch 2  | loss: 0.73661 | val_0_accuracy: 0.63817 |  0:14:47s

epoch 3  | loss: 0.73669 | val_0_accuracy: 0.64341 |  0:19:31s

epoch 4  | loss: 0.73553 | val_0_accuracy: 0.64801 |  0:24:18s

epoch 5  | loss: 0.73507 | val_0_accuracy: 0.64266 |  0:29:12s

epoch 6  | loss: 0.73485 | val_0_accuracy: 0.65053 |  0:33:59s

epoch 7  | loss: 0.73471 | val_0_accuracy: 0.6319  |  0:38:49s

epoch 8  | loss: 0.73456 | val_0_accuracy: 0.64986 |  0:43:42s

epoch 9  | loss: 0.73442 | val_0_accuracy: 0.6453  |  0:48:26s

epoch 10 | loss: 0.7343  | val_0_accuracy: 0.65084 |  0:53:08s

epoch 11 | loss: 0.73424 | val_0_accuracy: 0.64915 |  0:57:56s

epoch 12 | loss: 0.73428 | val_0_accuracy: 0.65048 |  1:03:03s

epoch 13 | loss: 0.73413 | val_0_accuracy: 0.65021 |  1:08:19s

epoch 14 | loss: 0.73409 | val_0_accuracy: 0.65102 |  1:13:07s

epoch 15 | loss: 0.73404 | val_0_accuracy: 0.6488  |  1:17:50s

epoch 16 | loss: 0.7341  | val_0_accuracy: 0.64976 |  1:22:40s

epoch 17 | loss: 0.73393 | val_0_accuracy: 0.64614 |  1:27:45s

epoch 18 | loss: 0.73432 | val_0_accuracy: 0.64931 |  1:32:35s

epoch 19 | loss: 0.73412 | val_0_accuracy: 0.65019 |  1:37:19s

epoch 20 | loss: 0.73394 | val_0_accuracy: 0.64603 |  1:42:03s

epoch 21 | loss: 0.73383 | val_0_accuracy: 0.64879 |  1:46:59s

epoch 22 | loss: 0.73378 | val_0_accuracy: 0.65194 |  1:51:46s

epoch 23 | loss: 0.73382 | val_0_accuracy: 0.64757 |  1:56:31s

epoch 24 | loss: 0.73273 | val_0_accuracy: 0.65256 |  2:01:18s

epoch 25 | loss: 0.73179 | val_0_accuracy: 0.65256 |  2:06:03s

epoch 26 | loss: 0.73212 | val_0_accuracy: 0.64692 |  2:10:49s

epoch 27 | loss: 0.73154 | val_0_accuracy: 0.65011 |  2:15:33s

epoch 28 | loss: 0.73155 | val_0_accuracy: 0.61447 |  2:20:19s

epoch 29 | loss: 0.73138 | val_0_accuracy: 0.65297 |  2:25:04s

Stop training because you reached max_epochs = 30 with best_epoch = 29 and best_val_0_accuracy = 0.65297

 In addition, we added an explanation of hyperparameter optimization in the revised manuscript (page 7, line 260-262) as follows:

 “The above model was trained in 30 epochs, 128 batch_size using keras and tensorflow package, hyperparameter optimization was performed with RandomSearchCV to perform effective differential diagnosis between three classes”  

Comment 9: English Language and Grammar

The manuscript would benefit from professional English language editing to correct grammatical errors, improve readability, and reduce repetition.

Response: We appreciate the reviewer's valuable suggestion. We have revised the whole manuscript with the support of a native English speaker for English language editing. The revisions can be seen throughout the revised manuscript with track changes.

Comment 10: Figures and Image Quality

Figures (particularly Figures 2 and 3) require higher resolution and clearer axis labels to meet MDPI publication standards.

Response: We are thankful for the reviewer's suggestion. We adjusted the y axis label in Figure 3 to accurately represent the AUC value range from 0.0 to 1.0, changed the color of the bars to match with the figure legend, and adjusted the resolution of all figures in the revised manuscript.

Comment 11: Table Consistency

In Table 3, several percentages and frequencies—especially for chikungunya symptoms—appear inconsistent with known epidemiology. Please verify and correct as needed.

Response: We appreciate the reviewer’s valuable suggestion. After checking against the original paper, we confirmed that the percentages and frequencies for chikungunya are consistent with the descriptive statistics in the original paper. Its authors also noted the absence of symptoms in records of chikungunya cases, which makes the figures appear inconsistent with known epidemiology. We corrected one typo in Table 1 in the number of missing/ignored records for gestational age, from 6,687,12 to 6,687,215. In addition, we added Table 5, which had been lost during editing of the previous manuscript.

Comment 12: Citations

Some statements would benefit from clearer linkage to the cited literature. Please ensure that claims, especially regarding previous multiclass work, are directly supported by the referenced studies.

Response: We appreciate the reviewer for this important suggestion and have reviewed the cited literature.

Comment 13: I recommend that the authors cite and discuss the recent Lancet Regional Health Southeast Asia article (https://www.thelancet.com/journals/lansea/article/PIIS2772-3682(25)00101-5/fulltext), which provides updated insights into arboviral epidemiology and diagnostic challenges. Incorporating this reference would strengthen the contextual foundation of the Introduction and better situate the current study within the most recent literature.

Response: We appreciate the reviewer's suggestion. After reading the suggested paper, we added it as a reference in the Introduction section (page 2-3, line 89-91), where we describe a recent effort to combine spectroscopy techniques with machine learning for rapid diagnosis of dengue and chikungunya in resource-limited regions, as follows:

“Recently, novel approach using micro-spectroscopy techniques combined with machine learning showed a promising application in rapid classification of dengue and chikungunya in remote areas”

Comment 14: The manuscript has significant potential, but the issues outlined above—especially those related to data quality, feature leakage, imbalance handling, and interpretability—must be addressed before the work can be considered for publication.

Response: We appreciate the reviewer's valuable suggestion. We tried our best to address the issues raised by the reviewer.

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript addresses a highly relevant challenge in global health that is the rapid differentiation between DENV and CHIKV infections in the absence of laboratory diagnostics. The authors apply a well-structured machine learning approach to a large dataset. The study is timely, technically sound, and offers meaningful insights into the potential of AI tools for improving infectious disease surveillance and triage. However, it requires some improvements, as suggested below:

1- There is a minor error in the data description. “The dataset contains 4,307,513 million records…” is a typographical error; it should be 4,307,513 records.

2- The paper reports use of an external test set with specific AUC results (RF macro AUC = 0.8329), but does not explain where this external dataset originated from, how it was partitioned, or if it differs by geography or time.

3- SMOTE oversampling was used to balance classes in the training set. However, the authors didn’t discuss potential limitations of this technique.

4- There are some small spelling issues in the reference list. It’s good to go over the references again. In addition, provide a brief explanation of what constitutes a discarded case in the dataset. Readers might not immediately know this term.

5- It would be helpful to briefly explain how the hyperparameters were chosen and how the ANN was trained. Also, please clarify whether the reported performance metrics are macro averaged, just to avoid any confusion.

6- It would be helpful to briefly highlight how this approach could make a real world impact. For example, could the model be used during outbreaks or added to current surveillance systems? Explaining this upfront would strengthen the motivation for the study. In addition, one useful point to consider is whether this modeling approach could be extended to include other viruses with overlapping clinical features. A brief comment on this possibility could enhance the broader relevance of the work.

Author Response

Comment 1: This manuscript addresses a highly relevant challenge in global health that is the rapid differentiation between DENV and CHIKV infections in the absence of laboratory diagnostics. The authors apply a well-structured machine learning approach to a large dataset. The study is timely, technically sound, and offers meaningful insights into the potential of AI tools for improving infectious disease surveillance and triage.

Response: Thank you very much for your valuable comments.

Comment 2: There is a minor error in the data description. “The dataset contains 4,307,513 million records…” is a typographical error; it should be 4,307,513 records.

Response: We thank the reviewer for pointing it out. The reviewer is correct, and we have revised it in the manuscript (page 4, line 180-182). The revised text reads as follows:

“We used the open-source dataset from da Silva Neto et al. [39], which consists of 4,307,513 records of dengue cases; 325,000 of chikungunya; and 2,100,029 discarded cases…”

Comment 3: The paper reports use of an external test set with specific AUC results (RF macro AUC = 0.8329), but does not explain where this external dataset originated from, how it was partitioned, or if it differs by geography or time.

Response: We thank the reviewer for this comment. We changed the term “external test set” into “internal test set” and provided a brief explanation of this dataset in the revised manuscript (page 5, line 221-224) as follows:

 “The testing data is also called the internal test set, which has similar features to the training set, including years, geographic regions, municipalities and class distribution. This test set is kept separately during model training”

Comment 4: SMOTE oversampling was used to balance classes in the training set. However, authors didn’t discuss potential limitations of this technique.

Response: We appreciate the reviewer for this important suggestion. We added the statements about potential limitation of SMOTE technique with the possibility of causing overfitting in the revised manuscript (page 5, line 229-232) as follows:

 “The performance of models was compared after applying different approaches to handle the issue of imbalanced data and SMOTE showed the prominent advantage over techniques. Therefore, SMOTE was applied in this study despite its potential concern of overfitting”

Comment 5: There are some small spelling issues in the reference list. It’s good to go over the references again.

Response: We appreciate the reviewer's suggestion. We have checked all the references again and verified the citations through the Google Scholar and PubMed databases.

Comment 6: In addition, provide a brief explanation of what constitutes a discarded case in the dataset. Readers might not immediately know this term.

Response: We appreciate the reviewer for this critical suggestion. We added the explanation of discarded cases in the revised manuscript (page 4, line 177-178).

Comment 7: It would be helpful to briefly explain how the hyperparameters were chosen and how the ANN was trained. Also, please clarify whether the reported performance metrics are macro averaged, just to avoid any confusion.

Response: We thank the reviewer for this critical suggestion. We added the explanation of hyperparameter optimization and ANN model training in the revised manuscript (page 7, line 261-263) as follows:

“The above model was trained in 30 epochs, 128 batch_size using keras and tensorflow package, hyperparameter optimization was performed with RandomSearchCV to perform effective differential diagnosis between three classes”

 We also clarified the metrics for classification are macro-averaged in the revised manuscript (page 7, line 288) as follows:

“The metrics for multi-class classification were obtained by using macro averaging.”
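Macro averaging computes each metric per class and then takes the unweighted mean, so the minority class counts as much as the majority class. The sketch below illustrates the calculation on made-up labels; the function name and the toy predictions are assumptions, and F1 is averaged per class (one common convention).

```python
def macro_metrics(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy three-class example (labels are illustrative, not study results).
y_true = ["dengue"] * 4 + ["chikungunya"] * 2 + ["discarded"] * 2
y_pred = ["dengue", "dengue", "dengue", "chikungunya",
          "chikungunya", "dengue",
          "discarded", "discarded"]

macro_p, macro_r, macro_f1 = macro_metrics(y_true, y_pred)
print(f"macro precision={macro_p:.2f} recall={macro_r:.2f} F1={macro_f1:.2f}")
```

In this toy case the per-class F1 values are 0.75 (dengue), 0.50 (chikungunya) and 1.00 (discarded), so the macro average is 0.75 even though dengue makes up half of the records.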

Comment 8: It would be helpful to briefly highlight how this approach could make a real world impact. For example, could the model be used during outbreaks or added to current surveillance systems? Explaining this upfront would strengthen the motivation for the study. In addition, one useful point to consider is whether this modeling approach could be extended to include other viruses with overlapping clinical features. A brief comment on this possibility could enhance the broader relevance of the work.

Response: We thank the reviewer for this important suggestion. A short description of practical application of our work in the field is provided in the manuscript (page 12, line 414-419) as follows:

“For instance, a young physician in remote areas of one province can use the model, enter epidemiological and clinical information of a new patient who comes from another province, and achieve a reliable diagnosis of that patient to plan medical assistance for him/her, such as hospitalization. This model could be trained with other arboviral diseases like Zika or yellow fever for active disease surveillance and case management in the field.”

Reviewer 3 Report

Comments and Suggestions for Authors
  • Data availability link not working: Please correct the link and ensure the dataset is accessible. Also provide a short description of the dataset

  • Main novelty unclear: Clearly explain what is new in your study compared with existing work. A short table or structured bullet list showing previous work vs this work vs novelty will make the contribution more visible.

  • Comparative analysis required: Include comparison of your model with previous ML/AI studies on Dengue/Chikungunya and report standard imbalanced-data metrics (e.g., F1, PR-AUC, recall). If external validation is not available, please state this.

  • Add and discuss suggested references: Please cite and briefly discuss the following AI/Dengue related works and explain how they connect to your approach: Physica Scripta (100, 2025, DOI:10.1088/1402-4896/addfbc) and Nonlinear Dynamics (111, 2023, DOI:10.1142/S1793524524501328).

  • State limitations: Mention limitations such as data imbalance, sampling bias, limited generalization, lack of external validation or clinical deployment issues, and describe them clearly in a short limitations section.

Author Response

Comment 1: Data availability link not working: Please correct the link and ensure the dataset is accessible. Also provide a short description of the dataset.

Response: We thank the reviewer for this suggestion. After checking the link in the manuscript, we confirm that the link we provided is still working, and the dataset is also accessible via the original research with this link: https://www.nature.com/articles/s41597-022-01312-7. In addition, a short description of the dataset is provided in the manuscript (page 4, line 182-188) as follows:

“We used an open-source dataset from da Silva Neto et al. [39], which consists of 4,307,513 dengue, 325,000 chikungunya, and 2,100,029 discarded cases in Brazil from 2013 to 2020 for classification. There are 55 variables but 9 from laboratory data were excluded. The remaining features were classified into three groups: demographic, clinical, and comorbidity data. One important characteristic of this dataset is that some important diagnostic features such as days from symptom onset, and severity markers were not collected by the researchers”

Comment 2: Main novelty unclear: Clearly explain what is new in your study compared with existing work. A short table or structured bullet list showing previous work vs this work vs novelty will make the contribution more visible.

Response: We thank the reviewer for this suggestion. A short description of novelty of our work is provided in the manuscript (page 12, line 385-390 and line 410-414) to illustrate our new approach in using epidemiological variables for prediction and an internal test set for performance evaluation instead of using only validation set like previous work. The description is written in the manuscript as follows:

“Moreover, our study exclusively used demographic and clinical data to train the models, which achieved high performance. From a clinical perspectives, epidemiological and demographic variables are perceived as less influential, and are usually ignored when diagnosing patients with dengue or chikungunya. In a previous study, experienced physicians only selected clinical symptoms, two pre-existing diseases (diabetes and hypertension), and days from symptoms onset as input data for training ML models”

“The use of internal test set will enhance the reliability of ML and DL algorithms when predicting data not previously encountered. When deployed through computer interfaces or portable devices, these models can assist frontline healthcare workers by providing accurate and timely differentiation of arboviral diseases such as dengue and chikungunya”

Comment 3: Comparative analysis required: Include comparison of your model with previous ML/AI studies on Dengue/Chikungunya and report standard imbalanced-data metrics (e.g., F1, PR-AUC, recall). If external validation is not available, please state this.

Response: We thank the reviewer for this valuable suggestion. We added the comparison of our model with previous work in the manuscript (page 11, line 368-372) as follows:

“Our work suggested that RF model works best in differentiating dengue and chikungunya with a macro-averaged recall of 0.92288, precision of 0.9111, f1-score of 0.9196, which are higher than the metrics of a previous work using GB model on balanced dataset and achieved the recall, precision, and f1-score of 0.6257, 0.6205, and 0.6196, respectively [34].”  

In addition, we changed the term “external test set” into “internal test set” and provided a brief explanation of this dataset in the revised manuscript (page 5, line 222-225) as follows:

“The testing data is also called the internal test set, which has similar features to the training set, including years, geographic regions, municipalities and class distribution. This test set is kept separately during model training”

Comment 4: Add and discuss suggested references: Please cite and briefly discuss the following AI/Dengue related works and explain how they connect to your approach: Physica Scripta (100, 2025, DOI:10.1088/1402-4896/addfbc) and Nonlinear Dynamics (111, 2023, DOI:10.1142/S1793524524501328).

Response: We appreciate the reviewer's suggestion. We read both suggested references. One reference (100, 2025, DOI:10.1088/1402-4896/addfbc) uses a similar approach to our work in building a prediction model for heart failure using clinical features. Machine learning models like Random Forest and XGBoost were used for improving diagnostic performance. This reference was added to the revised manuscript as reference no. 24 (page 2, line 83). The other reference introduces the use of machine learning for disease transmission, which is not the main aim of our study, so we will cite it in our future work on dengue outbreak prediction.

Comment 5: State limitations: Mention limitations such as data imbalance, sampling bias, limited generalization, lack of external validation or clinical deployment issues, and describe them clearly in a short limitations section.

Response: We appreciate the reviewer's suggestion. We updated the limitations of our study regarding data imbalance and feature interpretability in the revised manuscript (pages 12-13, lines 424-430) as follows:

“Another concerns the approach used to address the imbalanced dataset. We only applied the SMOTE technique, which made it difficult to compare the models’ performance with previous studies that either used balanced data or applied other methods like down sampling technique. In addition, lack of interpretability analysis of top prediction features for clinical adoption is another limitation of this study, and we hope to apply SHAP analysis in future work to understand more which features contribute most to disease prediction”

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have responded to all reviewer comments; however, several key methodological concerns remain insufficiently resolved.

While textual clarifications were added, major scientific risks persist, including:
(i) treatment of missing clinical symptoms as true absence,
(ii) strong dependence on geographic identifiers suggesting data leakage,
(iii) lack of interpretability analyses critical for clinical deployment, and
(iv) absence of true external validation despite claims of generalizability.

The current models appear to function primarily as spatial surveillance classifiers rather than patient-level diagnostic tools, which limits the translational claims made in the manuscript.

Therefore, I recommend revision with the following required actions:

  • Re-encode missing symptom values separately or perform sensitivity analysis.

  • Report model performance using clinical features only and clearly reframe application scope.

  • Apply appropriate imbalance techniques for categorical data (e.g., class-weighting or SMOTE-NC).

  • Provide at least basic feature importance or surrogate interpretability.

  • Clarify that validation is internal and revise claims of generalizability accordingly.
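To illustrate why the choice of imbalance technique matters for categorical data, the sketch below reproduces only the interpolation step of SMOTE on a hypothetical row. It is not the authors' pipeline: the feature layout, the values, and the function name `smote_interpolate` are assumptions made for illustration, and a full SMOTE implementation would also select the neighbor from the k nearest minority-class samples.

```python
import random


def smote_interpolate(x, neighbor, rng):
    """Create one synthetic point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation factor drawn uniformly from [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]


rng = random.Random(0)

# Hypothetical minority-class rows: [age, FEBRE code, CEFALEIA code],
# where symptom codes are categorical (1 = present, 2 = absent, 9 = missing).
x = [30.0, 1.0, 9.0]
neighbor = [40.0, 2.0, 1.0]

synthetic = smote_interpolate(x, neighbor, rng)
# The age is a plausible intermediate value, but the symptom codes fall
# between category labels and no longer correspond to any defined level.
```

Variants such as SMOTE-NC avoid this by interpolating only the continuous features and assigning each categorical feature the most frequent category among the nearest neighbors.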

Author Response

Comment 1: Re-encode missing symptom values separately or perform sensitivity analysis.

Response: We thank the reviewer for this suggestion. As explained in the previous response to comment 1 from reviewer 1, we rechecked the original article and found that, in the codebook, the authors stated that they encoded the presence of a symptom as 1 and its absence as 2. However, when checking the raw dataset, we found that it also contains values encoded as 9. Therefore, we kept missing values as a separate class for each clinical feature, so each feature was encoded as follows: 1 for positive, 2 for negative, and 9 for ignored/missing value.
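The encoding described above can be sketched as a three-level indicator per symptom, so that code 9 is never merged with code 2. This is a minimal illustration under stated assumptions: the helper `one_hot_symptom` and the example records are hypothetical, and only the codes 1, 2, and 9 come from the response.

```python
# Codes described in the response: 1 = present, 2 = absent, 9 = ignored/missing.
SYMPTOM_LEVELS = {1: "present", 2: "absent", 9: "missing"}


def one_hot_symptom(code):
    """Return a 3-element indicator vector [present, absent, missing]."""
    if code not in SYMPTOM_LEVELS:
        raise ValueError(f"unexpected symptom code: {code!r}")
    return [int(code == level) for level in SYMPTOM_LEVELS]


# Hypothetical notification records using two of the dataset's field names.
records = [
    {"FEBRE": 1, "CEFALEIA": 9},
    {"FEBRE": 2, "CEFALEIA": 1},
]

encoded = [
    {name: one_hot_symptom(code) for name, code in row.items()}
    for row in records
]
```

Keeping "missing" as its own level lets a model distinguish an unrecorded symptom from a recorded absence, and it makes a sensitivity analysis straightforward, for example by refitting after dropping the rows with code 9.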

Comment 2: Report model performance using clinical features only and clearly reframe application scope.

Response: We thank the reviewer for this valuable suggestion. As shown in the previous response to comment 5 from reviewer 1, we ran the machine learning models again using only 23 features, including three basic demographic features (age, gender, race), 13 clinical features (fever, myalgia, headache, rash, vomit, nausea, back pain, conjunctivitis, arthritis, arthralgia, petechiae, tourniquet test, retro-orbital pain), and seven comorbidity features (diabetes, liver disease, kidney disease, hypertension, peptic acid disease, and autoimmune disease). The models achieved very low performance, with macro-averaged recall, precision, and F1-score all below 0.50, as presented in the table below:

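The performance of different ML models without epidemiological features

| ML models | Recall | Precision | F1-score | AUC |
|-----------|--------|-----------|----------|------|
| RF        | 0.36   | 0.45      | 0.34     | 0.69 |
| DT        | 0.36   | 0.43      | 0.34     | 0.68 |
| AD        | 0.34   | 0.39      | 0.28     | 0.65 |
| GB        | 0.36   | 0.41      | 0.32     | 0.70 |
| XG        | 0.35   | 0.39      | 0.32     | 0.47 |
| KNN       | 0.34   | 0.39      | 0.29     | 0.67 |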

Therefore, we decided to add epidemiological features in the models to see whether they could improve the prediction capability of machine learning and deep learning algorithms, with the results illustrated in Table 4 in the previously revised manuscript as follows:

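Without SMOTE

| ML models | Accuracy | Specificity | Recall | Precision | F1-score | AUC    |
|-----------|----------|-------------|--------|-----------|----------|--------|
| RF        | 0.8501   | 0.8994      | 0.9016 | 0.7846    | 0.8245   | 0.9436 |
| DT        | 0.8384   | 0.8884      | 0.8785 | 0.7858    | 0.8226   | 0.9225 |
| AD        | 0.7240   | 0.8376      | 0.6581 | 0.5974    | 0.5919   | 0.8414 |
| GB        | 0.8601   | 0.9452      | 0.9068 | 0.7946    | 0.8358   | 0.9576 |
| XG        | 0.9113   | 0.9351      | 0.9333 | 0.8711    | 0.8983   | 0.9831 |
| KNN       | 0.8641   | 0.9064      | 0.9104 | 0.8059    | 0.8443   | 0.9686 |

With SMOTE

| ML models | Accuracy | Specificity | Recall | Precision | F1-score | AUC    |
|-----------|----------|-------------|--------|-----------|----------|--------|
| RF        | 0.9292   | 0.9562      | 0.9288 | 0.9111    | 0.9196   | 0.9853 |
| DT        | 0.9221   | 0.9470      | 0.9137 | 0.9072    | 0.9104   | 0.9347 |
| AD        | 0.8338   | 0.9033      | 0.8703 | 0.7776    | 0.8147   | 0.9101 |
| GB        | 0.8664   | 0.9329      | 0.9101 | 0.8061    | 0.8452   | 0.9675 |
| XG        | 0.9141   | 0.9531      | 0.9357 | 0.8733    | 0.9007   | 0.9841 |
| KNN       | 0.9248   | 0.9561      | 0.9342 | 0.8907    | 0.9108   | 0.9748 |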

 

Comment 3: Apply appropriate imbalance techniques for categorical data (e.g., class-weighting or SMOTE-NC).

Response: We thank the reviewer for this important suggestion. We applied the class weighting technique to handle the imbalanced dataset and reran the Random Forest algorithm (class_weight = ‘balanced’), where the classes of the target feature are weighted inversely proportional to how frequently they appear in the original dataset. The performance of the Random Forest model with and without handling of the imbalanced dataset is presented in the table below:

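| Random Forest with | Accuracy | Specificity | Recall | Precision | F1 score | AUC    |
|--------------------|----------|-------------|--------|-----------|----------|--------|
| No oversampling    | 0.8501   | 0.8994      | 0.9016 | 0.7846    | 0.8245   | 0.9436 |
| Class weighted     | 0.9304   | 0.9505      | 0.9254 | 0.9157    | 0.9203   | 0.9861 |
| SMOTE              | 0.9292   | 0.9562      | 0.9288 | 0.9111    | 0.9196   | 0.9853 |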

Since we want to develop a model to predict patients with dengue or chikungunya disease, a model with higher recall (sensitivity) and specificity will be preferred.
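The "balanced" class weighting mentioned in this response follows the formula documented by scikit-learn, n_samples / (n_classes * class_count). The sketch below reproduces that arithmetic in plain Python on a hypothetical 90/10 class split; the function name `balanced_class_weights` and the toy labels are assumptions for illustration and do not reflect the actual class distribution of the study.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


# Hypothetical imbalanced label vector: 90 majority cases, 10 minority cases.
labels = ["dengue"] * 90 + ["chikungunya"] * 10
weights = balanced_class_weights(labels)
# The minority class receives a weight of 5.0 and the majority class about 0.56,
# so misclassified minority cases are penalized more heavily during training.
```

Unlike oversampling, this approach does not create synthetic rows, so categorical symptom codes are left unchanged.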

Comment 4: Provide at least basic feature importance or surrogate interpretability.

Response: We thank the reviewer for this valuable suggestion. We had already assessed feature importance using the RFECV technique with Random Forest as the baseline model; the resulting importance scores are illustrated in Figure 2 of the revised manuscript.

Figure 2. The important features are selected using the RFECV technique with RF. The y-axis presents the 25 important features selected using the RFECV technique (blue bar). The x-axis presents the percentage of importance. (Note: SEM_PRI: Epidemiological week of onset of symptom; TPAUTOCTO: Indicates whether the case is indigenous to the area of residence; COMUNINF: City where the patient was infected; ID_MN_RESI: City of the patient; ID_REGIONA: Health care regional code (where the health unit or other reporting source is located); NU_IDADE_N: Patient age; CEFALEIA: Headache; FEBRE: Fever; ARTRALGIA: Arthralgia; DOR_RETRO: Retro-orbital pain; HEMATOLOG: Hematological disease; CS_RACA: Patient Race; CONJUNTVIT: Conjunctivitis; DOR_COSTAS: Back Pain; EXANTEMA: Rash; LACO: Tourniquet test; VOMITO: Vomiting; PETEQUIA_N: Petechiae; HIPERTENSA: Hypertension; ARTRITE: Arthritis; CS_ESCOL_N: Patient education; NAUSEA: Nausea; CS_GESTANT: Gestational Age of the Patient (Quarter) in case Sex is Female; AUTO_IMUNE: Autoimmune disease; ACIDO_PEPT: Peptic acid disease).

Furthermore, the score of each important feature in Figure 2 is presented in the table below:

| Feature    | Score    |
|------------|----------|
| SEM_PRI    | 0.329141 |
| TPAUTOCTO  | 0.151176 |
| COMUNINF   | 0.132481 |
| ID_MN_RESI | 0.106001 |
| ID_REGIONA | 0.058417 |
| NU_IDADE_N | 0.034040 |
| CEFALEIA   | 0.014734 |
| FEBRE      | 0.014671 |
| ARTRALGIA  | 0.014266 |
| DOR_RETRO  | 0.011935 |
| HEMATOLOG  | 0.011645 |
| CS_RACA    | 0.011375 |
| CONJUNTVIT | 0.011038 |
| DOR_COSTAS | 0.009925 |
| EXANTEMA   | 0.009877 |
| LACO       | 0.009116 |
| VOMITO     | 0.009062 |
| PETEQUIA_N | 0.008687 |
| HIPERTENSA | 0.008526 |
| ARTRITE    | 0.008426 |
| CS_ESCOL_N | 0.008293 |
| NAUSEA     | 0.007963 |
| CS_GESTANT | 0.007042 |
| AUTO_IMUNE | 0.006314 |
| ACIDO_PEPT | 0.005850 |
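As a quick arithmetic check on the scores listed above, the sketch below sums them and computes the share carried by the five temporal and geographic fields. The grouping in `SPATIOTEMPORAL` is an assumption made here for illustration (based on the field descriptions in the Figure 2 note), not a classification taken from the manuscript.

```python
# Importance scores as reported in the response.
importances = {
    "SEM_PRI": 0.329141, "TPAUTOCTO": 0.151176, "COMUNINF": 0.132481,
    "ID_MN_RESI": 0.106001, "ID_REGIONA": 0.058417, "NU_IDADE_N": 0.034040,
    "CEFALEIA": 0.014734, "FEBRE": 0.014671, "ARTRALGIA": 0.014266,
    "DOR_RETRO": 0.011935, "HEMATOLOG": 0.011645, "CS_RACA": 0.011375,
    "CONJUNTVIT": 0.011038, "DOR_COSTAS": 0.009925, "EXANTEMA": 0.009877,
    "LACO": 0.009116, "VOMITO": 0.009062, "PETEQUIA_N": 0.008687,
    "HIPERTENSA": 0.008526, "ARTRITE": 0.008426, "CS_ESCOL_N": 0.008293,
    "NAUSEA": 0.007963, "CS_GESTANT": 0.007042, "AUTO_IMUNE": 0.006314,
    "ACIDO_PEPT": 0.005850,
}

# Assumed grouping: onset week, indigenous-case flag, and location identifiers.
SPATIOTEMPORAL = ("SEM_PRI", "TPAUTOCTO", "COMUNINF", "ID_MN_RESI", "ID_REGIONA")

total = sum(importances.values())
spatiotemporal_share = sum(importances[f] for f in SPATIOTEMPORAL) / total
remaining_share = 1.0 - spatiotemporal_share
```

The scores sum to approximately 1, and under this grouping the five temporal and geographic fields account for roughly 78% of the total importance, with the remaining 20 demographic, clinical, and comorbidity features sharing about 22%.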

 

Comment 5: Clarify that validation is internal and revise claims of generalizability accordingly.

Response: We appreciate the reviewer’s suggestion. In the previously revised manuscript, we clarified that validation is performed with an internal test set in the Methods section (Figure 1, page 4, lines 182-183) and the Results section (Figure 3, page 10, lines 314-315 and Figure 4, page 11, lines 333-335).

In addition, we revised claims of generalizability (page 13, line 409-411) as follows:

“By incorporating the internal test set, these models might have a potential application as a supportive tool in screening of dengue and chikungunya diseases in Brazilian populations”

Author Response File: Author Response.docx
