Differential Classification of Dengue, Zika, and Chikungunya Using Machine Learning—Random Forest and Decision Tree Techniques
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Reviewer's report
The manuscript addresses one of the critical issues at the level of clinical diagnosis of arboviral diseases, as the symptoms of dengue, Zika and chikungunya are very similar. However, the manuscript requires extensive revision and considerable additional information before it can be published. These points are listed below:
1. The data size of Chikungunya is the major shortcoming of the manuscript. Although the author has already used bootstrapping (or cross-validation?), the effect of 6% (i.e. only 9 patients) was significant. The author might consider including more chikungunya data or making it binary (dengue and Zika).
2. Large loss of information/training weight due to dengue and Zika classes. To create a balanced dataset (the author has illustrated this in Figure 2), which means that the bootstrapping sample is applied to the pool of dengue and Zika and the maximum number of data is also six (therefore you have 33% for each of the classes), which is a big loss for the dengue and Zika data.
3. The methodology needs to include the details of the data transformation.
4. The authors mention the use of bootstrapping, but cross-validation is also mentioned in some of the methods. They differ in terms of the resampling. Authors need to clarify which re-sampling/validation was used.
5. The result was weakly presented. To get a more meaningful result, at least a confusion matrix showing the performance of each class is important. Then we see the performance of the model for the Chikungunya class. What about the training curve?
6. The discussion in the manuscript is weak. Many of the results were not compared to previous studies, and what is the salient point compared to other studies? As well as the potential application or use of the model at a clinical level and where the limitations of the study lie.
7. The manuscript does not contain any information on the requirements for medical or ethical approval for the use of clinical data.
Please see the attached files for further details.
Thank you.
Comments for author File: Comments.pdf
Moderate editing of English language required
Author Response
Dear Reviewer:
We would like to thank you for the review and valuable feedback you provided on our article titled “Differential Classification of Dengue, Zika, and Chikungunya Using Machine Learning, Random Forests, and Decision Trees.” We have worked diligently to address your comments and suggestions, and I am pleased to inform you that we have completed the requested corrections.
Below we have summarized the major edits made in response to your comments.
Reviewer 1.
The manuscript addresses one of the critical issues at the level of clinical diagnosis of arboviral diseases, as the symptoms of dengue, Zika and chikungunya are very similar. However, the manuscript requires extensive revision and considerable additional information before it can be published. These points are listed below:
- The data size of Chikungunya is the major shortcoming of the manuscript. Although the author has already used bootstrapping (or cross-validation?), the effect of 6% (i.e. only 9 patients) was significant. The author might consider including more chikungunya data or making it binary (dengue and Zika).
Answer:
The highlighted text clarifies the number of patients obtained after applying the bootstrapping technique. It was not possible to collect more data on chikungunya and Zika, since these diseases occurred in cycles during 2015-2016 and their information collection protocols differed from those used for dengue. For this reason, we obtained data from a clinic that had detailed information on signs, symptoms, and laboratory results. We decided against reducing the problem to a binary classification of dengue and Zika, since the relevance of the study lies in classifying the three diseases: at the time the experiments were carried out, no research focused on all three diseases and no available dataset related them.
Before
3.1.2. Dataset creation
The creation of the dataset arose from the need for a dataset linking signs, symptoms, and laboratory results of dengue, Zika, and chikungunya, as one did not previously exist. To this end, data were collected in collaboration with the Las Peñitas Clinic in Sincelejo, Colombia. Historical records for chikungunya correspond to 2015, Zika to 2016, and dengue to 2020. This dataset consisted of 151 rows and 24 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
Now
3.1.2. Dataset creation
The creation of the dataset for these three diseases became necessary due to the lack of a public dataset that integrated information on signs, symptoms, and clinical laboratory results. Literature reviews conducted by [20] [21] show limitations in the studies on these three diseases, which is attributed to the lack of adequate datasets. The dataset used in this study was obtained from fully anonymized records, thus ensuring patient privacy and confidentiality. Data collection was carried out in collaboration with the Clínica Las Peñitas in Sincelejo, Colombia, complying with local ethical and legal requirements. It should be noted that the project was reviewed and approved by the clinic's ethics committee, ensuring compliance with the relevant regulations for the use of clinical data in research. Historical records cover chikungunya cases from 2015, zika from 2016, and dengue from 2020. The resulting dataset consists of 151 records and 28 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
Before
3.1.4. Target balance through bootstrapping
One of the main drawbacks of the dataset is its small size, with only 151 rows and 24 columns, which classifies it as a small data dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for Chikungunya disease, which represents only 6% of the data. Faced with this challenge and considering that there are several options to try to balance the data, the bootstrapping technique was chosen to generate a balanced dataset for the dengue, zika, and chikungunya labels in the target variable. Bootstrapping was chosen because it is a specialised resampling technique based on the central limit theorem, which generates new samples by taking random samples from the existing data rather than creating new synthetic samples from the data in the minority classes like the adaptive synthetic ADASYN [38], synthetic minority oversampling SMOTE [39-40] or data augmentation [35] [41] algorithms. Figure 2 presents the data-balancing process.
The following section refers to the need to create a data set that includes the three diseases.
Now
3.1.4. Target balance through bootstrapping
One of the main drawbacks of the dataset is its limited size, with only 150 rows and 24 columns (89 Zika, 52 Dengue, and 9 Chikungunya records), which classifies it as a small dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for Chikungunya disease, which represents only 6% of the data. Faced with this challenge and considering the various options for balancing the data, the bootstrapping technique was chosen. This technique allowed creating new samples that balanced the dengue and chikungunya labels with Zika disease in the target variable. As a result, the new dataset increased from 150 to 267 samples in total. Each disease now has 89 samples, without reducing the Zika records, which previously had 89 records before applying the technique. Bootstrap was chosen because it is a specialized resampling technique based on the central limit theorem, which generates new samples by randomly sampling existing data rather than creating new synthetic samples from the minority class data, such as the adaptive synthetic algorithms ADASYN [38], synthetic minority oversampling SMOTE [39-40] or data augmentation [35] [41]. Figure 2 presents the data balancing process.
- Large loss of information/training weight due to dengue and Zika classes. To create a balanced dataset (the author has illustrated this in Figure 2), which means that the bootstrapping sample is applied to the pool of dengue and Zika and the maximum number of data is also six (therefore you have 33% for each of the classes), which is a big loss for the dengue and Zika data.
Answer
There is no loss of information when applying the bootstrapping technique. In fact, the technique, which is based on the central limit theorem, increases the number of records so that each of the diseases has 89.
Before
3.1.4. Target balance through bootstrapping
One of the main drawbacks of the dataset is its small size, with only 151 rows and 24 columns, which classifies it as a small data dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for Chikungunya disease, which represents only 6% of the data. Faced with this challenge and considering that there are several options to try to balance the data, the bootstrapping technique was chosen to generate a balanced dataset for the dengue, zika, and chikungunya labels in the target variable. Bootstrapping was chosen because it is a specialised resampling technique based on the central limit theorem, which generates new samples by taking random samples from the existing data rather than creating new synthetic samples from the data in the minority classes like the adaptive synthetic ADASYN [38], synthetic minority oversampling SMOTE [39-40] or data augmentation [35] [41] algorithms. Figure 2 presents the data-balancing process.
Now
3.1.4. Target balance through bootstrapping
One of the main drawbacks of the dataset is its limited size, with only 150 rows and 24 columns (89 Zika, 52 Dengue, and 9 Chikungunya records), which classifies it as a small dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for Chikungunya disease, which represents only 6% of the data. Faced with this challenge and considering the various options for balancing the data, the bootstrapping technique was chosen. This technique allowed creating new samples that balanced the dengue and chikungunya labels with Zika disease in the target variable. As a result, the new dataset increased from 150 to 267 samples in total. Each disease now has 89 samples, without reducing the Zika records, which previously had 89 records before applying the technique. Bootstrap was chosen because it is a specialized resampling technique based on the central limit theorem, which generates new samples by randomly sampling existing data rather than creating new synthetic samples from the minority class data, such as the adaptive synthetic algorithms ADASYN [38], synthetic minority oversampling SMOTE [39-40] or data augmentation [35] [41]. Figure 2 presents the data balancing process.
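The balancing step described in this response (89 Zika, 52 dengue, and 9 chikungunya records resampled to 89 per class) can be illustrated with a minimal sketch. It assumes that every original record is kept and that each minority class is topped up by random draws with replacement from its own records; the response does not state whether the authors resampled this way. The function name `balance_by_bootstrap` and the placeholder records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_bootstrap(records, label_of, seed=0):
    """Oversample each minority class, with replacement, up to the majority size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)  # ASSUMPTION: every original record is kept
        # Draw the missing records at random, with replacement, from the same class.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced


# Placeholder records with only an id and a label; the class counts follow the text.
records = (
    [{"id": i, "disease": "zika"} for i in range(89)]
    + [{"id": 89 + i, "disease": "dengue"} for i in range(52)]
    + [{"id": 141 + i, "disease": "chikungunya"} for i in range(9)]
)
balanced = balance_by_bootstrap(records, label_of=lambda r: r["disease"])
class_counts = Counter(r["disease"] for r in balanced)
print(len(balanced), dict(class_counts))
```

Under these assumptions the balanced list holds 267 records, 89 per class, and contains no record that was not already in the original 150: the added rows are repeats of existing ones.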
- The methodology needs to include the details of the data transformation.
Before
3.2.1. Data transformation was performed according to methodology based on the PAHO Guidelines (2022).
In this phase, data transformation is performed according to the methodology based on the PAHO Guidelines (2022) proposed by [22]. Quantitative values were assigned to each categorical variable in the dataset following the guidelines of the Pan American Health Organization (PAHO), allowing a differential value based on medical evidence to be assigned to variables that match those proposed in these guidelines.
Now
3.2.1. Data transformation was performed according to methodology based on the PAHO Guidelines (2022).
In this phase, the data transformation was carried out following the methodology based on the PAHO Guidelines (2022) proposed by Arrubla et al. [22]. Quantitative values were assigned to each categorical variable in the data set in accordance with the guidelines of the Pan American Health Organization (PAHO), which allowed a differential value based on medical evidence to be assigned to the variables that coincide with those proposed in these guidelines. Table 2 shows the variables to which interpolation was applied and the ranges of the evaluative weights, allowing the categorical variables to be transformed into numerical ones according to the methodological proposal made by [22].
Variable | Certainty in Evidence: Manifestations in dengue | Certainty in Evidence: Manifestations in chikungunya | Certainty in Evidence: Manifestations in Zika | Quantitative Weight Assignment
Myalgia | * | Moderate | | 0.51–0.75
Headache | Low | * | * | 0.26–0.50
Rash | * | Moderate | Moderate | 0.51–0.75
Threw up | Moderate | * | * | 0.51–0.75
Abdominal_pain | Moderate | * | * | 0.51–0.75
Mucosal_hemorrhage | Moderate | Low | * | 0.51–0.75, 0.26–0.50
Arthralgia | * | High | * | 0.76–1
Diarrhea | Low | * | * | 0.26–0.50
Hepatomegaly | Low | * | * | 0.26–0.50
Retroocular pain | Low | * | * | 0.26–0.50
platelet_drop | High | * | * | 0.76–1
Note: * A score of 0.0 to 0.25 was given when the label "yes" was present, signifying an extremely low level of confidence in the evidence, as all outcomes are deemed uncertain according to the GRADE system.
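To make the weight assignment concrete, the following sketch maps a certainty level to the weight range listed in the table and returns a single number for a finding that is present. Two choices are assumptions and not part of the response: the midpoint of the range stands in for the interpolation rule, which is not specified here, and an absent finding is encoded as 0. The helper `weight_for` and the dictionary names are hypothetical, and only three of the table's rows are reproduced.

```python
# Weight ranges per certainty level, as listed in the table and its note.
WEIGHT_RANGES = {
    "High": (0.76, 1.00),
    "Moderate": (0.51, 0.75),
    "Low": (0.26, 0.50),
    "*": (0.00, 0.25),  # very low certainty (the table's asterisk)
}

# Certainty of evidence per variable and disease, copied from three table rows.
CERTAINTY = {
    "Arthralgia": {"dengue": "*", "chikungunya": "High", "zika": "*"},
    "Headache": {"dengue": "Low", "chikungunya": "*", "zika": "*"},
    "platelet_drop": {"dengue": "High", "chikungunya": "*", "zika": "*"},
}


def weight_for(variable, disease, present):
    """Return a numeric weight for a yes/no finding (hypothetical helper)."""
    if not present:
        return 0.0  # ASSUMPTION: an absent finding is encoded as 0
    low, high = WEIGHT_RANGES[CERTAINTY[variable][disease]]
    # ASSUMPTION: the midpoint stands in for the unspecified interpolation rule.
    return round((low + high) / 2, 3)


print(weight_for("Arthralgia", "chikungunya", True))
print(weight_for("Headache", "dengue", True))
```

Any other rule that stays inside the listed range (for example, scaling by symptom severity) would fit the same lookup structure; only the last line of `weight_for` would change.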
- The authors mention the use of bootstrapping, but cross-validation is also mentioned in some of the methods. They differ in terms of the resampling. Authors need to clarify which re-sampling/validation was used.
Answer
The study notes that bootstrapping technique is used to balance the dataset and cross-validation technique is used for training.
1.Introduction
….
This study aimed to develop predictive models that can be easily interpreted by the medical community. It proposes the use of decision trees and random forests to predict dengue, Zika, and chikungunya from clinical data, including signs, symptoms, and laboratory results. In addition, bootstrapping, which is a technique for balancing the classes of the target variable and is based on the central limit theorem, is used, as well as a weight assignment methodology based on PAHO 2022 [22] and cross-validation to obtain a more balanced performance of the model results. The rest of the article is organised as follows: Section 2 presents the context of the study, Section 3 describes the methodology used, Section 4 presents the results obtained and discussed, and finally, the conclusions are presented in Section 5.
3.3. Model Training
In this phase, the data processed without applying the aforementioned weight assignment methodology were used. The training and evaluation of the models were carried out under the same conditions as in the previous step using cross-validation, DT, and RF techniques.
However, the respective clarification is made in the following sections of the methodology.
Before
3.2.1. Modeling with ML techniques
In this stage, the models are trained using the machine learning techniques Decision Tree (DT) and Random Forest (RF). These techniques were selected because of their good performance in previous experiments using similar data [22]. In addition, the decision tree has the advantage of being more interpretable in the results, which facilitates an understanding of how decisions are made.
Now
3.2.1. Modeling with ML techniques
At this stage, the models were trained using the Decision Tree (DT) and Random Forest (RF) machine learning techniques. These techniques were selected due to their good performance in previous experiments with similar data [22]. In addition, the decision tree offers the advantage of being more interpretable in its results, thus facilitating the understanding of the criteria used for decision making. The training was carried out using 10-fold cross-validation (k = 10), with the aim of obtaining more reliable results and minimizing the biases that usually arise with conventional techniques such as a single 70-30 train-test split.
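The difference between k-fold cross-validation and a single 70-30 split can be seen in a short sketch of how the folds are formed. It is written in plain Python and is not the authors' code; the function name `k_fold_indices` is hypothetical, the folds are not stratified by class, and the sample count of 267 is taken from the balanced dataset described in this response.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# 267 is the size of the balanced dataset mentioned in the response.
splits = list(k_fold_indices(267, k=10))
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)
```

Each of the ten models is trained on nine folds and evaluated on the remaining one, so every record contributes to the evaluation exactly once, whereas a 70-30 split evaluates on a single fixed 30% of the data.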
- The result was weakly presented. To get a more meaningful result, at least a confusion matrix showing the performance of each class is important. Then we see the performance of the model for the Chikungunya class. What about the training curve?
Before
Figure 3. Comparison of DT and RF quality metrics.
On the other hand, table 3 shows the results of the models obtained by working with the dataset without applying the methodology proposed by [22] but balanced using the bootstrapping technique.
….
Similarly, Figure 5 summarises the behaviour of the 10 models created using the cross-validation technique in the two experiments. It shows the behaviour of the accuracy and error in each model, highlighting that applying the methodology proposed by [22] allows obtaining superior quality metrics in the model.
Figure 5. Comparison of the precision and error of DT models generated by cross-validation.
While it is true that the DT model performs less well compared to RF in both experiments, it is important to mention that it may be more interpretable for the medical community when supporting early decision-making. Figure 6 shows the model tree, where the rules generated by the model to perform the respective classifications are shown.
Figure 6. Tree diagram of the best-performing DT model.
Now
Figure 3. Comparison of DT and RF quality metrics.
The confusion matrix in Figure 4 reveals a high overall performance in classifying Chikungunya, Dengue, and Zika with the DT model. The model classifies Chikungunya with an accuracy of 88.5%, presenting a very low false negative rate (0.5%) and no false positives. Dengue has an accuracy of 86.7%, with false negatives (1.7%) and false positives (0.6%). Although Zika shows a slightly lower accuracy of 81.9%, it is still high, with a false negative rate of 4.1% and false positives of 3.0%. Overall, the model is efficient, although it exhibits slight confusion in classifying Zika.
Figure 4. Average confusion matrix DT.
The confusion matrix shown in Figure 5 reveals that the Random Forest technique offers robust performance in classifying Chikungunya, Dengue, and Zika. The model achieves an accuracy of 88.5% for Chikungunya, with no false positives and a very low false negative rate of 0.5%. For Dengue, the accuracy is 88.0%, with a false negative rate of 1.0% and no false positives. For Zika, the accuracy is 87.3%, with a false negative rate of 0.4% and false positives of 1.4%. Overall, the model demonstrates high classification ability, although it exhibits slight confusion in identifying Zika compared to the other two classes.
Figure 5. Average confusion matrix RF.
Similarly, Figure 5 summarises the behaviour of the 10 models created using the cross-validation technique in the two experiments. It shows the behaviour of the accuracy and error in each model, highlighting that applying the methodology proposed by Arrubla et al [22] allows obtaining superior quality metrics in the model.
Figure 5. Comparison of the precision and error of DT models generated by cross-validation.
The confusion matrix in Figure 6 for the decision tree technique indicates a mixed performance in classifying Chikungunya, Dengue, and Zika. The model classifies Chikungunya with an accuracy of 88.2%, with no false positives and a false negative rate of 0.8%. However, for Dengue, the accuracy is 76.9%, with a remarkably high false negative rate of 7.3% and false positives of 4.8%. Zika shows an accuracy of 72.0%, with a false negative rate of 12.3% and false positives of 4.7%. Overall, although the decision tree model shows good accuracy for Chikungunya, it faces difficulties in classifying Dengue and Zika, evidencing higher confusion between the classes.
Figure 6. Average confusion matrix DT model.
The confusion matrix in Figure 7 for the Random Forest model shows better performance in classifying Chikungunya, Dengue, and Zika. The model achieves an accuracy of 88.2% for Chikungunya, with a false positive rate of 0.0% and a false negative rate of 0.8%. For Dengue, the accuracy is 83.0%, with false negatives of 3.7% and false positives of 2.3%. For Zika, the accuracy is 81.4%, with a false negative rate of 5.3% and false positives of 2.4%. Overall, the model demonstrates better classification ability with a good balance between classes, although it shows slight confusion in identifying Zika and Dengue.
Figure 7. Average confusion matrix RF model.
The application of the methodology proposed by Arrubla et al [22] significantly improves the performance of Random Forest and Decision Tree classification models. For Random Forest, the use of the methodology results in a slight improvement in accuracy, especially in the reduction of false positives and negatives, with accuracy of 88.5% for Chikungunya, 88.0% for Dengue and 87.3% for Zika. In comparison, without the methodology, the accuracy is 88.2% for Chikungunya, 83.0% for Dengue and 81.4% for Zika, showing a lower discrimination capacity between classes.
For decision trees, the proposed methodology also has a positive impact, improving the overall accuracy to 88.5% for Chikungunya, 86.7% for Dengue and 81.9% for Zika. Without the methodology, the accuracies were 88.2% for Chikungunya, 76.9% for Dengue and 72.0% for Zika, indicating a notable reduction in classification capacity, especially for Dengue and Zika.
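The per-class figures discussed above (correct classifications, false negatives, and false positives for each disease) are all read off a confusion matrix. The sketch below shows one way to derive them from true and predicted labels. The labels are made up for illustration and are not the study's predictions, and the function names are hypothetical.

```python
from collections import Counter

CLASSES = ["chikungunya", "dengue", "zika"]


def confusion_matrix(y_true, y_pred, classes=CLASSES):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def per_class_errors(matrix, classes=CLASSES):
    """Return {class: (true_pos, false_neg, false_pos)} from a confusion matrix."""
    result = {}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # true class i, predicted as another
        fp = sum(row[i] for row in matrix) - tp  # another class, predicted as i
        result[name] = (tp, fn, fp)
    return result


# Made-up labels for illustration only; these are not the study's predictions.
y_true = ["chikungunya", "chikungunya", "dengue", "dengue", "dengue", "zika", "zika", "zika"]
y_pred = ["chikungunya", "chikungunya", "dengue", "zika", "dengue", "zika", "dengue", "zika"]
matrix = confusion_matrix(y_true, y_pred)
errors = per_class_errors(matrix)
print(matrix)
print(errors)
```

Averaging such matrices over the ten cross-validation folds would give an average confusion matrix of the kind the figures report, although the exact averaging used in the manuscript is not described in this response.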
While it is true that the DT model performs less well compared to RF in both experiments, it is important to mention that it may be more interpretable for the medical community when supporting early decision-making. Figure 10 shows the model tree, where the rules generated by the model to perform the respective classifications are shown.
Figure 10. Tree diagram of the best-performing DT model.
Figure 10 illustrates that headache is the most relevant variable for classifying dengue, while myalgia is key for identifying chikungunya, aligning with PAHO's 2022 guidelines on differential symptoms for these diseases. The decision tree classifies cases between Chikungunya, Dengue, and Zika using symptoms such as headache, myalgia, days of symptoms, IgM, and platelet count. The root node shows that a mild or absent headache is linked to Chikungunya, whereas a severe headache strongly indicates Dengue. As the tree progresses, additional symptoms like myalgia and symptom duration further refine the classification, with terminal nodes offering pure and definitive predictions for each disease, underscoring the clinical utility of these symptoms.
- The discussion in the manuscript is weak. Many of the results were not compared to previous studies, and what is the salient point compared to other studies? As well as the potential application or use of the model at a clinical level and where the limitations of the study lie.
Before
Figure 10. Comparison of the precision and error of RF models generated by cross-validation.
The results of this research support the feasibility of a model for early and differential prediction of dengue, Zika, and chikungunya based on signs, symptoms, and clinical laboratory results. This model showed high performance with an accuracy of 98.8%, precision of 99.6%, specificity of 99.8% and F1-Score of 99.5%. In addition, its ability to accurately recognise each disease is remarkable, achieving 99.7% for chikungunya, 99.1% for dengue, and 98.8% for Zika.
The use of cross-validation in this study played a crucial role in providing a more accurate estimate of model performance. By employing multiple partitions of the dataset for training and validation, this technique reduces the risk of overfitting and improves the ability of the model to generalise to unseen data. In addition, using cross-validation, more stable and reliable metrics of model performance were obtained, allowing for a more accurate assessment of the model's ability to predict these diseases.
Bootstrapping was used to balance the classes in model construction. This technique allowed us to work with the unbalanced dataset that made up the dataset, generating multiple samples of equal size to the original dataset and randomly selecting observations with replacement. By applying this technique, we were able to obtain an adequate representation of the training samples, which helped improve the model's ability to learn, in a balanced way, the characteristics of each disease.
Finally, this study represents an important contribution to the differential prediction of dengue, Zika, and chikungunya diseases through the use of machine learning techniques and the use of information from signs, symptoms, and laboratory variables. This work provides a reference point for other researchers because, according to [20] [21], no similar work has been found in the literature because of the lack of a dataset containing records of these viruses.
In addition, the predictive model developed could be of great use to the medical community in places where there is co-circulation of dengue, Zika, and chikungunya, as early classification becomes a challenge due to the similarity of symptoms at disease onset. This tool could help health professionals make more informed and rapid decisions regarding the management of patients with these diseases, which could result in better patient care and outcomes.
Now
Figure 13. Comparison of the precision and error of RF models generated by cross-validation.
On the other hand, the scarcity of specific research on the classification of dengue, Zika, and chikungunya limits direct comparisons of results. However, a recent study [42] addresses this challenge by proposing a classifier for seven similar diseases; its dataset includes 137 records of Zika, 127 of dengue, and 140 of chikungunya, in addition to other diseases such as malaria and yellow fever, totaling 1,500 records. The study compares various algorithms and presents a hybrid technique called HML, which combines machine learning techniques with reinforcement learning based on recurrent neural networks (RNN). The reported results are high, with an accuracy of 98.7%, precision of 98.7%, recall of 98.4%, and an F1-score of 99.10%.
Despite these promising results, the research does not include confusion matrices that allow evaluating the reliability of the classification for each disease individually. When comparing these results with those of our research, it is observed that our models outperform the proposed quality metrics, especially in terms of accuracy, precision, recall, and F1-score. Furthermore, our research provides a detailed analysis at the confusion matrix level for each class, allowing a more accurate assessment of the classification capacity of each disease. This highlights not only the effectiveness of our models in differentiating between dengue, zika, and chikungunya, but also the advantage of having detailed metrics to assess and improve classification quality.
The results of this research support the feasibility of a model for early and differential prediction of dengue, Zika, and chikungunya based on signs, symptoms, and clinical laboratory results. This model showed high performance with an accuracy of 98.8%, precision of 99.6%, specificity of 99.8% and F1-Score of 99.5%. In addition, its ability to accurately recognise each disease is remarkable, achieving 99.7% for chikungunya, 99.1% for dengue, and 98.8% for Zika.
The use of cross-validation in this study played a crucial role in providing a more accurate estimate of model performance. By employing multiple partitions of the dataset for training and validation, this technique reduces the risk of overfitting and improves the ability of the model to generalise to unseen data. In addition, using cross-validation, more stable and reliable metrics of model performance were obtained, allowing for a more accurate assessment of the model's ability to predict these diseases.
Bootstrapping was used to balance the classes in model construction. This technique allowed us to work with the unbalanced dataset that made up the dataset, generating multiple samples of equal size to the original dataset and randomly selecting observations with replacement. By applying this technique, we were able to obtain an adequate representation of the training samples, which helped improve the model's ability to learn, in a balanced way, the characteristics of each disease.
Finally, this study represents a significant advance in the differential prediction of dengue, zika, and chikungunya using machine learning techniques and the analysis of signs, symptoms, and laboratory variables. The developed model offers robust diagnostic support, based on the criteria established in the PAHO evidence synthesis (2022), which clearly distinguishes the signs and symptoms of each disease for diagnosis and treatment. With high performance, this model not only demonstrates remarkable accuracy, but also has great potential for implementation in clinical settings. Its integration into clinical practice would provide fundamental support to health professionals, facilitating early and accurate diagnoses, and favoring timely decision-making that improves patient outcomes.
Moreover, the predictive model developed in this study could be particularly beneficial in regions where dengue, Zika, and chikungunya co-circulate, as early differentiation between these diseases is challenging due to their similar initial symptoms. This tool could empower healthcare providers to make more informed and rapid decisions about patient management, ultimately leading to better care and outcomes.
Although this work presents some limitations regarding the amount of data, especially for chikungunya, which were addressed by specialized computational techniques, it is recognized that the reliability of the model could be improved with a larger volume of data. Despite these limitations, this study establishes a crucial benchmark for future research, since, according to [20] [21], no comparable studies have been identified in the literature, mainly due to the scarcity of data sets that include records of these viruses.
- The manuscript does not contain any information on the requirements for medical or ethical approval for the use of clinical data.
Before
3.1.2. Dataset creation
The creation of the dataset arose from the need for a dataset linking signs, symptoms, and laboratory results of dengue, Zika, and chikungunya, as one did not previously exist. To this end, data were collected in collaboration with the Las Peñitas Clinic in Sincelejo, Colombia. Historical records for chikungunya correspond to 2015, Zika to 2016, and dengue to 2020. This dataset consisted of 151 rows and 24 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
Now
3.1.2. Dataset creation
The creation of the dataset for these three diseases became necessary due to the lack of a public dataset that integrated information on signs, symptoms, and clinical laboratory results. Literature reviews conducted by [20] [21] show limitations in the studies on these three diseases, which is attributed to the lack of adequate datasets. The dataset used in this study was obtained from fully anonymized records, thus ensuring patient privacy and confidentiality. Data collection was carried out in collaboration with the Las Peñitas Clinic in Sincelejo, Colombia, complying with local ethical and legal requirements. It should be noted that the project was reviewed and approved by the clinic's ethics committee, ensuring compliance with the relevant regulations for the use of clinical data in research. Historical records cover chikungunya cases from 2015, zika from 2016, and dengue from 2020. The resulting dataset consists of 151 records and 28 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript entitled “Differential classification of dengue, Zika, and chikungunya using Machine Learning - Random Forest and Decision Tree techniques” applied machine learning approaches to develop symptom-orientated models to differentiate dengue, ZIKA and chikungunya. This is a well-written article with clear methodology and findings. Below are some comments.
Major
1. The machine learning approach seems to have predicted the three diseases very well. I was wondering how you can apply the findings in the medical community. Can you develop criteria for clinicians to diagnose the diseases based only on certain symptoms? If not, what is the benefit of these findings?
2. The dengue, Zika, and chikungunya data used by the authors are from different years. Is there any explanation for this approach? Can the small sample size in the PAHO database represent the whole picture of the outbreaks?
Minor issues:
1. Figure 1 should be modified. On the left-hand side, it should not start with “Predictive model” because that is the data processing stage. On the right-hand side, “Model validation” should be added instead of “End.”
2. Lines 147-148. The total is not 100%. Where is the remaining 4% of the data? It may be due to a typo. Should Zika be 59%?
Author Response
Dear Reviewer:
We would like to thank you for the review and valuable feedback you provided on our article titled “Differential Classification of Dengue, Zika, and Chikungunya Using Machine Learning, Random Forests, and Decision Trees.” We have worked diligently to address your comments and suggestions, and we are pleased to inform you that we have completed the requested corrections.
Below we have summarized the major edits made in response to your comments.
Reviewer 2.
The manuscript entitled “Differential classification of dengue, Zika, and chikungunya using Machine Learning - Random Forest and Decision Tree techniques” applied machine learning approaches to develop symptom-orientated models to differentiate dengue, ZIKA and chikungunya. This is a well-written article with clear methodology and findings. Below are some comments.
Major
- The machine learning approach seems to have predicted the three diseases very well. I was wondering how you can apply the findings in the medical community. Can you develop criteria for clinicians to diagnose the diseases based only on certain symptoms? If not, what is the benefit of these findings?
Before
Finally, this study represents an important contribution to the differential prediction of dengue, Zika, and chikungunya diseases through the use of machine learning techniques and the use of information from signs, symptoms, and laboratory variables. This work provides a reference point for other researchers because, according to [20] [21], no similar work has been found in the literature because of the lack of a dataset containing records of these viruses.
In addition, the predictive model developed could be of great use to the medical community in places where there is co-circulation of dengue, Zika, and chikungunya, as early classification becomes a challenge due to the similarity of symptoms at disease onset. This tool could help health professionals make more informed and rapid decisions regarding the management of patients with these diseases, which could result in better patient care and outcomes.
Now
Finally, this study represents a significant advance in the differential prediction of dengue, zika, and chikungunya using machine learning techniques and the analysis of signs, symptoms, and laboratory variables. The developed model offers robust diagnostic support, based on the criteria established in the PAHO evidence synthesis (2022), which clearly distinguishes the signs and symptoms of each disease for diagnosis and treatment. With high performance, this model not only demonstrates remarkable accuracy, but also has great potential for implementation in clinical settings. Its integration into clinical practice would provide fundamental support to health professionals, facilitating early and accurate diagnoses, and favoring timely decision-making that improves patient outcomes.
Moreover, the predictive model developed in this study could be particularly beneficial in regions where dengue, Zika, and chikungunya co-circulate, as early differentiation between these diseases is challenging due to their similar initial symptoms. This tool could empower healthcare providers to make more informed and rapid decisions about patient management, ultimately leading to better care and outcomes.
Although this work presents some limitations regarding the amount of data, especially for chikungunya, which were addressed by specialized computational techniques, it is recognized that the reliability of the model could be improved with a larger volume of data. Despite these limitations, this study establishes a crucial benchmark for future research, since, according to [20] [21], no comparable studies have been identified in the literature, mainly due to the scarcity of data sets that include records of these viruses.
- The dengue, Zika, and chikungunya data used by the authors are from different years. Is there any explanation for this approach?
Answer:
The construction of the dataset was carried out due to the lack of public datasets and studies that include information on these three diseases. This situation is detailed in the following section.
Before
3.1.2. Dataset creation
The creation of the dataset arose from the need for a dataset linking signs, symptoms, and laboratory results of dengue, Zika, and chikungunya, as one did not previously exist. To this end, data were collected in collaboration with the Las Peñitas Clinic in Sincelejo, Colombia. Historical records for chikungunya correspond to 2015, Zika to 2016, and dengue to 2020. This dataset consisted of 151 rows and 24 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
Now
3.1.2. Dataset creation
The creation of the dataset for these three diseases became necessary due to the lack of a public dataset that integrated information on signs, symptoms, and clinical laboratory results. Literature reviews conducted by [20] [21] show limitations in the studies on these three diseases, which is attributed to the lack of adequate datasets. The dataset used in this study was obtained from fully anonymized records, thus ensuring patient privacy and confidentiality. Data collection was carried out in collaboration with the Clínica Las Peñitas in Sincelejo, Colombia, complying with local ethical and legal requirements. It should be noted that the project was reviewed and approved by the clinic's ethics committee, ensuring compliance with the relevant regulations for the use of clinical data in research. Historical records cover chikungunya cases from 2015, zika from 2016, and dengue from 2020. The resulting dataset consists of 151 records and 28 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
- Can the small sample size in the PAHO database represent the whole picture of the outbreaks?
Answer
This research is based on the guidelines established by PAHO in 2022, which classify the differential signs and symptoms of the three diseases according to the degree of evidence. In this study, the categorical variables of the dataset were transformed into numerical variables using the methodology proposed by [22], which we recently published. Although the main limitation is the small size of the dataset, specialized techniques are applied to overcome these challenges. This methodology can be applied to any dataset containing records of these three diseases and their respective signs and symptoms according to the PAHO guideline.
Minor issues:
- Figure 1 should be modified. On the left-hand side, it should not start with “Predictive model” because that is the data processing stage. On the right-hand side, “Model validation” should be added instead of “End.”
Before
Now
- Lines 147-148. The total is not 100%. Where is the remaining 4% of the data? It may be due to a typo. Should Zika be 59%?
Before
The amount of data is limited because of the difficulty in finding historical records for Zika and chikungunya, as their epidemiological cycles occurred in 2015 and 2016 [36-37], respectively. During these epidemic peaks, data collection was incomplete, and it was not possible to use records published by the Colombian National Institute of Health. The distribution of the dataset is as follows: Zika accounted for 55% of the data, dengue for 35%, and chikungunya for 6%.
Now
The amount of data is limited because of the difficulty in finding historical records for Zika and chikungunya, as their epidemiological cycles occurred in 2015 and 2016 [36-37], respectively. During these epidemic peaks, data collection was incomplete, and it was not possible to use records published by the Colombian National Institute of Health. The distribution of the dataset is as follows: 89 cases of Zika (59%), 52 of Dengue (35%), and 9 of Chikungunya (6%).
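The percentages in the corrected sentence can be checked directly from the case counts. The short Python sketch below does only that arithmetic; the variable names are illustrative and not taken from the manuscript.

```python
# Case counts as reported in the revised text.
counts = {"Zika": 89, "Dengue": 52, "Chikungunya": 9}
total = sum(counts.values())

# Share of each class as a percentage of all records.
shares = {name: 100 * n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} records ({share:.1f}%)")
print(f"Total: {total} records")
```

With these counts the rounded shares add up to 100%, which resolves the 4% gap the reviewer pointed out, and the total is 150 records.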
Reviewer 3 Report
Comments and Suggestions for Authors
The authors are encouraged to address the following points to strengthen their publication:
1- According to the size of the database, the authors are encouraged to address the following questions:
a) Why do you use such a small database?
b) Why use only data from the peaks of dengue and chikungunya? Are there no case reports in subsequent years? These could help to balance the dataset with real data rather than with virtual data.
2- Regarding the distribution of the database that the authors show in Figure 2, this referee considers that it would be better to give the number of cases for each disease in addition to the percentage; the number of cases would be more informative for readers.
3- Once the bootstrapping has been performed and the database has been balanced, how many compounds remained in each group? Please add the number of compounds for each group instead of the percentage, or give both; most important is the number of cases for each group.
4- Could the authors compare the bootstrapping oversampling strategy with the undersampling strategy used in other papers like for example in the following paper:
https://doi.org/10.1007/s10822-019-00255-3
5- In line 158, why does it say that the database has 15 rows and 24 variables, when in section 3.1.2. Dataset creation the authors say that “This dataset consisted of 151 rows and 24 variables, including signs, symptoms, and laboratory results, as detailed in Table 1”, and then in section 3.1.3. Data cleaning they say that several variables were deleted? Line 158 again says that the database has 15 rows and 24 columns (variables); what happened to the eliminated variables?
6- Figure 6 must show the number of compounds in each node/leaf of the tree to permit a correct evaluation of its quality.
7- The authors should clearly explain how they evaluate the predictive power of the model since they do not use an external prediction series. It would be very interesting if they collected additional information on new cases (from another year/country, etc.) and evaluated them in the models to validate the results they propose.
Other points
1- In line 112 (and in several other parts of the manuscript) the authors say “PAHO 2022 guidelines proposed by [22]”; it would be better to refer to the surname of the first author of that paper instead of the reference number.
2- Some references give the full journal name and others the abbreviation; these must be homogenized.
Author Response
Dear Reviewer:
We would like to thank you for the review and valuable feedback you provided on our article titled “Differential Classification of Dengue, Zika, and Chikungunya Using Machine Learning, Random Forests, and Decision Trees.” We have worked diligently to address your comments and suggestions, and we are pleased to inform you that we have completed the requested corrections.
Below we have summarized the major edits made in response to your comments.
Reviewer 3.
- According to the size of the database, the authors are encouraged to address the following questions:
- a) Why do you use such a small database?
Answer
The construction of the dataset was carried out due to the lack of public datasets and studies that include information on these three diseases. This situation is detailed in the following section.
Before
3.1.2. Dataset creation
The creation of the dataset arose from the need for a dataset linking signs, symptoms, and laboratory results of dengue, Zika, and chikungunya, as one did not previously exist. To this end, data were collected in collaboration with the Las Peñitas Clinic in Sincelejo, Colombia. Historical records for chikungunya correspond to 2015, Zika to 2016, and dengue to 2020. This dataset consisted of 151 rows and 24 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
Now
3.1.2. Dataset creation
The creation of the dataset for these three diseases became necessary due to the lack of a public dataset that integrated information on signs, symptoms, and clinical laboratory results. Literature reviews conducted by [20] [21] show limitations in the studies on these three diseases, which is attributed to the lack of adequate datasets. The dataset used in this study was obtained from fully anonymized records, thus ensuring patient privacy and confidentiality. Data collection was carried out in collaboration with the Clínica Las Peñitas in Sincelejo, Colombia, complying with local ethical and legal requirements. It should be noted that the project was reviewed and approved by the clinic's ethics committee, ensuring compliance with the relevant regulations for the use of clinical data in research. Historical records cover chikungunya cases from 2015, zika from 2016, and dengue from 2020. The resulting dataset consists of 151 records and 28 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
- b) Why use only data from the peaks of dengue and chikungunya? Are there no case reports in subsequent years? These could help to balance the dataset with real data rather than with virtual data.
Answer
Given the lack of publicly available data, as outlined in the previous point, all accessible information was collected from the clinic that collaborated in this research. Historical data was sought from the beginning of the outbreak in Colombia in 2015, but only the data set used in this study was obtained.
- Regarding the distribution of the database that the authors show in Figure 2, this referee considers that it would be better to give the number of cases for each disease in addition to the percentage; the number of cases would be more informative for readers.
Answer
The number of records per disease is specified in the text.
Before
3.1.4. Target balance through bootstrapping
One of the main drawbacks of the dataset is its small size, with only 151 rows and 24 columns, which classifies it as a small data dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for Chikungunya disease, which represents only 6% of the data. Faced with this challenge and considering that there are several options to try to balance the data, the bootstrapping technique was chosen to generate a balanced dataset for the dengue, zika, and chikungunya labels in the target variable. Bootstrapping was chosen because it is a specialised resampling technique based on the central limit theorem, which generates new samples by taking random samples from the existing data rather than creating new synthetic samples from the data in the minority classes like the adaptive synthetic ADASYN [38], synthetic minority oversampling SMOTE [39-40] or data augmentation [35] [41] algorithms. Figure 2 presents the data-balancing process.
Now
3.1.4. Target balance through bootstrapping
One of the main drawbacks of the dataset is its limited size, with only 150 rows and 24 columns (89 Zika, 52 Dengue, and 9 Chikungunya records), which classifies it as a small dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for Chikungunya disease, which represents only 6% of the data. Faced with this challenge and considering the various options for balancing the data, the bootstrapping technique was chosen. This technique allowed creating new samples that balanced the dengue and chikungunya labels with Zika disease in the target variable. As a result, the new dataset increased from 150 to 267 samples in total. Each disease now has 89 samples, without reducing the Zika records, which previously had 89 records before applying the technique. Bootstrap was chosen because it is a specialized resampling technique based on the central limit theorem, which generates new samples by randomly sampling existing data rather than creating new synthetic samples from the minority class data, such as the adaptive synthetic algorithms ADASYN [38], synthetic minority oversampling SMOTE [39-40] or data augmentation [35] [41]. Figure 2 presents the data balancing process.
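The balancing step described in this paragraph can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `bootstrap_balance`, the fixed seed, and the choice to keep every original record and top up the smaller classes are assumptions, since the response does not state how the resampling was coded.

```python
import random
from collections import Counter

def bootstrap_balance(labels, seed=0):
    """Return row indices that bring every class up to the majority size.

    Assumption: original rows are all kept, and each smaller class is
    topped up by drawing extra rows from that class with replacement.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(rows) for rows in by_class.values())

    balanced = []
    for rows in by_class.values():
        extra = rng.choices(rows, k=target - len(rows))  # with replacement
        balanced.extend(rows + extra)
    return balanced

# Class sizes taken from the revised text.
labels = ["Zika"] * 89 + ["Dengue"] * 52 + ["Chikungunya"] * 9
balanced_idx = bootstrap_balance(labels)
balanced_counts = Counter(labels[i] for i in balanced_idx)
print(balanced_counts, len(balanced_idx))
```

Under these assumptions each class ends with 89 rows and the dataset with 267, matching the figures in the text. The added rows are copies of existing records, so the 89 Chikungunya rows still describe only 9 distinct patients.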
- Once the bootstrapping has been performed and the database has been balanced, how many compounds remained in each group? Please add the number of compounds for each group instead of the percentage, or give both; most important is the number of cases for each group.
Before
3.1.4. Target balance through bootstrapping
One of the main drawbacks of the dataset is its small size, with only 151 rows and 24 columns, which classifies it as a small data dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for Chikungunya disease, which represents only 6% of the data. Faced with this challenge and considering that there are several options to try to balance the data, the bootstrapping technique was chosen to generate a balanced dataset for the dengue, zika, and chikungunya labels in the target variable. Bootstrapping was chosen because it is a specialised resampling technique based on the central limit theorem, which generates new samples by taking random samples from the existing data rather than creating new synthetic samples from the data in the minority classes like the adaptive synthetic ADASYN [38], synthetic minority oversampling SMOTE [39-40] or data augmentation [35] [41] algorithms. Figure 2 presents the data-balancing process.
Now
3.1.4. Target balance through bootstrapping
One of the main drawbacks of the dataset is its limited size, with only 150 rows and 24 columns (89 Zika, 52 Dengue, and 9 Chikungunya records), which classifies it as a small dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for Chikungunya disease, which represents only 6% of the data. Faced with this challenge and considering the various options for balancing the data, the bootstrapping technique was chosen. This technique allowed creating new samples that balanced the dengue and chikungunya labels with Zika disease in the target variable. As a result, the new dataset increased from 150 to 267 samples in total. Each disease now has 89 samples, without reducing the Zika records, which previously had 89 records before applying the technique. Bootstrap was chosen because it is a specialized resampling technique based on the central limit theorem, which generates new samples by randomly sampling existing data rather than creating new synthetic samples from the minority class data, such as the adaptive synthetic algorithms ADASYN [38], synthetic minority oversampling SMOTE [39-40] or data augmentation [35] [41]. Figure 2 presents the data balancing process.
- Could the authors compare the bootstrapping oversampling strategy with the undersampling strategy used in other papers like for example in the following paper:
https://doi.org/10.1007/s10822-019-00255-3
Answer:
The bootstrap technique was chosen to increase the number of minority class samples due to its strong statistical basis and ability to faithfully reflect the variability of the original dataset. Based on the Central Limit Theorem, bootstrap generates new samples by sampling with replacement, which preserves the statistical structure without introducing artificial biases. Unlike techniques such as SMOTE, which create synthetic data, bootstrap maintains the original characteristics of the data, minimizing the risk of overfitting and improving model generalization.
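For the comparison the reviewer asks about, the class sizes produced by undersampling can be worked out directly. The sketch below is only an illustration of random undersampling on the reported class counts; the helper name `undersample` and the seed are hypothetical, and it does not reproduce the procedure of the cited paper.

```python
import random
from collections import Counter

def undersample(labels, seed=0):
    """Return row indices with every class cut down to the minority size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(rows) for rows in by_class.values())

    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, target))  # without replacement
    return kept

# Class sizes taken from the revised manuscript text.
labels = ["Zika"] * 89 + ["Dengue"] * 52 + ["Chikungunya"] * 9
kept_idx = undersample(labels)
under_counts = Counter(labels[i] for i in kept_idx)
discarded = len(labels) - len(kept_idx)
print(under_counts, "discarded:", discarded)
```

With 9 Chikungunya records as the minority class, undersampling would leave 9 rows per class (27 in total) and discard 123 of the 150 records, whereas the bootstrap oversampling described by the authors keeps all 150 and adds 117 resampled rows to reach 267.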
- In line 158, why does it say that the database has 15 rows and 24 variables, when in section 3.1.2. Dataset creation the authors say that “This dataset consisted of 151 rows and 24 variables, including signs, symptoms, and laboratory results, as detailed in Table 1”, and then in section 3.1.3. Data cleaning they say that several variables were deleted? Line 158 again says that the database has 15 rows and 24 columns (variables); what happened to the eliminated variables?
Answer
This was a typographical error in the Dataset creation section: the dataset initially has 28 columns, and 24 remain after data cleaning.
Before
3.1.2. Dataset creation
The creation of the dataset arose from the need for a dataset linking signs, symptoms, and laboratory results of dengue, Zika, and chikungunya, as one did not previously exist. To this end, data were collected in collaboration with the Las Peñitas Clinic in Sincelejo, Colombia. Historical records for chikungunya correspond to 2015, Zika to 2016, and dengue to 2020. This dataset consisted of 151 rows and 24 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
Now
3.1.2. Dataset creation
The creation of the dataset for these three diseases became necessary due to the lack of a public dataset that integrated information on signs, symptoms, and clinical laboratory results. Literature reviews conducted by [20] [21] show limitations in the studies on these three diseases, which is attributed to the lack of adequate datasets. The dataset used in this study was obtained from fully anonymized records, thus ensuring patient privacy and confidentiality. Data collection was carried out in collaboration with the Clínica Las Peñitas in Sincelejo, Colombia, complying with local ethical and legal requirements. It should be noted that the project was reviewed and approved by the clinic's ethics committee, ensuring compliance with the relevant regulations for the use of clinical data in research. Historical records cover chikungunya cases from 2015, zika from 2016, and dengue from 2020. The resulting dataset consists of 150 records and 28 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
- Figure 6 must show the number of compounds in each node/leaf of the tree to permit a correct evaluation of its quality.
Before
While it is true that the DT model performs less well compared to RF in both experiments, it is important to mention that it may be more interpretable for the medical community when supporting early decision-making. Figure 6 shows the model tree, where the rules generated by the model to perform the respective classifications are shown.
Figure 6. Tree diagram of the best-performing DT model.
Figure 7 also shows that headache is the most relevant variable in the dataset for the classification of dengue as well as myalgia for chikungunya. These results coincide with those proposed in the guidelines given by PAHO in 2022 [9], which are the differential symptoms of these diseases.
In contrast to the tree diagram, Figure 8 shows the importance of the variables generated by the Random Forest (RF) model, where, as in the DT model, headache is the most important variable for classifying diseases, followed by myalgia and arthralgia. These results are in line with the guidelines given by PAHO, which consider these variables as differential in the three diseases.
Now
While it is true that the DT model performs less well compared to RF in both experiments, it is important to mention that it may be more interpretable for the medical community when supporting early decision-making. Figure 10 shows the model tree, where the rules generated by the model to perform the respective classifications are shown.
Figure 10. Tree diagram of the best-performing DT model.
Figure 10 illustrates that headache is the most relevant variable for classifying dengue, while myalgia is key for identifying chikungunya, aligning with PAHO's 2022 guidelines on differential symptoms for these diseases. The decision tree classifies cases between Chikungunya, Dengue, and Zika using symptoms such as headache, myalgia, days of symptoms, IgM, and platelet count. The root node shows that a mild or absent headache is linked to Chikungunya, whereas a severe headache strongly indicates Dengue. As the tree progresses, additional symptoms like myalgia and symptom duration further refine the classification, with terminal nodes offering pure and definitive predictions for each disease, underscoring the clinical utility of these symptoms.
In contrast to the tree diagram, Figure 11 shows the importance of the variables generated by the Random Forest (RF) model, where, as in the DT model, headache is the most important variable for classifying diseases, followed by myalgia and arthralgia. These results are in line with the guidelines given by PAHO, which consider these variables as differential in the three diseases.
- The authors should clearly explain how they evaluate the predictive power of the model since they do not use an external prediction series. It would be very interesting if they collected additional information on new cases (from another year/country, etc.) and evaluated them in the models to validate the results they propose.
Answer:
The predictive power of each class is assessed by analyzing the confusion matrix. Additionally, a comparison is made with a similar study published recently, for which data are not available.
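As an illustration of how per-class performance is read off a confusion matrix, the following Python sketch derives true positives, false positives, false negatives, precision, and recall for each class. The matrix values are invented for the example and are not the study's results; the function name `per_class_metrics` is likewise hypothetical.

```python
# Hypothetical 3x3 confusion matrix (rows = true class, columns = predicted).
# These counts are invented for illustration; they are NOT the study's results.
classes = ["Chikungunya", "Dengue", "Zika"]
matrix = [
    [27, 0, 0],
    [0, 25, 2],
    [1, 3, 23],
]

def per_class_metrics(matrix, classes):
    """Compute TP, FP, FN, precision and recall for each class."""
    metrics = {}
    for k, name in enumerate(classes):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # true class k, predicted other
        fp = sum(row[k] for row in matrix) - tp  # predicted k, true class other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[name] = {"tp": tp, "fp": fp, "fn": fn,
                         "precision": precision, "recall": recall}
    return metrics

metrics = per_class_metrics(matrix, classes)
correct = sum(matrix[k][k] for k in range(len(classes)))
overall_accuracy = correct / sum(sum(row) for row in matrix)
for name, m in metrics.items():
    print(name, m)
print("overall accuracy:", round(overall_accuracy, 3))
```

Reporting these values per class makes it visible when a high overall accuracy hides weaker performance on a single class, which is the concern raised for the small Chikungunya group.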
Before
Figure 3. Comparison of DT and RF quality metrics.
On the other hand, table 3 shows the results of the models obtained by working with the dataset without applying the methodology proposed by [22] but balanced using the bootstrapping technique.
….
Similarly, Figure 5 summarises the behaviour of the 10 models created using the cross-validation technique in the two experiments. It shows the behaviour of the accuracy and error in each model, highlighting that applying the methodology proposed by [22] allows obtaining superior quality metrics in the model.
Figure 5. Comparison of the precision and error of DT models generated by cross-validation.
While it is true that the DT model performs less well compared to RF in both experiments, it is important to mention that it may be more interpretable for the medical community when supporting early decision-making. Figure 6 shows the model tree, where the rules generated by the model to perform the respective classifications are shown.
Figure 6. Tree diagram of the best-performing DT model.
Now
Figure 3. Comparison of DT and RF quality metrics.
The confusion matrix in Figure 4 reveals a high overall performance in classifying Chikungunya, Dengue, and Zika with the DT model. The model classifies Chikungunya with an accuracy of 88.5%, presenting a very low false negative rate (0.5%) and no false positives. Dengue has an accuracy of 86.7%, with false negatives (1.7%) and false positives (0.6%). Although Zika shows a slightly lower accuracy of 81.9%, it is still high, with a false negative rate of 4.1% and false positives of 3.0%. Overall, the model is efficient, although it exhibits slight confusion in classifying Zika.
Figure 4. Average confusion matrix DT.
The confusion matrix shown in Figure 5 reveals that the Random Forest technique offers robust performance in classifying Chikungunya, Dengue, and Zika. The model achieves an accuracy of 88.5% for Chikungunya, with no false positives and a very low false negative rate of 0.5%. For Dengue, the accuracy is 88.0%, with a false negative rate of 1.0% and no false positives. For Zika, the accuracy is 87.3%, with a false negative rate of 0.4% and false positives of 1.4%. Overall, the model demonstrates high classification ability, although it exhibits slight confusion in identifying Zika compared to the other two classes.
Figure 5. Average confusion matrix RF.
Similarly, Figure 5 summarises the behaviour of the 10 models created using the cross-validation technique in the two experiments. It shows the behaviour of the accuracy and error in each model, highlighting that applying the methodology proposed by [22] allows obtaining superior quality metrics in the model.
Figure 5. Comparison of the precision and error of DT models generated by cross-validation.
The confusion matrix in Figure 6 for the decision tree technique indicates a mixed performance in classifying Chikungunya, Dengue, and Zika. The model classifies Chikungunya with an accuracy of 88.2%, with no false positives and a false negative rate of 0.8%. However, for Dengue, the accuracy is 76.9%, with a remarkably high false negative rate of 7.3% and false positives of 4.8%. Zika shows an accuracy of 72.0%, with a false negative rate of 12.3% and false positives of 4.7%. Overall, although the decision tree model shows good accuracy for Chikungunya, it faces difficulties in classifying Dengue and Zika, evidencing higher confusion between the classes.
Figure 6. Average confusion matrix DT model.
The confusion matrix in Figure 7 for the Random Forest model shows better performance in classifying Chikungunya, Dengue, and Zika. The model achieves an accuracy of 88.2% for Chikungunya, with a false positive rate of 0.0% and a false negative rate of 0.8%. For Dengue, the accuracy is 83.0%, with false negatives of 3.7% and false positives of 2.3%. For Zika, the accuracy is 81.4%, with a false negative rate of 5.3% and false positives of 2.4%. Overall, the model demonstrates better classification ability with a good balance between classes, although it shows slight confusion in identifying Zika and Dengue.
Figure 7. Average confusion matrix RF model.
The application of the methodology proposed by [22] significantly improves the performance of Random Forest and Decision Tree classification models. For Random Forest, the use of the methodology results in a slight improvement in accuracy, especially in the reduction of false positives and negatives, with accuracy of 88.5% for Chikungunya, 88.0% for Dengue and 87.3% for Zika. In comparison, without the methodology, the accuracy is 88.2% for Chikungunya, 83.0% for Dengue and 81.4% for Zika, showing a lower discrimination capacity between classes.
For decision trees, the proposed methodology also has a positive impact, improving the overall accuracy to 88.5% for Chikungunya, 86.7% for Dengue and 81.9% for Zika. Without the methodology, the accuracies were 88.2% for Chikungunya, 76.9% for Dengue and 72.0% for Zika, indicating a notable reduction in classification capacity, especially for Dengue and Zika.
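The size of the improvement can be made explicit by subtracting the per-class accuracies reported in the two paragraphs above. The Python sketch below only restates those reported figures and computes the differences in percentage points; the dictionary layout is an illustrative choice.

```python
# Per-class accuracy (%) as reported in the response, with and without
# the transformation methodology of [22].
reported = {
    "RF": {
        "with": {"Chikungunya": 88.5, "Dengue": 88.0, "Zika": 87.3},
        "without": {"Chikungunya": 88.2, "Dengue": 83.0, "Zika": 81.4},
    },
    "DT": {
        "with": {"Chikungunya": 88.5, "Dengue": 86.7, "Zika": 81.9},
        "without": {"Chikungunya": 88.2, "Dengue": 76.9, "Zika": 72.0},
    },
}

# Gain in percentage points for each model and class.
gains = {
    model: {
        cls: round(values["with"][cls] - values["without"][cls], 1)
        for cls in values["with"]
    }
    for model, values in reported.items()
}

for model, per_class in gains.items():
    print(model, per_class)
```

On these figures the gain is concentrated in Dengue and Zika (5.0 and 5.9 points for RF, 9.8 and 9.9 points for DT), while Chikungunya changes by only 0.3 points in both models.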
While it is true that the DT model performs less well compared to RF in both experiments, it is important to mention that it may be more interpretable for the medical community when supporting early decision-making. Figure 10 shows the model tree, where the rules generated by the model to perform the respective classifications are shown.
Figure 10. Tree diagram of the best-performing DT model.
Figure 10 illustrates that headache is the most relevant variable for classifying dengue, while myalgia is key for identifying chikungunya, aligning with PAHO's 2022 guidelines on differential symptoms for these diseases. The decision tree classifies cases between Chikungunya, Dengue, and Zika using symptoms such as headache, myalgia, days of symptoms, IgM, and platelet count. The root node shows that a mild or absent headache is linked to Chikungunya, whereas a severe headache strongly indicates Dengue. As the tree progresses, additional symptoms like myalgia and symptom duration further refine the classification, with terminal nodes offering pure and definitive predictions for each disease, underscoring the clinical utility of these symptoms.
In contrast to the tree diagram, Figure 11 shows the importance of the variables generated by the Random Forest (RF) model, where, as in the DT model, headache is the most important variable for classifying diseases, followed by myalgia and arthralgia. These results are in line with the guidelines given by PAHO, which consider these variables as differential in the three diseases.
Regarding obtaining new data, the research indicates that it was not possible to find information in public databases or in previous research, which led to the need to create a new data set. This situation is addressed in the following section.
Before
3.1.2. Dataset creation
The creation of the dataset arose from the need for a dataset linking signs, symptoms, and laboratory results of dengue, Zika, and chikungunya, as one did not previously exist. To this end, data were collected in collaboration with the Las Peñitas Clinic in Sincelejo, Colombia. Historical records for chikungunya correspond to 2015, Zika to 2016, and dengue to 2020. This dataset consisted of 151 rows and 24 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
Now
3.1.2. Dataset creation
The creation of the dataset for these three diseases became necessary due to the lack of a public dataset that integrated information on signs, symptoms, and clinical laboratory results. Literature reviews conducted by [20] [21] show limitations in the studies on these three diseases, which is attributed to the lack of adequate datasets. The dataset used in this study was obtained from fully anonymized records, thus ensuring patient privacy and confidentiality. Data collection was carried out in collaboration with the Clínica Las Peñitas in Sincelejo, Colombia, complying with local ethical and legal requirements. It should be noted that the project was reviewed and approved by the clinic's ethics committee, ensuring compliance with the relevant regulations for the use of clinical data in research. Historical records cover chikungunya cases from 2015, Zika from 2016, and dengue from 2020. The resulting dataset consists of 150 records and 28 variables, including signs, symptoms, and laboratory results, as detailed in Table 1.
Other points
- In Line 112 (and in several other parts of the manuscript) the authors say "PAHO 2022 guidelines proposed by [22]". It would be better to refer to the surname of the first author of that paper instead of the reference number.
Before
- Materials and Methods
This article presents an experiment aimed at differentially predicting dengue, Zika, and chikungunya. It compares the results of applying the weighting methodology based on scientific evidence from the PAHO 2022 guidelines proposed by [22] to create predictive models with different machine learning techniques. The process began with the creation of a fully anonymised dataset in collaboration with Clínica Las Peñitas in the city of Sincelejo, Colombia. This dataset relates signs, symptoms, and clinical science data recorded in 2015 for chikungunya, 2016 for Zika, and 2020 for dengue. Subsequently, bootstrapping replacement resampling was applied to address the imbalance in the classes and size of the dataset. This technique was chosen because of its advantages in this context, allowing for balancing the classes of the target variable based on the central limit theorem without generating synthetic data. Finally, two machine learning models based on decision trees (DT) and random forests (RF) were proposed to compare the results obtained by applying the weight assignment methodology with the data obtained without using this methodology. Figure 1 presents the proposed algorithm for the development of this experiment in more detail.
….
3.2.1. Data transformation was performed according to methodology based on the PAHO Guidelines (2022).
In this phase, data transformation is performed according to the methodology based on the PAHO Guidelines (2022) proposed by [22]. Quantitative values were assigned to each categorical variable in the dataset following the guidelines of the Pan American Health Organization (PAHO), allowing a differential value based on medical evidence to be assigned to variables that match those proposed in these guidelines.
….
Figure 3. Comparison of DT and RF quality metrics.
On the other hand, Table 3 shows the results of the models obtained by working with the dataset without applying the methodology proposed by [22] but balanced using the bootstrapping technique.
…
Figure 4. Comparison of DT and RF quality metrics.
Similarly, Figure 5 summarises the behaviour of the 10 models created using the cross-validation technique in the two experiments. It shows the behaviour of the accuracy and error in each model, highlighting that applying the methodology proposed by [22] allows obtaining superior quality metrics in the model.
Now
- Materials and Methods
This article presents an experiment aimed at differentially predicting dengue, Zika, and chikungunya. It compares the results of applying the weighting methodology based on scientific evidence from the PAHO 2022 guidelines proposed by Arrubla et al [22] to create predictive models with different machine learning techniques. The process began with the creation of a fully anonymised dataset in collaboration with Clínica Las Peñitas in the city of Sincelejo, Colombia. This dataset relates signs, symptoms, and clinical science data recorded in 2015 for chikungunya, 2016 for Zika, and 2020 for dengue. Subsequently, bootstrapping replacement resampling was applied to address the imbalance in the classes and size of the dataset. This technique was chosen because of its advantages in this context, allowing for balancing the classes of the target variable based on the central limit theorem without generating synthetic data. Finally, two machine learning models based on decision trees (DT) and random forests (RF) were proposed to compare the results obtained by applying the weight assignment methodology with the data obtained without
….
3.2.1. Data transformation was performed according to methodology based on the PAHO Guidelines (2022).
In this phase, the data transformation was carried out following the methodology based on the PAHO Guidelines (2022) proposed by Arrubla et al [22]. Quantitative values were assigned to each categorical variable in the data set in accordance with the guidelines of the Pan American Health Organization (PAHO), which allowed assigning a differential value based on medical evidence to the variables that coincide with the proposals of these guidelines. Table 2 shows the variables to which interpolation was applied and the ranges of the evaluative weights, allowing the categorical variables to be transformed into numerical ones according to the methodological proposal made by [22].
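As an illustration of the kind of transformation described above, the following minimal sketch replaces categorical levels with numeric weights. The variable names, levels, and weight values are hypothetical placeholders chosen for the example; they are not the values of Table 2 or of the PAHO guidelines, and the function is not the authors' implementation.

```python
# Hypothetical weights for illustration only: NOT the values from Table 2
# or from the PAHO (2022) guidelines.
ILLUSTRATIVE_WEIGHTS = {
    "headache": {"absent": 0.0, "mild": 0.5, "severe": 1.0},
    "myalgia": {"absent": 0.0, "present": 1.0},
}


def encode_record(record, weights):
    """Replace categorical levels with numeric weights.

    Variables without an entry in `weights` (for example laboratory
    values that are already numeric) are passed through unchanged.
    """
    encoded = {}
    for variable, value in record.items():
        if variable in weights:
            encoded[variable] = weights[variable][value]
        else:
            encoded[variable] = value
    return encoded


# Hypothetical patient record
patient = {"headache": "severe", "myalgia": "absent", "platelet_count": 150000}
encoded = encode_record(patient, ILLUSTRATIVE_WEIGHTS)
print(encoded)
```

A lookup table of this kind keeps the mapping explicit, so each weight can be traced back to the guideline entry it is meant to represent.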
Table 2. Transformation of categorical data applying the PAHO Guidelines (2022) proposed by Arrubla et al [22]
….
Figure 5. Average confusion matrix RF model.
On the other hand, Table 4 shows the results of the models obtained by working with the dataset without applying the methodology proposed by Arrubla et al [22] but balanced using the bootstrapping technique.
Table 4. Quality metrics of the models without applying methodology based on PAHO Guidelines (2022).
….
Figure 6. Comparison of DT and RF quality metrics.
Similarly, Figure 7 summarises the behaviour of the 10 models created using the cross-validation technique in the two experiments. It shows the behaviour of the accuracy and error in each model, highlighting that applying the methodology proposed by Arrubla et al [22] allows obtaining superior quality metrics in the model.
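For readers unfamiliar with the procedure, the sketch below shows one common way to obtain 10 train/test partitions by k-fold cross-validation, which is how 10 models can arise from a single dataset. It assumes plain (non-stratified) 10-fold splitting of 267 rows; the manuscript text does not state the exact variant used, and the function name is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle row indices, split them into k near-equal folds, and
    yield one (train, test) pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 267 rows, matching the size of the balanced dataset reported in the text
splits = list(k_fold_indices(267, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(fold_number, len(train), len(test))
```

Each row appears in exactly one test fold, so every model is evaluated on rows it was not trained on. If rows were duplicated before splitting, copies of one record can fall on both sides of a split.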
…
Figure 9. Average confusion matrix DT model.
The application of the methodology proposed by Arrubla et al [22] significantly improves the performance of Random Forest and Decision Tree classification models. For Random Forest, the use of the methodology results in a slight improvement in accuracy, especially in the reduction of false positives and negatives, with accuracy of 88.5% for Chikungunya, 88.0% for Dengue and 87.3% for Zika. In comparison, without the methodology, the accuracy is 88.2% for Chikungunya, 83.0% for Dengue and 81.4% for Zika, showing a lower discrimination capacity between classes.
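A per-class accuracy of this kind can be derived from a confusion matrix. The sketch below assumes a one-vs-rest definition, (TP + TN) / total, with rows as true classes and columns as predicted classes; the manuscript excerpt does not state the formula used, and the counts in the matrix are made up for illustration and are not the study's results.

```python
def per_class_accuracy(matrix, labels):
    """One-vs-rest accuracy per class: (TP + TN) / total.

    `matrix[i][j]` counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    result = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # true class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp  # predicted i, true class differs
        tn = total - tp - fn - fp
        result[label] = (tp + tn) / total
    return result


labels = ["Chikungunya", "Dengue", "Zika"]
# Made-up counts for illustration only
matrix = [
    [8, 1, 0],
    [1, 7, 1],
    [0, 2, 7],
]
acc = per_class_accuracy(matrix, labels)
for label in labels:
    print(f"{label}: {acc[label]:.3f}")
```

Reporting the false positives and false negatives alongside this figure makes it easier to see which pairs of diseases the model confuses.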
2- References in some cases include the full name of the journal and in others the abbreviation; they must be homogenized.
Before
References
- Lambrechts, L.; Scott, T.W.; Gubler, D.J. Consequences of the Expanding Global Distribution of Aedes Albopictus for Dengue Virus Transmission. PLoS neglected tropical diseases 2010, 4, e646.
- Chaw, J.K.; Chaw, S.H.; Quah, C.H.; Sahrani, S.; Ang, M.C.; Zhao, Y.; Ting, T.T. A Predictive Analytics Model Using Machine Learning Algorithms to Estimate the Risk of Shock Development among Dengue Patients. Healthcare Analytics 2024, 5, 100290, doi:10.1016/j.health.2023.100290.
- Arrubla, W.D.J.A. Conceptualización del diagnóstico del Dengue desde una perspectiva de la ingeniería y las nuevas tecnologías. Computer and Electronic Sciences: Theory and Applications 2022, 3, 1–8, doi:10.17981/cesta.03.01.2022.01.
- Codina, J.-R.; Mascini, M.; Dikici, E.; Deo, S.K.; Daunert, S. Accelerating the Screening of Small Peptide Ligands by Combining Peptide-Protein Docking and Machine Learning. International Journal of Molecular Sciences 2023, 24, 12144, doi:10.3390/ijms241512144.
- Gangula, R.; Thirupathi, L.; Parupati, R.; Sreeveda, K.; Gattoju, S. Ensemble Machine Learning Based Prediction of Dengue Disease with Performance and Accuracy Elevation Patterns. Materials Today: Proceedings 2023, 80, 3458–3463, doi:10.1016/j.matpr.2021.07.270.
- Brady, O.J.; Hay, S.I. The Global Expansion of Dengue: How Aedes Aegypti Mosquitoes Enabled the First Pandemic Arbovirus. Annu Rev Entomol 2020, 65, 191–208, doi:10.1146/annurev-ento-011019-024918.
- Sukhralia, S.; Verma, M.; Gopirajan, S.; Dhanaraj, P.S.; Lal, R.; Mehla, N.; Kant, C.R. From Dengue to Zika: The Wide Spread of Mosquito-Borne Arboviruses. Eur J Clin Microbiol Infect Dis 2019, 38, 3–14, doi:10.1007/s10096-018-3375-7.
- Chala, B.; Hamde, F. Emerging and Re-Emerging Vector-Borne Infectious Diseases and the Challenges for Control: A Review. Front. Public Health 2021, 9, doi:10.3389/fpubh.2021.715759.
- PAHO Síntesis de evidencia: Directrices para el diagnóstico y el tratamiento del dengue, el chikunguña y el zika en la Región de las Américas. Revista Panamericana de Salud Pública 2022, 46, 1, doi:10.26633/RPSP.2022.82.
- Paniz-Mondolfi, A.E.; Rodriguez-Morales, A.J.; Blohm, G.; Marquez, M.; Villamil-Gomez, W.E. ChikDenMaZika Syndrome: The Challenge of Diagnosing Arboviral Infections in the Midst of Concurrent Epidemics. Ann Clin Microbiol Antimicrob 2016, 15, 42, s12941-016-0157–x, doi:10.1186/s12941-016-0157-x.
- da Silva Neto, S.R.; Tabosa de Oliveira, T.; Teixiera, I.V.; Medeiros Neto, L.; Souza Sampaio, V.; Lynn, T.; Endo, P.T. Arboviral Disease Record Data - Dengue and Chikungunya, Brazil, 2013-2020. Sci Data 2022, 9, 198, doi:10.1038/s41597-022-01312-7.
- Villamil-Gómez, W.E.; Rodríguez-Morales, A.J.; Uribe-García, A.M.; González-Arismendy, E.; Castellanos, J.E.; Calvo, E.P.; Álvarez-Mon, M.; Musso, D. Zika, Dengue, and Chikungunya Co-Infection in a Pregnant Woman from Colombia. International Journal of Infectious Diseases 2016, 51, 135–138, doi:10.1016/j.ijid.2016.07.017.
- Caicedo, D.M.; Méndez, A.C.; Tovar, J.R.; Osorio, L.; Caicedo, D.M.; Méndez, A.C.; Tovar, J.R.; Osorio, L. Desarrollo de algoritmos clínicos para el diagnóstico del dengue en Colombia. Biomédica 2019, 39, 170–185, doi:10.7705/biomedica.v39i2.3990.
- Dharap, P.; Raimbault, S. Performance Evaluation of Machine Learning-Based Infectious Screening Flags on the HORIBA Medical Yumizen H550 Haematology Analyzer for Vivax Malaria and Dengue Fever. Malar. J. 2020, 19, doi:10.1186/s12936-020-03502-3.
- Tchapet Njafa, J.-P.; Nana Engo, S.G. Quantum Associative Memory with Linear and Non-Linear Algorithms for the Diagnosis of Some Tropical Diseases. Neural Netw 2018, 97, 1–10, doi:10.1016/j.neunet.2017.09.002.
- Rodriguez-Quijada, C.; Gomez-Marquez, J.; Hamad-Schifferli, K. Repurposing Old Antibodies for New Diseases by Exploiting Cross-Reactivity and Multicolored Nanoparticles. ACS Nano 2020, 14, 6626–6635, doi:10.1021/acsnano.9b09049.
- Tan, K.W.; Tan, B.; Thein, T.L.; Leo, Y.-S.; Lye, D.C.; Dickens, B.L.; Wong, J.G.X.; Cook, A.R. Dynamic Dengue Haemorrhagic Fever Calculators as Clinical Decision Support Tools in Adult Dengue. Trans R Soc Trop Med Hyg 2020, 114, 7–15, doi:10.1093/trstmh/trz099.
- Veiga, R.V.; Schuler-Faccini, L.; França, G.V.; Andrade, R.F.; Teixeira, M.G.; Costa, L.C.; Paixão, E.S.; Costa, M. da C.N.; Barreto, M.L.; Oliveira, J.F.; et al. Classification Algorithm for Congenital Zika Syndrome: Characterizations, Diagnosis and Validation. Scientific Reports 2021, 11, 6770.
- Medeiros Neto, L.; Rogerio da Silva Neto, S.; Endo, P.T. A Comparative Analysis of Converters of Tabular Data into Image for the Classification of Arboviruses Using Convolutional Neural Networks. PLoS One 2023, 18, e0295598, doi:10.1371/journal.pone.0295598.
- da Silva Neto, S.R.; Tabosa Oliveira, T.; Teixeira, I.V.; Aguiar de Oliveira, S.B.; Souza Sampaio, V.; Lynn, T.; Endo, P.T. Machine Learning and Deep Learning Techniques to Support Clinical Diagnosis of Arboviral Diseases: A Systematic Review. PLoS Negl Trop Dis 2022, 16, e0010061, doi:10.1371/journal.pntd.0010061.
- Choubey, S.; Barde, S.; Badholia, A. Analysis of Deep Learning Techniques to Investigate and Support Diagnosis of Virus Borne Diseases. In Proceedings of the 3rd International Conference on Electronics and Sustainable Communication Systems, ICESC 2022 - Proceedings; Institute of Electrical and Electronics Engineers Inc., 2022; pp. 921–928.
- Arrubla-Hoyos, W.; Gómez, J.G.; De-La-Hoz-Franco, E. Methodology for the Differential Classification of Dengue and Chikungunya According to the PAHO 2022 Diagnostic Guide. Viruses 2024, 16, 1088. https://doi.org/10.3390/v16071088
- Noorbakhsh-Sabet, N.; Zand, R.; Zhang, Y.; Abedi, V. Artificial Intelligence Transforms the Future of Health Care. The American Journal of Medicine 2019, 132, 795–801, doi:10.1016/j.amjmed.2019.01.017.
- Wiljer, D.; Hakim, Z. Developing an Artificial Intelligence–Enabled Health Care Practice: Rewiring Health Care Professions for Better Care. Journal of Medical Imaging and Radiation Sciences 2019, 50, S8–S14, doi:10.1016/j.jmir.2019.09.010.
- Bharambe, A.; Chandorkar, A.A.; Kalbande, D. A Deep Learning Approach for Dengue Tweet Classification. In Proceedings of the 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA); IEEE: Coimbatore, India, September 2 2021; pp. 1043–1047.
- Khotimah, P.H.; Fachrur Rozie, A.; Nugraheni, E.; Arisal, A.; Suwarningsih, W.; Purwarianti, A. Deep Learning for Dengue Fever Event Detection Using Online News. In Proceedings of the 2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET); IEEE: Tangerang, Indonesia, November 18 2020; pp. 261–266.
- Gambhir, S.; Malik, S.K.; Kumar, Y. The Diagnosis of Dengue Disease: An Evaluation of Three Machine Learning Approaches. In Cog. Analytics: Concepts, Methodologies, Tools, and Applic.; IGI Global, 2020; pp. 1076–1095 ISBN 978-179982461-9.
- Acosta Torres, J.; Oller Meneses, L.; Sokol, N.; Balado Sardiñas, R.; Montero Díaz, D.; Balado Sansón, R.; Sardiñas Arce, M.E. Técnica Árboles de Decisión Aplicada al Método Clínico En El Diagnóstico Del Dengue. Revista Cubana de Pediatría 2016, 88, 441–453.
- Arrubla-Hoyos, W.; Seveiche-Maury, Z.; Saeed, K.; Gómez, J.E.G.; De-La-Hoz-Franco, E. Comparison of Classical Machine Learning and Ensemble Techniques in the Context of Dengue Severity Prediction. In Proceedings of the 2023 IEEE Colombian Caribbean Conference (C3); November 2023; pp. 1–5.
- PAHO/WHO Epidemiological Update - Dengue, Chikungunya and Zika - 10 June 2023 - PAHO/WHO | Pan American Health Organization Available online: https://www.paho.org/en/documents/epidemiological-update-dengue-chikungunya-and-zika-10-june-2023 (accessed on 13 March 2024).
- Zoubir, A.M.; Boashash, B. The Bootstrap and Its Application in Signal Processing. IEEE signal processing magazine 1998, 15, 56–76.
- Zoubir, A.M.; Iskander, D.R. Bootstrap Techniques for Signal Processing; Cambridge University Press, 2004;
- Smith, P.J.; Hoaglin, D.C.; Battaglia, M.P.; Barker, L. Implementation and Applications of Bootstrap Methods for the National Immunization Survey. Statistics in medicine 2003, 22, 2487–2502.
- RAO, J.; WU, C. Bootstrap Inference with Stratified Samples[Technical Summary Report]. 1984.
- Kunz, P.J.; Abid, S. ben; Zoubir, A.M. The Heterogeneity-Intensified and Heterogeneity Ratio-Stratified Bootstrap (HiS- and HeRS-Boot) Oversampling to Boost a Detector Performance. In Proceedings of the 2023 IEEE SENSORS; October 2023; pp. 1–4.
- Acosta-Reyes, J.; Navarro-Lechuga, E.; Martínez-Garcés, J.C. Enfermedad por el virus del Chikungunya: historia y epidemiología. Revista Salud Uninorte 2015, 31, 621–630, doi:10.14482/sun.31.3.7486.
- Pardo-Turriago, R. Zika. Una pandemia en progreso y un reto epidemiológico. Colombian Journal of Anestesiology 2016, 44, 86–88.
- He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Transactions on knowledge and data engineering 2009, 21, 1263–1284.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority over-Sampling Technique. Journal of artificial intelligence research 2002, 16, 321–357.
- Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-Year Anniversary. Journal of artificial intelligence research 2018, 61, 863–905.
- Connor, S.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. Journal of big data 2019, 6, 1–48.
Now
References
- Lambrechts, L.; Scott, T.W.; Gubler, D.J. Consequences of the Expanding Global Distribution of Aedes Albopictus for Dengue Virus Transmission. PLoS neglected tropical diseases 2010, 4, e646.
- Chaw, J.K.; Chaw, S.H.; Quah, C.H.; Sahrani, S.; Ang, M.C.; Zhao, Y.; Ting, T.T. A Predictive Analytics Model Using Machine Learning Algorithms to Estimate the Risk of Shock Development among Dengue Patients. Healthcare Analytics 2024, 5, 100290, doi:10.1016/j.health.2023.100290.
- Arrubla, W.D.J.A. Conceptualización del diagnóstico del Dengue desde una perspectiva de la ingeniería y las nuevas tecnologías. Computer and Electronic Sciences: Theory and Applications 2022, 3, 1–8, doi:10.17981/cesta.03.01.2022.01.
- Codina, J.-R.; Mascini, M.; Dikici, E.; Deo, S.K.; Daunert, S. Accelerating the Screening of Small Peptide Ligands by Combining Peptide-Protein Docking and Machine Learning. International Journal of Molecular Sciences 2023, 24, 12144, doi:10.3390/ijms241512144.
- Gangula, R.; Thirupathi, L.; Parupati, R.; Sreeveda, K.; Gattoju, S. Ensemble Machine Learning Based Prediction of Dengue Disease with Performance and Accuracy Elevation Patterns. Materials Today: Proceedings 2023, 80, 3458–3463, doi:10.1016/j.matpr.2021.07.270.
- Brady, O.J.; Hay, S.I. The Global Expansion of Dengue: How Aedes Aegypti Mosquitoes Enabled the First Pandemic Arbovirus. Annual review of entomology 2020, 65, 191–208, doi:10.1146/annurev-ento-011019-024918.
- Sukhralia, S.; Verma, M.; Gopirajan, S.; Dhanaraj, P.S.; Lal, R.; Mehla, N.; Kant, C.R. From Dengue to Zika: The Wide Spread of Mosquito-Borne Arboviruses. European Journal of Clinical Microbiology & Infectious Diseases 2019, 38, 3–14, doi:10.1007/s10096-018-3375-7.
- Chala, B.; Hamde, F. Emerging and Re-Emerging Vector-Borne Infectious Diseases and the Challenges for Control: A Review. Frontiers in public health 2021, 9, doi:10.3389/fpubh.2021.715759.
- PAHO Síntesis de evidencia: Directrices para el diagnóstico y el tratamiento del dengue, el chikunguña y el zika en la Región de las Américas. Revista Panamericana de Salud Pública 2022, 46, 1, doi:10.26633/RPSP.2022.82.
- Paniz-Mondolfi, A.E.; Rodriguez-Morales, A.J.; Blohm, G.; Marquez, M.; Villamil-Gomez, W.E. ChikDenMaZika Syndrome: The Challenge of Diagnosing Arboviral Infections in the Midst of Concurrent Epidemics. Annals of clinical microbiology and antimicrobials 2016, 15, 42, s12941-016-0157–x, doi:10.1186/s12941-016-0157-x.
- da Silva Neto, S.R.; Tabosa de Oliveira, T.; Teixiera, I.V.; Medeiros Neto, L.; Souza Sampaio, V.; Lynn, T.; Endo, P.T. Arboviral Disease Record Data - Dengue and Chikungunya, Brazil, 2013-2020. Scientific data 2022, 9, 198, doi:10.1038/s41597-022-01312-7.
- Villamil-Gómez, W.E.; Rodríguez-Morales, A.J.; Uribe-García, A.M.; González-Arismendy, E.; Castellanos, J.E.; Calvo, E.P.; Álvarez-Mon, M.; Musso, D. Zika, Dengue, and Chikungunya Co-Infection in a Pregnant Woman from Colombia. International Journal of Infectious Diseases 2016, 51, 135–138, doi:10.1016/j.ijid.2016.07.017.
- Caicedo, D.M.; Méndez, A.C.; Tovar, J.R.; Osorio, L.; Caicedo, D.M.; Méndez, A.C.; Tovar, J.R.; Osorio, L. Desarrollo de algoritmos clínicos para el diagnóstico del dengue en Colombia. Biomédica 2019, 39, 170–185, doi:10.7705/biomedica.v39i2.3990.
- Dharap, P.; Raimbault, S. Performance Evaluation of Machine Learning-Based Infectious Screening Flags on the HORIBA Medical Yumizen H550 Haematology Analyzer for Vivax Malaria and Dengue Fever. Malaria Journal 2020, 19, doi:10.1186/s12936-020-03502-3.
- Tchapet Njafa, J.-P.; Nana Engo, S.G. Quantum Associative Memory with Linear and Non-Linear Algorithms for the Diagnosis of Some Tropical Diseases. Neural networks 2018, 97, 1–10, doi:10.1016/j.neunet.2017.09.002.
- Rodriguez-Quijada, C.; Gomez-Marquez, J.; Hamad-Schifferli, K. Repurposing Old Antibodies for New Diseases by Exploiting Cross-Reactivity and Multicolored Nanoparticles. ACS Nano 2020, 14, 6626–6635, doi:10.1021/acsnano.9b09049.
- Tan, K.W.; Tan, B.; Thein, T.L.; Leo, Y.-S.; Lye, D.C.; Dickens, B.L.; Wong, J.G.X.; Cook, A.R. Dynamic Dengue Haemorrhagic Fever Calculators as Clinical Decision Support Tools in Adult Dengue. Transactions of The Royal Society of Tropical Medicine and Hygiene 2020, 114, 7–15, doi:10.1093/trstmh/trz099.
- Veiga, R.V.; Schuler-Faccini, L.; França, G.V.; Andrade, R.F.; Teixeira, M.G.; Costa, L.C.; Paixão, E.S.; Costa, M. da C.N.; Barreto, M.L.; Oliveira, J.F.; et al. Classification Algorithm for Congenital Zika Syndrome: Characterizations, Diagnosis and Validation. Scientific Reports 2021, 11, 6770.
- Medeiros Neto, L.; Rogerio da Silva Neto, S.; Endo, P.T. A Comparative Analysis of Converters of Tabular Data into Image for the Classification of Arboviruses Using Convolutional Neural Networks. PLoS One 2023, 18, e0295598, doi:10.1371/journal.pone.0295598.
- da Silva Neto, S.R.; Tabosa Oliveira, T.; Teixeira, I.V.; Aguiar de Oliveira, S.B.; Souza Sampaio, V.; Lynn, T.; Endo, P.T. Machine Learning and Deep Learning Techniques to Support Clinical Diagnosis of Arboviral Diseases: A Systematic Review. PLoS neglected tropical diseases 2022, 16, e0010061, doi:10.1371/journal.pntd.0010061.
- Choubey, S.; Barde, S.; Badholia, A. Analysis of Deep Learning Techniques to Investigate and Support Diagnosis of Virus Borne Diseases. In Proceedings of the 3rd International Conference on Electronics and Sustainable Communication Systems, ICESC 2022 - Proceedings; Institute of Electrical and Electronics Engineers Inc., 2022; pp. 921–928.
- Arrubla-Hoyos, W.; Gómez, J.G.; De-La-Hoz-Franco, E. Methodology for the Differential Classification of Dengue and Chikungunya According to the PAHO 2022 Diagnostic Guide. Viruses 2024, 16, 1088. https://doi.org/10.3390/v16071088
- Noorbakhsh-Sabet, N.; Zand, R.; Zhang, Y.; Abedi, V. Artificial Intelligence Transforms the Future of Health Care. The American Journal of Medicine 2019, 132, 795–801, doi:10.1016/j.amjmed.2019.01.017.
- Wiljer, D.; Hakim, Z. Developing an Artificial Intelligence–Enabled Health Care Practice: Rewiring Health Care Professions for Better Care. Journal of Medical Imaging and Radiation Sciences 2019, 50, S8–S14, doi:10.1016/j.jmir.2019.09.010.
- Bharambe, A.; Chandorkar, A.A.; Kalbande, D. A Deep Learning Approach for Dengue Tweet Classification. In Proceedings of the 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA); IEEE: Coimbatore, India, September 2 2021; pp. 1043–1047.
- Khotimah, P.H.; Fachrur Rozie, A.; Nugraheni, E.; Arisal, A.; Suwarningsih, W.; Purwarianti, A. Deep Learning for Dengue Fever Event Detection Using Online News. In Proceedings of the 2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET); IEEE: Tangerang, Indonesia, November 18 2020; pp. 261–266.
- Gambhir, Shalini, Sanjay Kumar Malik, and Yugal Kumar. "The diagnosis of dengue disease: An evaluation of three machine learning approaches." International Journal of Healthcare Information Systems and Informatics (IJHISI)13.3 (2018): 1-19.
- Acosta Torres, J.; Oller Meneses, L.; Sokol, N.; Balado Sardiñas, R.; Montero Díaz, D.; Balado Sansón, R.; Sardiñas Arce, M.E. Técnica Árboles de Decisión Aplicada al Método Clínico En El Diagnóstico Del Dengue. Revista Cubana de Pediatría 2016, 88, 441–453.
- Arrubla-Hoyos, W.; Seveiche-Maury, Z.; Saeed, K.; Gómez, J.E.G.; De-La-Hoz-Franco, E. Comparison of Classical Machine Learning and Ensemble Techniques in the Context of Dengue Severity Prediction. In Proceedings of the 2023 IEEE Colombian Caribbean Conference (C3); November 2023; pp. 1–5.
- PAHO/WHO Epidemiological Update - Dengue, Chikungunya and Zika - 10 June 2023 - PAHO/WHO | Pan American Health Organization Available online: https://www.paho.org/en/documents/epidemiological-update-dengue-chikungunya-and-zika-10-june-2023 (accessed on 13 March 2024).
- Zoubir, A.M.; Boashash, B. The Bootstrap and Its Application in Signal Processing. IEEE signal processing magazine 1998, 15, 56–76.
- Zoubir, A.M.; Iskander, D.R. Bootstrap Techniques for Signal Processing; Cambridge University Press, 2004;
- Smith, P.J.; Hoaglin, D.C.; Battaglia, M.P.; Barker, L. Implementation and Applications of Bootstrap Methods for the National Immunization Survey. Statistics in medicine 2003, 22, 2487–2502.
- RAO, J.; WU, C. Bootstrap Inference with Stratified Samples[Technical Summary Report]. 1984.
- Kunz, P.J.; Abid, S. ben; Zoubir, A.M. The Heterogeneity-Intensified and Heterogeneity Ratio-Stratified Bootstrap (HiS- and HeRS-Boot) Oversampling to Boost a Detector Performance. In Proceedings of the 2023 IEEE SENSORS; October 2023; pp. 1–4.
- Acosta-Reyes, J.; Navarro-Lechuga, E.; Martínez-Garcés, J.C. Enfermedad por el virus del Chikungunya: historia y epidemiología. Revista Salud Uninorte 2015, 31, 621–630, doi:10.14482/sun.31.3.7486.
- Pardo-Turriago, R. Zika. Una pandemia en progreso y un reto epidemiológico. Colombian Journal of Anestesiology 2016, 44, 86–88.
- He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Transactions on knowledge and data engineering 2009, 21, 1263–1284.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority over-Sampling Technique. Journal of artificial intelligence research 2002, 16, 321–357.
- Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-Year Anniversary. Journal of artificial intelligence research 2018, 61, 863–905.
- Connor, S.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. Journal of big data 2019, 6, 1–48.
- Shaikh, Salim Gulab, et al. "Original Research Article Hybrid machine learning method for classification and recommendation of vector-borne disease." Journal of Autonomous Intelligence 7.2 (2024).
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have addressed most of my earlier concerns, especially the details of the data sets and the results (confusion matrix). However, my question about the data size of Chikungunya, which section 3.1.4 describes as increasing from 9 (6%) to 89 (33%), remains vague and needs further clarification. The authors mention in lines 173 to 177 that bootstrap sampling "creates new samples by randomly sampling from existing data, rather than creating new synthetic samples from the minority class". How can random samples drawn from existing samples (containing only 9 records) become 89 samples?
Comments on the Quality of English Language
Minor editing of English language required.
Author Response
Dear Reviewer:
We would like to thank you for the review and valuable feedback you provided on our article titled “Differential Classification of Dengue, Zika, and Chikungunya Using Machine Learning, Random Forests, and Decision Trees.” We have worked diligently to address your comments and suggestions, and I am pleased to inform you that we have completed the requested corrections.
Below we have summarized the major edits made in response to your comments.
Reviewer 1.
- The authors have addressed most of my earlier concerns, especially the details of the data sets and the results (confusion matrix). However, my question about the data size of Chikungunya, which section 3.1.4 describes as increasing from 9 (6%) to 89 (33%), remains vague and needs further clarification. The authors mention in lines 173 to 177 that bootstrap sampling "creates new samples by randomly sampling from existing data, rather than creating new synthetic samples from the minority class". How can random samples drawn from existing samples (containing only 9 records) become 89 samples?
Before
3.1.4. Target balance through bootstrapping
One of the main drawbacks of the dataset is its limited size, with only 150 rows and 24 columns (89 Zika, 52 Dengue, and 9 Chikungunya records), which classifies it as a small dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for Chikungunya disease, which represents only 6% of the data. Faced with this challenge and considering the various options for balancing the data, the bootstrapping technique was chosen. This technique allowed creating new samples that balanced the dengue and chikungunya labels with Zika disease in the target variable. As a result, the new dataset increased from 150 to 267 samples in total. Each disease now has 89 samples, without reducing the Zika records, which previously had 89 records before applying the technique. Bootstrap was chosen because it is a specialized resampling technique based on the central limit theorem, which generates new samples by randomly sampling existing data rather than creating new synthetic samples from the minority class data, such as the adaptive synthetic algorithms ADASYN [38], synthetic minority oversampling SMOTE [39-40] or data augmentation [35] [41]. Figure 2 presents the data balancing process.
Now
3.1.4. Target balance through bootstrapping
One of the main drawbacks of the dataset is its limited size, with only 150 rows and 24 columns (89 Zika, 52 Dengue, and 9 Chikungunya records), which classifies it as a small dataset. In addition, the imbalance in the classes of the target variable makes it difficult to train the model, especially for Chikungunya disease, which represents only 6% of the data. Faced with this challenge, and considering the various options for balancing the data, the bootstrapping technique was chosen. This technique allowed the creation of new samples that balanced dengue and chikungunya labels with Zika disease in the target variable. Consequently, the new dataset increased from 150 to 267 samples in total. Each disease now had 89 samples, without reducing the Zika records, which previously had 89 records before applying the technique. Bootstrapping was chosen because it is a statistical technique that allows the accuracy of a statistic (such as the mean or median) to be estimated by creating multiple samples from the original data. This is particularly useful when the exact properties of the underlying population are unknown. Unlike synthetic oversampling methods such as ADASYN [38], SMOTE [39-40] or data augmentation [35] [41], which create new synthetic samples to augment the data of the minority class, bootstrap does not generate new data. Instead, it creates “dummy samples” by repeatedly drawing with replacements from the original data. This means that the data are randomly selected from the original sample, allowing the same data to be selected more than once.
This process is repeated many times, calculating the desired estimate (such as the mean) for each sample and examining how these estimates vary. By repeating this process multiple times, a distribution of possible values of the estimate is obtained, and the variability and uncertainty can be assessed without making strong assumptions about the underlying distribution of the data. Figure 2 presents the data balancing process.
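The balancing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that every original record is kept and that each smaller class is topped up by drawing with replacement from its own records until it matches the largest class. The function name `balance_by_bootstrap`, the seed, and the placeholder records are hypothetical; only the class sizes (89, 52, 9) come from the manuscript.

```python
import random
from collections import Counter

def balance_by_bootstrap(records, label_of, seed=0):
    """Oversample every class up to the size of the largest class.

    Extra rows are drawn with replacement from the class's own original
    rows, so no synthetic feature values are created.
    """
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(label_of(rec), []).append(rec)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)  # keep every original row
        # top up with resampled duplicates (k is 0 for the largest class)
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced

# Class sizes reported in the manuscript; the records are placeholders.
records = ([("zika", i) for i in range(89)]
           + [("dengue", i) for i in range(52)]
           + [("chikungunya", i) for i in range(9)])
balanced = balance_by_bootstrap(records, label_of=lambda r: r[0])
counts = Counter(label for label, _ in balanced)
unique_chik = len({r for r in balanced if r[0] == "chikungunya"})
```

Under these assumptions the balanced set has 267 rows, 89 per class, while the chikungunya class still contains only its 9 distinct original records; the other 80 rows are repeats of them.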
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have replied to most of my comments well. However, I am still interested in the real-world application of the findings. As the authors mention in the discussion: “Moreover, the predictive model developed in this study could be particularly beneficial in regions where dengue, Zika, and chikungunya co-circulate, as early differentiation between these diseases is challenging due to their similar initial symptoms.” Early diagnosis is critical to differentiate the three diseases. However, the DT model indicated that headache, IgM, platelet count, and symptom days are important factors for classifying dengue, Zika, or chikungunya. It usually takes some time to obtain the IgM level, platelet count, and symptom days, which contradicts the so-called “early differentiation.” On the other hand, the feature importance of the RF model tells a different story. Can we differentiate the three diseases using the RF model? Most of the important variables in the RF model are symptom-based, which makes it more feasible for physicians to diagnose dengue, Zika, or chikungunya at an early stage. That is why I asked in the previous comment: “Can you develop a criterion for clinical doctor to diagnose the diseases just based on certain symptoms?”
Author Response
Dear Reviewer:
We would like to thank you for the review and valuable feedback you provided on our article titled “Differential Classification of Dengue, Zika, and Chikungunya Using Machine Learning, Random Forests, and Decision Trees.” We have worked diligently to address your comments and suggestions, and we are pleased to inform you that we have completed the requested corrections.
Below we have summarized the major edits made in response to your comments.
Reviewer 2.
The authors have replied to most of my comments well. However, I am still interested in the real-world application of the findings. As the authors mention in the discussion: “Moreover, the predictive model developed in this study could be particularly beneficial in regions where dengue, Zika, and chikungunya co-circulate, as early differentiation between these diseases is challenging due to their similar initial symptoms.” Early diagnosis is critical to differentiate the three diseases. However, the DT model indicated that headache, IgM, platelet count, and symptom days are important factors for classifying dengue, Zika, or chikungunya. It usually takes some time to obtain the IgM level, platelet count, and symptom days, which contradicts the so-called “early differentiation.” On the other hand, the feature importance of the RF model tells a different story. Can we differentiate the three diseases using the RF model? Most of the important variables in the RF model are symptom-based, which makes it more feasible for physicians to diagnose dengue, Zika, or chikungunya at an early stage. That is why I asked in the previous comment: “Can you develop a criterion for clinical doctor to diagnose the diseases just based on certain symptoms?”
Before
Figure 13. Comparison of the precision and error of RF models generated by cross-validation.
On the other hand, the scarcity of specific research on the classification of diseases such as dengue, zika and chikungunya limits direct comparisons of results. However, a recent research [42] addresses this challenge by developing a proposal to classify seven similar diseases, which includes 137 records of zika, 127 of dengue and 140 of chikungunya, in addition to other diseases such as malaria and yellow fever, totaling 1,500 records. This proposal compares various algorithms and presents a hybrid technique called HML, which combines machine learning techniques with reinforcement learning based on recurrent neural networks (RNN). The results obtained show high precision, with an accuracy of 98.7%, precision of 98.7%, recall of 98.4% and an F1-score of 99.10%.
Despite these promising results, the research does not include confusion matrices that allow evaluating the reliability of the classification for each disease individually. When comparing these results with those of our research, it is observed that our models outperform the proposed quality metrics, especially in terms of accuracy, precision, recall, and F1-score. Furthermore, our research provides a detailed analysis at the confusion matrix level for each class, allowing a more accurate assessment of the classification capacity of each disease. This highlights not only the effectiveness of our models in differentiating between dengue, zika, and chikungunya, but also the advantage of having detailed metrics to assess and improve classification quality.
The results of this research support the feasibility of a model for early and differential prediction of dengue, Zika, and chikungunya based on signs, symptoms, and clinical laboratory results. This model showed high performance with an accuracy of 98.8%, precision of 99.6%, specificity of 99.8% and F1-Score of 99.5%. In addition, its ability to accurately recognise each disease is remarkable, achieving 99.7% for chikungunya, 99.1% for dengue, and 98.8% for Zika.
The use of cross-validation in this study played a crucial role in providing a more accurate estimate of model performance. By employing multiple partitions of the dataset for training and validation, this technique reduces the risk of overfitting and improves the ability of the model to generalise to unseen data. In addition, using cross-validation, more stable and reliable metrics of model performance were obtained, allowing for a more accurate assessment of the model's ability to predict these diseases.
Bootstrapping was used to balance the classes in model construction. This technique allowed us to work with the unbalanced dataset that made up the dataset, generating multiple samples of equal size to the original dataset and randomly selecting observations with replacement. By applying this technique, we were able to obtain an adequate representation of the training samples, which helped improve the model's ability to learn, in a balanced way, the characteristics of each disease.
Finally, this study represents a significant advance in the differential prediction of dengue, zika, and chikungunya using machine learning techniques and the analysis of signs, symptoms, and laboratory variables. The developed model offers robust diagnostic support, based on the criteria established in the PAHO evidence synthesis (2022), which clearly distinguishes the signs and symptoms of each disease for diagnosis and treatment. With high performance, this model not only demonstrates remarkable accuracy, but also has great potential for implementation in clinical settings. Its integration into clinical practice would provide fundamental support to health professionals, facilitating early and accurate diagnoses, and favoring timely decision-making that improves patient outcomes.
Moreover, the predictive model developed in this study could be particularly beneficial in regions where dengue, Zika, and chikungunya co-circulate, as early differentiation between these diseases is challenging due to their similar initial symptoms. This tool could empower healthcare providers to make more informed and rapid decisions about patient management, ultimately leading to better care and outcomes.
Although this work presents some limitations regarding the amount of data, especially for chikungunya, which were addressed by specialized computational techniques, it is recognized that the reliability of the model could be improved with a larger volume of data. Despite these limitations, this study establishes benchmark for future research, since, according to [20] [21], no comparable studies have been identified in the literature, mainly due to the scarcity of data sets that include records of these viruses.
Now
Figure 13. Comparison of the precision and error of RF models generated by cross-validation.
Table 6. Description of the dataset.
Variable | Description
Age | Represents the age of patients
Sex | Represents the sex of the patient
Symptom_days | Represents the number of days from the date of symptom onset to the day of consultation.
headache | Indicates headache symptom (yes or no)
Retroocular_pain | Indicates symptom of retro ocular pain (yes or no)
Myalgia | Indicates symptom myalgia (yes or no)
Arthralgia | Indicates symptom Arthralgia (yes or no)
Rash | Indicates symptom Rash (yes or no)
Abdominal_pain | Indicates whether the patient has abdominal pain (yes or no).
Threw_up | Indicates whether the patient has vomited (yes or no).
Diarrhea | Indicates whether the patient has symptoms of diarrhoea (yes or no).
Drowsiness | Indicates whether the patient has symptoms of Drowsiness (yes or no).
Hepatomegaly | Indicates whether the patient has Hepatomegaly sign (yes or no).
Mucosal_hemorrhage | Indicates whether the patient has the sign of mucosal bleeding (yes or no).
Hyperemia | Indicates if the patient has the signs of Hyperemia (yes or no).
exanthema | Indicates if the patient has signs of rash (yes or no).
Target | Indicates illness, dengue, zika or chikungunya
The results obtained in this study allow for a highly accurate classification of dengue, Zika, and chikungunya diseases, highlighting the relevance of certain variables in prediction. Consequently, a new experiment was carried out, in which only variables related to signs and symptoms were selected, excluding laboratory results that were not available in the early stages of the disease, as well as variables that were not significant in previous analyses. This new model proposal seeks to align with the medical reality, providing an approach that, based on data obtainable by the physician in the early stages of the disease, effectively supports decision-making in the classification of these pathologies. Table 6 presents the variables that were selected to create the new model.
The training was carried out under the same conditions as the previous models using stratified cross-validation and the methodology proposed in [22]. The results are presented in Table 7.
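For readers unfamiliar with stratified cross-validation, the sketch below shows the idea in plain Python: the records of each class are dealt across the folds so that every fold keeps roughly the class mix of the full dataset. It is an illustration only. The number of folds (`k=5`), the seed, and the function name `stratified_kfold` are assumptions; the manuscript excerpt does not state them. The label counts reuse the balanced 89/89/89 dataset described earlier.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose class mix mirrors the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin across folds
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test

labels = ["zika"] * 89 + ["dengue"] * 89 + ["chikungunya"] * 89
splits = list(stratified_kfold(labels, k=5))
```

One design consideration when oversampling by duplication is combined with cross-validation: if duplicates are created before the split, copies of the same record can fall on both the training and the test side of a fold. Resampling inside each training fold avoids this. The excerpt does not say which order was used.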
Table 7. Quality metrics of the models applying methodology based on the PAHO Guidelines (2022).
ML technique | Accuracy | Precision | Specificity | Recall | F1-Score
Decision Tree | 96% | 97% | 96% | 96% | 96%
RF | 99.3% | 99.8% | 99.9% | 99.9% | 99.9%
The results presented in Table 7 highlight the excellent performance of the RF technique, which scored above 99% on every quality metric. The decision tree also performed solidly, at about 96% across all metrics. These results suggest that the developed models are highly effective and can be adapted for early disease detection, providing valuable support to the medical community for accurate triage of dengue, Zika, and chikungunya. This is especially useful in remote communities where the lack of experienced medical epidemiologists or specialists can make early disease triage difficult.
Table 8 shows the quality metrics of both models and highlights their ability to recognise chikungunya, dengue, and Zika diseases. Although the RF model has superior metrics, suggesting that it might be the preferred option in terms of pure performance, the Decision Tree offers very robust performance and clearer interpretability in the medical domain. This better interpretability makes it a potentially more useful tool in contexts where a detailed understanding of the model's decisions is critical to support clinical decision-making.
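The per-class figures reported in Table 8 are the kind of quantities that follow directly from a confusion matrix. The sketch below shows the standard definitions of precision, recall, and F1-score per class. The counts in `matrix` are hypothetical and chosen only for illustration; they are not the study's confusion matrix, and the function name `per_class_metrics` is likewise an assumption.

```python
def per_class_metrics(matrix, classes):
    """matrix[i][j] = number of cases with true class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(classes)))
    report = {"accuracy": correct / total}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: all predicted as class i
        actual = sum(matrix[i])                    # row sum: all truly class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return report

classes = ["chikungunya", "dengue", "zika"]
# Hypothetical counts for illustration only; not taken from the study.
matrix = [
    [9, 0, 0],   # true chikungunya
    [0, 10, 0],  # true dengue
    [1, 0, 7],   # true zika (one case predicted as chikungunya)
]
report = per_class_metrics(matrix, classes)
```

With these illustrative counts, a single Zika case predicted as chikungunya lowers chikungunya precision to 0.90 (F1 about 0.947) and Zika recall to 0.875 (F1 about 0.933), while dengue stays at 1.0 on all three metrics. This shows how one misclassification in a small test fold can move per-class percentages by several points.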
Table 8. Model quality metrics.
Quality metrics Decision Tree with Methodology
Class | Accuracy | Precision | Recall | F1-Score
Chikungunya | 96.0% | 90.0% | 100% | 95%
Dengue | 96.0% | 100% | 100% | 100%
Zika | 96.0% | 100% | 88.0% | 93.0%
Quality metrics Decision Tree without Methodology
Class | Accuracy | Precision | Recall | F1-Score
Chikungunya | 99.9% | 99.8% | 100% | 99.9%
Dengue | 99.3% | 98.6% | 99.4% | 99.0%
Zika | 99.3% | 99.4% | 98.4% | 98.9%
Quality metrics Random Forest with Methodology |
Figure 14 illustrates the decision tree, highlighting that the most significant variable for classifying dengue was headache, followed by abdominal pain, retroocular pain, and arthralgia. These findings are in line with the PAHO guidelines, which identify these symptoms as differential signs, supported by scientific evidence. In addition, patient age emerged as a significant factor in the classification of dengue. For chikungunya, myalgia was observed as a key variable, which is in line with the PAHO indications. However, symptom duration, retroocular pain, and patient age were also identified as important factors in the classification of chikungunya. Finally, in the case of Zika, significant variables for classification include myalgia, abdominal pain, age, and arthralgia. It should be noted that, although relevant in this context, these variables are not mentioned as distinctive signs or symptoms of Zika in the PAHO 2022 guidelines.
Figure 14. Tree diagram of the DT model.
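To make concrete how a decision tree comes to place one symptom, such as headache, at the top of the tree, the sketch below computes the impurity reduction obtained by splitting on a yes/no variable. It assumes the Gini criterion, which is a common default in CART-style implementations; the excerpt does not state which criterion the authors used. The toy rows are invented for illustration, and only the column names follow Table 6.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(rows, feature, target="Target"):
    """Impurity reduction from splitting rows on a yes/no feature."""
    parent = [r[target] for r in rows]
    yes = [r[target] for r in rows if r[feature] == "yes"]
    no = [r[target] for r in rows if r[feature] != "yes"]
    n = len(rows)
    weighted = len(yes) / n * gini(yes) + len(no) / n * gini(no)
    return gini(parent) - weighted

# Toy rows, invented for illustration; not data from the study.
rows = [
    {"headache": "yes", "Rash": "no",  "Target": "dengue"},
    {"headache": "yes", "Rash": "yes", "Target": "dengue"},
    {"headache": "no",  "Rash": "yes", "Target": "zika"},
    {"headache": "no",  "Rash": "no",  "Target": "zika"},
]
gains = {f: gini_gain(rows, f) for f in ("headache", "Rash")}
best = max(gains, key=gains.get)  # feature the tree would split on first
```

In this toy example, splitting on `headache` separates the two classes perfectly (gain 0.5), whereas `Rash` leaves both branches as mixed as the parent (gain 0), so `headache` would be chosen as the first split.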
The scarcity of specific research on the classification of diseases such as dengue, Zika, and chikungunya limits direct comparisons of results. However, a recent study [42] addressed this challenge by developing a proposal to classify seven similar diseases, including 137 records of Zika, 127 of dengue, and 140 of chikungunya, in addition to other diseases such as malaria and yellow fever, totalling 1,500 records. This proposal compares various algorithms and presents a hybrid technique called HML that combines machine learning techniques with reinforcement learning based on recurrent neural networks (RNN). The results obtained showed high performance, with an accuracy of 98.7%, precision of 98.7%, recall of 98.4%, and an F1-score of 99.10%.
Despite these promising results, that study does not include confusion matrices that would allow the reliability of the classification to be evaluated for each disease individually. When comparing these results with those of our research, it is observed that our models exceed the reported quality metrics, particularly in terms of accuracy, precision, recall, and F1-score. Furthermore, our research provides a detailed analysis at the confusion-matrix level for each class, allowing for a more accurate assessment of the classification capacity for each disease. This highlights not only the effectiveness of our models in differentiating between dengue, Zika, and chikungunya, but also the advantage of having detailed metrics to assess and improve classification quality.
The results of this research support the feasibility of a model for early and differential prediction of dengue, Zika, and chikungunya based on signs and symptoms. This model showed high performance with an accuracy of 99.3%, precision of 99.8%, specificity of 99.9% and F1-Score of 99.9%. Furthermore, its ability to accurately recognise each disease is remarkable, reaching 99.9% for chikungunya, 99.3% for dengue, and 99.3% for Zika.
Author Response File: Author Response.pdf
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
The additional information provided by the authors relates to the issue of augmentation of the Zika sample/data set.
Reviewer 2 Report
Comments and Suggestions for Authors
The author has replied to all the comments well.