  • Article
  • Open Access

30 January 2026

AI-Assisted Differentiation of Dengue and Chikungunya Using Big, Imbalanced Epidemiological Data

1 International Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110301, Taiwan
2 In-Service Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 110301, Taiwan
3 AIBioMed Research Group, Taipei Medical University, Taipei 110301, Taiwan
4 Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110301, Taiwan
This article belongs to the Special Issue Surveillance, Modelling, and Risk Mapping of Tropical Infectious Diseases

Abstract

Dengue and chikungunya are endemic arboviral diseases in many low- and middle-income countries, often co-circulating and presenting with overlapping symptoms that hinder early diagnosis. Timely differentiation is critical, especially in resource-limited settings where laboratory testing is unavailable. We developed and evaluated machine learning (ML) and deep learning (DL) models to classify dengue, chikungunya, and discarded cases using a large-scale, real-world dataset of over 6.7 million entries from Brazil (2013–2020). After applying the Synthetic Minority Oversampling Technique (SMOTE) to address class imbalance, we trained six ML models and one artificial neural network (ANN) using only demographic, clinical, and comorbidity features. The Random Forest model achieved strong multi-class classification performance (recall: 0.9288; area under the curve (AUC): 0.9865). The ANN model excelled in identifying chikungunya cases (recall: 0.9986; AUC: 0.9283), suggesting its suitability for rapid screening. External validation confirmed the generalizability of our models, particularly for distinguishing discarded cases. Our models demonstrate high accuracy in differentiating dengue and chikungunya using routinely collected clinical and epidemiological data. This work supports the development of Artificial Intelligence-powered decision-support tools to assist frontline healthcare workers in under-resourced settings and aligns with the One Health approach to improving surveillance and diagnosis of neglected tropical diseases.

1. Introduction

Neglected tropical diseases (NTDs) continue to burden impoverished communities in tropical and subtropical regions, contributing significantly to global health inequities. Among these, arboviral infections such as dengue and chikungunya have re-emerged as major public health concerns [1], particularly in low- and middle-income countries across Latin America, Southeast Asia, and Africa [2,3]. Both diseases are transmitted by the Aedes mosquito and frequently co-circulate, making clinical differentiation challenging due to their overlapping symptomatology [4].
In 2021, the World Health Organization (WHO) launched its 2021–2030 roadmap for NTDs with an ambitious new target: a 75% reduction in deaths from vector-borne NTDs, including dengue and chikungunya [5]. However, the global outlook for dengue is still not optimistic: over 6.5 million cases and more than 7300 dengue-related deaths were reported across all six WHO regions in 2023, with the Americas reporting the highest burden at 4.5 million cases and 2300 deaths. The year 2024 surpassed 2023 as the worst year for recorded dengue cases, with over 10 million cases reported, 24,000 severe cases, and 6508 deaths [6]. Dengue cases increased more than 50%, from 26.5 million in 1990 to 59.0 million in 2021 [7], with a higher incidence in children and adolescents [8].
Although dengue usually presents with common flu-like symptoms such as fever, myalgia, and headache, and most cases resolve on their own, a small number of patients develop severe dengue [9]. Patients with prior exposure to dengue alone, or to dengue combined with another arbovirus such as Zika, are at an increased risk of developing severe dengue, which can be fatal [10].
Besides dengue, chikungunya is also a public health concern as a re-emerging vector-borne disease [11], with 18.7 million cases in 110 countries between 2011 and 2020, causing 1.95 million disability-adjusted life years, USD 2.8 billion in direct costs, and USD 47.1 billion in indirect costs worldwide; the majority of the disease burden was observed in the Americas [12,13]. An estimated 51% of symptomatic chikungunya patients with laboratory-confirmed tests developed chronic disability after infection [14]. The clinical manifestations of chikungunya at the early stage include fever, arthralgia, myalgia, and joint pain and swelling, which are almost indistinguishable from dengue and other febrile diseases [15].
Many efforts have been made to differentiate dengue from chikungunya and other febrile diseases. A three-year study in Puerto Rico identified arthritis, joint pain, skin rash, any bleeding, and irritability as clinical predictors that distinguish chikungunya from dengue, whereas joint pain, muscle, bone, or back pain, skin rash, and red conjunctiva were significant predictors for chikungunya compared with other acute febrile infections [16]. Researchers in Brazil proposed a clinical scoring rule to diagnose chikungunya infection in a dengue-endemic area, using fever, exanthema, myalgia, arthralgia or arthritis, and joint edema; this system achieved an AUC of 0.695 [17]. However, in remote areas, human and equipment resources are limited. Dengue and chikungunya share the same vector, the Aedes mosquito, and their symptoms resemble those of other arboviral pathogens, which makes correct diagnosis more challenging [18,19,20]. Therefore, an alternative approach is needed to assist healthcare workers in the early detection and classification of patients with these diseases without laboratory confirmation.
Machine Learning (ML), a subset of Artificial Intelligence (AI), comprises algorithms that learn the patterns inside a dataset and use this experience to automatically improve the accuracy of their output [21]. In recent years, advances in clinical practice using ML and its subfield, Deep Learning (DL), have been shown to improve diagnostic performance [22,23,24], including the detection of arboviral diseases [25,26]. These techniques can assist in classifying patients with arboviral diseases and in supporting their diagnosis and treatment in resource-limited settings. Previous studies attempted to develop ML models for binary classification of dengue fever (DF). These models aimed to predict dengue positivity/negativity or to differentiate between severe dengue (SD) and non-severe dengue, using various types of data, including meteorological, genomic, socio-demographic, clinical, and laboratory data [27]. Recently, a novel approach using micro-spectroscopy techniques combined with machine learning showed promise for rapid classification of dengue and chikungunya in remote areas [28].
Because our aim was to predict clinical dengue cases, we examined the previous literature that applied ML for dengue diagnosis based on demographic, clinical, and laboratory data only, focusing on dataset size, features used, the ML models implemented, and the metrics used to evaluate the models’ performance. For example, Ho et al. applied different ML models to identify 2942 dengue cases from a dataset of 4894 patients with dengue-like illnesses, using only age, body temperature, white blood cell count, and platelet count as input features. Logistic regression (LR), Decision Tree (DT), and Deep Neural Networks (DNN) were used to build predictive models; the proposed DL model achieved the best performance with an AUC of 0.8587 [29]. Abdualgalil et al. analyzed 6694 samples containing one continuous variable (age) and 20 binary categorical features (including sex, fever, headache, arthralgia, myalgia, conjunctivitis, skin rash, generalized weakness, jaundice, decrease in urine or anuria, abdominal pain, and vomiting) with a train/test split ratio of 70/30 to build five ML models: k-Nearest Neighbor (KNN), Gradient Boosting Classifier, eXtreme Gradient Boosting (XG), Extra Tree Classifier (ETC), and Light Gradient Boosting Machine [30]. The target variable was dengue positivity or negativity; the ETC model using the hold-out cross-validation approach achieved the highest accuracy of 0.9912.
In predicting SD cases, Phakhounthong et al. used DT to predict 38 SD cases out of 198 laboratory-confirmed dengue cases in Cambodian children, using five clinical and laboratory attributes (hematocrit, Glasgow Coma Score, urinary protein, creatinine, and platelet count). The DT model achieved 0.605 sensitivity, 0.65 specificity, and 0.641 accuracy [31]. Huang et al. analyzed 798 laboratory-confirmed dengue cases, including 138 SD cases, to develop various ML models for assessing the risk of dengue severity based on six features: age, sex, viral RNA amounts, the positivity of NS1, and IgM and IgG test results. Different ML methods, including LR, Random Forest (RF), Gradient Boosting Machine (GB), Support Vector Classifier (SVC), and Artificial Neural Networks (ANN)—a DL algorithm—were applied for building prognostic models. The ANN model outperformed others with an AUC of 0.8324 and 0.7523 accuracy [32].
Besides binary classification models, the development of multi-class classification algorithms, which can differentiate between dengue and other mosquito-borne diseases, is of great importance to clinicians when more than one arbovirus is present in endemic areas. Lee et al. implemented LR and DT models to differentiate between 862 DF, 55 dengue hemorrhagic fever (DHF), and 117 chikungunya cases in two scenarios: with and without laboratory testing (suitable for well-resourced and resource-limited settings, respectively). Multiple demographic, epidemiological, and clinical features were used for the prediction. Without laboratory results, the DT model achieved an overall AUC of 0.59 in classifying DF and chikungunya cases, while performing better in discriminating DHF versus chikungunya with an AUC of 0.91 [33].
Tabosa de Oliveira et al. trained seven ML models, namely RF, Adaptive Boosting (AD), GB, XG, KNN, Naïve Bayes (NB), and Multilayer Perceptron, on a dataset of 17,272 records, with 5724 for each of the three classes: dengue, chikungunya, and others (patients classified as “inconclusive” or “negative” for both dengue and chikungunya). The dataset consists of 26 features: socio-demographic (age, sex, gestational age in case sex is female, race, residence area, number of days the patient has had symptoms), clinical (fever, myalgia, headache, rash, vomiting, nausea, back pain, conjunctivitis, arthritis, arthralgia, petechiae, tourniquet test, eye pain), and comorbidities (diabetes, hypertension, and hematological, liver, kidney, peptic acid, and autoimmune disease). The GB model achieved the best performance with 0.6240 accuracy, 0.6257 precision, 0.6205 recall, and 0.6196 F1-score [34].
Previous studies that focused on multi-class classification of dengue and other diseases are still limited compared with those on binary tasks [35]. Moreover, a previous study that aimed to differentiate dengue from malaria, leptospirosis, and scrub typhus indicated that ML models (DT, RF, AD) achieved only 55–60% overall predictability on the multi-class classification task, far lower than binary classification using the LR model, which averaged 79–84% correct predictions for one versus other diseases [36]. Previous studies have examined the applicability of ML-based models for disease diagnosis; however, no studies have attempted to apply DL algorithms for multi-class classification of multiple arboviral diseases [35]. To the best of our knowledge, no studies have investigated the multi-class task on an imbalanced dataset with more than 100,000 records.
Therefore, in this study, we built different ML and DL models to investigate how well dengue and chikungunya cases can be differentiated from discarded cases (inconclusive cases of the two diseases) using a big, highly imbalanced open-source dataset.

2. Materials and Methods

2.1. Data Collection

We used an open-source dataset from da Silva Neto et al. [37], which consists of 4,307,513 dengue, 325,000 chikungunya, and 2,100,029 discarded cases in Brazil from 2013 to 2020 for classification. The dataset has 55 variables, of which the 9 laboratory variables were excluded. The remaining features were classified into three groups: demographic, clinical, and comorbidity data. One important characteristic of this dataset is that some diagnostically relevant features, such as days from symptom onset and severity markers, were not collected by the researchers. Figure 1 illustrates the workflow of this study.
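To make the degree of imbalance concrete, the short sketch below computes the class shares and the ratio between the largest and smallest class from the three case counts reported above. Only those counts come from the text; the variable names are illustrative.

```python
# Class counts as reported in Section 2.1.
class_counts = {
    "dengue": 4_307_513,
    "chikungunya": 325_000,
    "discarded": 2_100_029,
}

total = sum(class_counts.values())

# Share of each class in the full dataset.
class_shares = {name: count / total for name, count in class_counts.items()}

# Ratio of the largest class to the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for name, share in class_shares.items():
    print(f"{name:12s} {class_counts[name]:>10,d}  {share:6.2%}")
print(f"total        {total:>10,d}")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

Chikungunya, the class of interest, accounts for under 5% of the records, which is why the imbalance is addressed explicitly before model training.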
Figure 1. The workflow of this study.

2.2. Data Preprocessing

First, the data were checked for missing values and typos. Next, the outcome variable was encoded as follows: 0 for discarded cases, 1 for chikungunya, and 2 for dengue. Apart from the age variable, which is numeric, all categorical variables were converted into numeric format according to the number of classes within each variable. Third, the dataset was split into training and testing data with an 80/20 ratio. The testing data, also called the internal test set, shares the characteristics of the training set, including years, geographic regions, municipalities, and class distribution, and was held out during model training. The training data were then split again with the same ratio into training and validation subsets. Because the class of interest (chikungunya) is a minority class, we handled the imbalance in the training dataset with several techniques: random undersampling, random oversampling, and the Synthetic Minority Oversampling Technique (SMOTE) [38]. Model performance was compared across these approaches, and SMOTE showed a clear advantage over the other techniques. Therefore, SMOTE was applied in this study despite the potential concern of overfitting. After applying SMOTE to the training data, we had 8,272,884 samples, with 2,757,628 instances in each class. Important features for predicting dengue and chikungunya were then selected using the best-performing ML algorithm as the baseline. Last, the training and test sets were normalized using z-score scaling.
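The oversampling and scaling steps can be illustrated with a minimal, standard-library sketch. It is not the pipeline used in the study: `smote_sketch` is a simplified stand-in for SMOTE (interpolation between a minority sample and one of its nearest minority neighbours), the toy data has two classes and two features, and estimating the z-score statistics on the training rows only is an assumption, since the text does not state how the scaler was fitted.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def zscore_fit(rows):
    """Column-wise mean and standard deviation, estimated on the given rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column gets a divisor of 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def zscore_apply(rows, means, stds):
    """Scale rows with previously fitted means and standard deviations."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical toy training data: 40 majority and 8 minority samples.
data_rng = random.Random(0)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
minority = [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(8)]

# Oversample the minority class until it matches the majority class size.
minority_balanced = minority + smote_sketch(minority, len(majority) - len(minority))

train = majority + minority_balanced
means, stds = zscore_fit(train)
train_scaled = zscore_apply(train, means, stds)

print(len(majority), len(minority_balanced))
```

In the three-class setting described above, the same idea would be applied to every class smaller than the largest one, which is consistent with the reported total of 8,272,884 samples being three times 2,757,628.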

2.3. ML Model Development

We used various ML techniques to develop six models to diagnose different diseases: Random Forest (RF), Decision Tree (DT), Adaptive Boosting (AD), Gradient Boosting Machine (GB), eXtreme Gradient Boosting (XG), and K-nearest Neighbor (KNN). The proposed algorithms were trained with different hyperparameters and random_state = 42, using RandomizedSearchCV from the scikit-learn library. Table 1 displays the hyperparameters that produced the best results. The best-performing model was then used for feature selection with the Recursive Feature Elimination with Cross-Validation (RFECV) technique, using the following parameters: estimator = the ML classifier with the best performance; step = 1; cv = StratifiedKFold(5); scoring = ‘accuracy’.
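The two-step procedure described above can be sketched with scikit-learn as follows. The RFECV arguments (step, cv, scoring) and random_state = 42 follow the text; everything else is an assumption made for illustration: the data is synthetic, Random Forest is used as the estimator, and the small search grid is hypothetical and does not reproduce Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic three-class stand-in for the real data (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# Step 1: randomized hyperparameter search. The grid is illustrative only.
param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [3, 5, None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=4, cv=3, scoring="accuracy", random_state=42,
)
search.fit(X, y)

# Step 2: recursive feature elimination with the settings quoted in the text.
selector = RFECV(
    estimator=RandomForestClassifier(random_state=42, **search.best_params_),
    step=1,
    cv=StratifiedKFold(5),
    scoring="accuracy",
)
selector.fit(X, y)

print("best hyperparameters:", search.best_params_)
print("features kept:", selector.n_features_)
print("selection mask:", selector.support_)
```

With step = 1, one feature is dropped per elimination round, so the cost grows with the number of features times the number of folds; on a dataset of millions of rows this is the expensive part of the workflow.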
Table 1. Hyperparameters with the best results for each model.

2.4. Artificial Neural Network (ANN) Model

The main advantage of an ANN model is that it does not require human intervention for any of its processes, which allows automatic feature extraction, in contrast to conventional ML [39]. Table 2 describes the configuration of the ANN model. The network had two hidden layers, each with an output shape of 64.
Table 2. The model configuration of the proposed ANN algorithm.
The above model was trained for 30 epochs with a batch size of 128 using the Keras and TensorFlow packages, and hyperparameter optimization was performed with RandomSearchCV to support effective differential diagnosis among the three classes.
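As a rough illustration of the network shape described above, the following standard-library sketch runs one forward pass through two hidden layers of 64 units and a three-class output. Only the two hidden layers of width 64 and the three classes come from the text. The input width of 25, the ReLU and softmax activations, and the random untrained weights are assumptions, because Table 2 is not reproduced here; the study itself used Keras and TensorFlow.

```python
import math
import random

rng = random.Random(42)


def dense(n_in, n_out):
    """Randomly initialised weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layer, x, activation):
    """Affine transform followed by the given activation function."""
    weights, biases = layer
    z = [sum(w * v for w, v in zip(row, x)) + b
         for row, b in zip(weights, biases)]
    return activation(z)


def relu(z):
    return [max(0.0, v) for v in z]


def softmax(z):
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


N_FEATURES = 25  # assumption: one input per selected feature
N_CLASSES = 3    # 0 = discarded, 1 = chikungunya, 2 = dengue

hidden_1 = dense(N_FEATURES, 64)
hidden_2 = dense(64, 64)
output = dense(64, N_CLASSES)


def predict_proba(x):
    """Class probabilities for one feature vector."""
    h1 = forward(hidden_1, x, relu)
    h2 = forward(hidden_2, h1, relu)
    return forward(output, h2, softmax)


sample = [rng.gauss(0, 1) for _ in range(N_FEATURES)]  # one z-scored record
probs = predict_proba(sample)
predicted_class = max(range(N_CLASSES), key=probs.__getitem__)
print(probs, predicted_class)
```

Training (30 epochs, batch size 128) would adjust the weights by backpropagation; the sketch only shows how a record flows through the layers to three class probabilities.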

2.5. Model Evaluation

In this research, we evaluated our multi-class arboviral disease classification models using accuracy, precision, recall, specificity, balanced accuracy, F1-score, and the area under the receiver operating characteristic (ROC) curve (AUC). These evaluation metrics are based on the confusion matrix, which tabulates the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts.
Accuracy measures the model’s performance according to the total samples correctly classified, and is defined as
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
Precision, also called positive predictive value, is the metric used to represent the proportion of positive classifications that are true positive, calculated as
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall, also called sensitivity, determines the proportion of real positives that were correctly classified, defined as
$$\text{Recall} = \frac{TP}{TP + FN}$$
Specificity determines the proportion of real negatives that were correctly classified, defined as
$$\text{Specificity} = \frac{TN}{TN + FP}$$
Balanced accuracy is the arithmetic mean of sensitivity and specificity, and is defined as
$$\text{Balanced accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$
F1-score is a metric that calculates the harmonic mean of precision and recall, and is defined as
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Precision, recall (sensitivity), specificity, F1-score, accuracy, and balanced accuracy were first calculated for each class. The metrics for multi-class classification were then obtained using macro averaging, i.e., by taking the average of the values obtained for the three classes. Balanced accuracy and AUC have been identified as more robust to imbalanced data than traditional accuracy [40]. In the multi-class classification task, balanced accuracy is defined as the average of per-class recall, which equals the macro-average recall.
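The definitions above can be turned into a small worked example. The sketch below derives one-vs-rest TP, FN, FP, and TN counts for each class from a 3 × 3 confusion matrix and then macro-averages the per-class metrics. The confusion matrix values are hypothetical, and the function names are illustrative; the study computed its metrics with scikit-learn.

```python
def per_class_metrics(cm, k):
    """One-vs-rest metrics for class k from a square confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(cm[i][k] for i in range(n)) - tp
    tn = total - tp - fn - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": 2 * precision * recall / (precision + recall),
    }


def macro_average(cm):
    """Unweighted mean of the per-class metrics over all classes."""
    per_class = [per_class_metrics(cm, k) for k in range(len(cm))]
    macro = {name: sum(m[name] for m in per_class) / len(per_class)
             for name in per_class[0]}
    # Following the text, multi-class balanced accuracy is the macro recall.
    macro["balanced_accuracy"] = macro["recall"]
    return macro


# Hypothetical counts: rows are true classes, columns are predicted classes,
# ordered as 0 = discarded, 1 = chikungunya, 2 = dengue.
cm = [
    [80, 5, 15],
    [2, 16, 2],
    [18, 4, 158],
]

chik = per_class_metrics(cm, 1)
macro = macro_average(cm)
print({name: round(value, 4) for name, value in chik.items()})
print({name: round(value, 4) for name, value in macro.items()})
```

Because macro averaging weights each class equally, a weak result on the small chikungunya class lowers the macro score as much as a weak result on the much larger dengue class would.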

2.6. Software

The proposed method was implemented with the TensorFlow and Keras frameworks. Training and testing were run on the free Google Colab cloud service. The evaluation metrics were calculated using the scikit-learn library, version 1.5.0.

3. Results

3.1. Data Characteristics

As shown in Table 3, 55.8% of cases were women, with a mean age of 33 years; for most cases, information on race, pregnancy stage, or education level was missing. The most common symptoms included fever (37.3%), headache (34.5%), myalgia (34%), retro-orbital pain (14.3%), and nausea (14.2%).
Table 3. Data characteristics of features.

3.2. ML Models’ Performance

Table 4 compares the performance of the ML models with and without applying the SMOTE technique to the training set. Before applying SMOTE, the XG model outperformed the others with a recall of 0.9333, a precision of 0.8711, an F1-score of 0.8983, and an AUC of 0.9831. After applying SMOTE, the RF algorithm showed the best performance among the ML models in classifying dengue, chikungunya, and discarded cases, with macro-average values of accuracy (0.9292), recall/balanced accuracy (0.9288), precision (0.9111), F1-score (0.9196), and AUC (0.9853).
Table 4. The performance of different ML models.
Figure 2 depicts 25 important variables chosen for model training after running the feature importance procedure with RF as baseline.
Figure 2. Important features selected using the RFECV technique with RF. The y-axis presents the 25 important features selected using the RFECV technique (blue bars). The x-axis presents the percentage of importance. (Note: SEM_PRI: Epidemiological week of onset of symptom; TPAUTOCTO: Indicates whether the case is indigenous to the area of residence; COMUNINF: City where the patient was infected; ID_MN_RESI: City of the patient; ID_REGIONA: Health care regional code (where the health unit or other reporting source is located); NU_IDADE_N: Patient age; CEFALEIA: Headache; FEBRE: Fever; ARTRALGIA: Arthralgia; DOR_RETRO: Retro-orbital pain; HEMATOLOG: Hematological disease; CS_RACA: Patient Race; CONJUNTVIT: Conjunctivitis; DOR_COSTAS: Back Pain; EXANTEMA: Rash; LACO: Tourniquet test; VOMITO: Vomiting; PETEQUIA_N: Petechiae; HIPERTENSA: Hypertension; ARTRITE: Arthritis; CS_ESCOL_N: Patient education; NAUSEA: Nausea; CS_GESTANT: Gestational Age of the Patient (Quarter) in case Sex is Female; AUTO_IMMUNE: Autoimmune disease; ACIDO_PEPT: Peptic acid disease).
Figure 3 shows the ROC curve of the RF algorithm for each class. On the validation set, the macro-average AUC was 0.9865, and the chikungunya class had the highest AUC value. On the internal test set, the RF model achieved a lower overall AUC of 0.8329; the highest AUC was observed in the discarded class. The RF model produced similar AUC values for the discarded class on both the validation and internal test sets, illustrating the high capability of the proposed algorithm in predicting discarded cases.
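Per-class and macro-average AUC values of this kind can be computed with a one-vs-rest scheme, sketched below using only the standard library. The predicted probabilities and labels are hypothetical, and taking the macro-average as the plain mean of the per-class AUCs is an assumption about how the reported value was obtained.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half. Quadratic in sample size,
    which is fine for a toy example."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probabilities, y_true, n_classes):
    """Per-class AUC (class k against all others) and their unweighted mean."""
    per_class = []
    for k in range(n_classes):
        scores = [row[k] for row in probabilities]
        labels = [1 if y == k else 0 for y in y_true]
        per_class.append(binary_auc(scores, labels))
    return per_class, sum(per_class) / n_classes


# Hypothetical predictions; columns are 0 = discarded, 1 = chikungunya, 2 = dengue.
y_true = [0, 0, 1, 1, 2, 2]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.4, 0.1, 0.5],
    [0.2, 0.6, 0.2],
    [0.3, 0.5, 0.2],
    [0.1, 0.2, 0.7],
    [0.5, 0.1, 0.4],
]

per_class_auc, macro_auc = one_vs_rest_auc(probabilities, y_true, n_classes=3)
print(per_class_auc, round(macro_auc, 4))
```

Because each class is scored against all the others, a model can rank one class almost perfectly while separating the remaining two less well, which is the pattern the per-class curves in Figure 3 are meant to expose.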
Figure 3. ROC curve of multi-class classification using Random Forest model on the validation (left) and internal test set (right). ROC curves are indicated in orange (chikungunya), green (dengue), blue (discarded) and blue dashed (macro-average) lines.

3.3. The Performance of DL Model

Table 5 summarizes the performance of the proposed DL model, which achieved a macro-average specificity of 0.8984, recall/balanced accuracy of 0.8401, and AUC of 0.8693. The dengue class achieved the best specificity (0.9879), precision (0.9283), and F1-score (0.8297). The chikungunya class achieved the best recall (0.9986), balanced accuracy (0.9283), and AUC (0.9283). These results indicate that the ANN model worked best as a screening tool for chikungunya cases.
Table 5. Proposed DL model’s performance.
Figure 4 compares the performance of the multi-class models on the validation and internal test sets. On the validation set, the AUC of the ML model (RF) was higher than that of the DL model (ANN) in all three classes. In contrast, on the internal test set, the ANN model performed better than RF in classifying the dengue and chikungunya classes, but not the discarded class. In addition, RF showed the best result in predicting the discarded class, with similar AUC values on the validation and internal test sets, while the ANN model showed stable performance on both datasets in all three classes.
Figure 4. Comparison of AUC values on multi-class classification task between ML and DL models on the validation and internal test set. AUC values are indicated in blue (RF-Val: validation set using Random Forest), orange (RF-In: internal test set using Random Forest), gray (DL-Val: validation set using Deep Learning), and yellow (DL-In: internal test set using Deep Learning) bar. (Notes: DEN: dengue class; CHIK: chikungunya class; DIS: discarded class).

4. Discussion

In this study, we investigated different ML and ANN models for differentiating dengue and chikungunya from discarded cases. Our work suggests that the RF model works best, with a macro-averaged recall of 0.9288, precision of 0.9111, and F1-score of 0.9196. These values are higher than those of a previous work that used the GB model on a balanced dataset and achieved a precision, recall, and F1-score of 0.6257, 0.6205, and 0.6196, respectively [34]. Previous studies also indicated that tree-based ML algorithms, such as DT and RF, achieved better performance in multi-class classification of the target variable [33,34]. A recent study also agreed that a tree-based ML model could be a suitable choice for building a decision-making support application and deploying it on portable devices to assist doctors and nurses in diagnosing dengue patients [41].
Some clinical symptoms that have been reported as important in dengue diagnosis were absent from the input data used for training the models. For instance, abdominal pain and myalgia were identified as better predictors of dengue infections with logistic regression and decision-tree models [36]. Some comorbidities, like pre-existing renal disease or diabetes, were ignored by the models although these conditions are important risk factors for SD [42]. Diabetes is also associated with an increased risk of severe outcomes in dengue and West Nile fever [43].
Moreover, our study exclusively used demographic and clinical data to train the models, which achieved high performance. From a clinical perspective, epidemiological and demographic variables are perceived as less influential and are usually ignored when diagnosing patients with dengue or chikungunya. In a previous study, experienced physicians selected only clinical symptoms, two pre-existing diseases (diabetes and hypertension), and days from symptom onset as input data for training ML models [34]. Prior investigations [34,41], along with our study, indicate that epidemiological and demographic data, such as gender, age, indigenous status, and epidemiological week of symptom onset, are important features for differentiating dengue from other arboviral diseases. For chikungunya, Vidal et al. reported gender differences in virus infection, with chikungunya symptoms being more frequent in women than in men [44]; these results are similar to our study’s findings. Researchers also observed that male patients confirmed to have dengue required a longer recovery time than female patients, and that patient age has a significant positive correlation with the number of clinical symptoms [45].
Our study has some advantages. First, we utilized a dataset of more than one million records as input, representing the largest dataset to date used for the multi-class classification task of dengue and chikungunya. Dataset size can affect model performance, as a larger input generally enhances the accuracy of both ML and DL algorithms [46]. Second, the input data were highly imbalanced across target categories, similar to real-world conditions, whereas two previous studies used the same dataset with balanced records in each category of the target feature [34,47]. Lastly, the use of an internal test set to evaluate the performance of the proposed algorithms is a distinctive approach. To the best of our knowledge, this is the first study to incorporate an internal test set to evaluate model performance for arboviral diseases.
The use of an internal test set enhances the reliability of ML and DL algorithms when predicting data not previously encountered. When deployed through computer interfaces or portable devices, these models can assist frontline healthcare workers by providing accurate and timely differentiation of arboviral diseases such as dengue and chikungunya. For instance, a young physician in a remote area of one province could use the model, enter the epidemiological and clinical information of a new patient who comes from another province, and obtain a reliable diagnosis of that patient to plan medical assistance, such as hospitalization. This model could also be trained with other arboviral diseases, like Zika or yellow fever, for active disease surveillance and case management in the field.
However, this study has some limitations. First, our data lacked some important features, such as the number of days from symptom onset, one of the critical criteria for deciding which laboratory test to use. If symptom onset was less than 5 days ago, the NS1 antigen test kit may be preferred, whereas IgM is used after day 5 of onset [18]. A second limitation concerns the approach used to address the imbalanced dataset. We only applied the SMOTE technique, which made it difficult to compare the models’ performance with previous studies that either used balanced data or applied other methods such as downsampling. In addition, the lack of an interpretability analysis of the top predictive features, which is needed for clinical adoption, is another limitation of this study; we hope to apply SHAP analysis in future work to better understand which features contribute most to disease prediction.

5. Conclusions

In this study, we developed a multi-class classification method for predicting dengue and chikungunya using ML and DL models. Our proposed models showed promising results as a decision-making support system to assist physicians in differentiating dengue from chikungunya and inconclusive cases with high sensitivity, particularly in settings where laboratory testing is not readily accessible. Having been evaluated on an internal test set, these models have potential application as a supportive tool for screening dengue and chikungunya in Brazilian populations. Future work will continue to enhance the capability of these algorithms to differentiate more arboviral diseases, such as Zika, yellow fever, and West Nile fever.

Author Contributions

Conceptualization: T.H.N., N.Q.K.L.; methodology: T.H.N., N.Q.K.L.; software: N.Q.K.L.; validation: N.Q.K.L.; formal analysis: T.H.N.; investigation: T.H.N.; resources: N.Q.K.L.; data curation: N.Q.K.L.; writing—original draft preparation: T.H.N., N.Q.K.L.; writing—review and editing: T.H.N., N.Q.K.L.; visualization: N.Q.K.L.; supervision: N.Q.K.L.; project administration: N.Q.K.L.; funding acquisition: N.Q.K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All data used in this study are publicly available and were obtained from the open-access dataset shared by da Silva Neto et al. (2022) [37] in Scientific Data [https://www.nature.com/articles/s41597-022-01312-7] (accessed on 1 June 2025). The dataset includes anonymized epidemiological records of dengue and chikungunya cases in Brazil from 2013 to 2020. No additional restrictions apply to the use of this dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AD: Adaptive boosting
AI: Artificial intelligence
ANN: Artificial neural network
AUC: Area under the curve
DF: Dengue fever
DHF: Dengue hemorrhagic fever
DL: Deep learning
DNN: Deep neural network
DT: Decision tree
ETC: Extra tree classifier
FP: False positive
FN: False negative
GB: Gradient boosting machine
KNN: k-nearest neighbor
LR: Logistic regression
ML: Machine learning
NB: Naïve Bayes
NTDs: Neglected tropical diseases
RF: Random forest
RFECV: Recursive Feature Elimination with Cross-Validation
ROC: Receiver operating characteristics
SD: Severe dengue
SMO: Sequential minimal optimization
SMOTE: Synthetic minority oversampling technique
SVC: Support vector classifier
SVM: Support vector machine
TN: True negative
TP: True positive
WHO: World Health Organization
XG: eXtreme gradient boosting

References

  1. Soni, S.; Gill, V.J.S.; Singh, J.; Chhabra, J.; Gill, G.J.S.; Bakshi, R. Dengue, Chikungunya, and Zika: The causes and threats of emerging and re-emerging arboviral diseases. Cureus 2023, 15, e41717. [Google Scholar] [CrossRef]
  2. Colón-González, F.J.; Sewe, M.O.; Tompkins, A.M.; Sjödin, H.; Casallas, A.; Rocklöv, J.; Caminade, C.; Lowe, R. Projecting the risk of mosquito-borne diseases in a warmer and more populated world: A multi-model, multi-scenario intercomparison modelling study. Lancet Planet. Health 2021, 5, e404–e414. [Google Scholar] [CrossRef] [PubMed]
  3. Mohapatra, R.K.; Bhattacharjee, P.; Desai, D.N.; Kandi, V.; Sarangi, A.K.; Mishra, S.; Sah, R.; Ibrahim, A.A.A.; Rabaan, A.A.; Zahan, K.E. Global health concern on the rising dengue and chikungunya cases in the American regions: Countermeasures and preparedness. Health Sci. Rep. 2024, 7, e1831. [Google Scholar] [CrossRef] [PubMed]
  4. Hotez, P.J.; Aksoy, S.; Brindley, P.J.; Kamhawi, S. World neglected tropical diseases day. PLoS Neglected Trop. Dis. 2020, 14, e0007999. [Google Scholar] [CrossRef]
  5. Casulli, A. New global targets for NTDs in the WHO roadmap 2021–2030. PLoS Neglected Trop. Dis. 2021, 15, e0009373. [Google Scholar] [CrossRef] [PubMed]
  6. The Lancet. Dengue: The threat to health now and in the future. Lancet 2024, 404, 311. [Google Scholar] [CrossRef]
  7. Li, X.-C.; Zhang, Y.-Y.; Zhang, Q.-Y.; Liu, J.-S.; Ran, J.-J.; Han, L.-F.; Zhang, X.-X. Global burden of viral infectious diseases of poverty based on Global Burden of Diseases Study 2021. Infect. Dis. Poverty 2024, 13, 53–67. [Google Scholar] [CrossRef]
  8. Deng, J.; Zhang, H.; Wang, Y.; Liu, Q.; Du, M.; Yan, W.; Qin, C.; Zhang, S.; Chen, W.; Zhou, L. Global, regional, and national burden of dengue infection in children and adolescents: An analysis of the Global Burden of Disease Study 2021. eClinicalMedicine 2024, 78, 102943. [Google Scholar] [CrossRef]
  9. Wilder-Smith, A.; Ooi, E.-E.; Horstick, O.; Wills, B. Dengue. Lancet 2019, 393, 350–363. [Google Scholar] [CrossRef]
  10. Valencia, B.M.; Sigera, P.C.; Weeratunga, P.; Tedla, N.; Fernando, D.; Rajapakse, S.; Lloyd, A.R.; Rodrigo, C. Effect of prior Zika and dengue virus exposure on the severity of a subsequent dengue infection in adults. Sci. Rep. 2022, 12, 17225. [Google Scholar] [CrossRef]
  11. Weaver, S.C.; Charlier, C.; Vasilakis, N.; Lecuit, M. Zika, chikungunya, and other emerging vector-borne viral diseases. Annu. Rev. Med. 2018, 69, 395–408. [Google Scholar] [CrossRef]
  12. de Souza, W.M.; Ribeiro, G.S.; de Lima, S.T.; de Jesus, R.; Moreira, F.R.; Whittaker, C.; Sallum, M.A.M.; Carrington, C.V.; Sabino, E.C.; Kitron, U. Chikungunya: A decade of burden in the Americas. Lancet Reg. Health–Am. 2024, 30, 100673. [Google Scholar] [CrossRef] [PubMed]
  13. De Roo, A.M.; Vondeling, G.T.; Boer, M.; Murray, K.; Postma, M.J. The global health and economic burden of chikungunya from 2011 to 2020: A model-driven analysis on the impact of an emerging vector-borne disease. BMJ Glob. Health 2024, 9, e016648. [Google Scholar] [CrossRef] [PubMed]
  14. Kang, H.; Auzenbergs, M.; Clapham, H.; Maure, C.; Kim, J.-H.; Salje, H.; Taylor, C.G.; Lim, A.; Clark, A.; Edmunds, W.J. Chikungunya seroprevalence, force of infection, and prevalence of chronic disability after infection in endemic and epidemic settings: A systematic review, meta-analysis, and modelling study. Lancet Infect. Dis. 2024, 24, 488–503. [Google Scholar] [CrossRef] [PubMed]
  15. Zaid, A.; Burt, F.J.; Liu, X.; Poo, Y.S.; Zandi, K.; Suhrbier, A.; Weaver, S.C.; Texeira, M.M.; Mahalingam, S. Arthritogenic alphaviruses: Epidemiological and clinical perspective on emerging arboviruses. Lancet Infect. Dis. 2021, 21, e123–e133. [Google Scholar] [CrossRef]
  16. Alvarado, L.I.; Lorenzi, O.D.; Torres-Velásquez, B.C.; Sharp, T.M.; Vargas, L.; Muñoz-Jordán, J.L.; Hunsperger, E.A.; Pérez-Padilla, J.; Rivera, A.; González-Zeno, G.E. Distinguishing patients with laboratory-confirmed chikungunya from dengue and other acute febrile illnesses, Puerto Rico, 2012–2015. PLoS Neglected Trop. Dis. 2019, 13, e0007562. [Google Scholar] [CrossRef]
  17. Batista, R.P.; Hökerberg, Y.H.M.; de Oliveira, R.d.V.C.; Lambert Passos, S.R. Development and validation of a clinical rule for the diagnosis of chikungunya fever in a dengue-endemic area. PLoS ONE 2023, 18, e0279970. [Google Scholar] [CrossRef]
  18. Fischer, C.; Jo, W.K.; Haage, V.; Moreira-Soto, A.; de Oliveira Filho, E.F.; Drexler, J.F. Challenges towards serologic diagnostics of emerging arboviruses. Clin. Microbiol. Infect. 2021, 27, 1221–1229. [Google Scholar] [CrossRef]
  19. Malavige, G.N.; Sjö, P.; Singh, K.; Piedagnel, J.-M.; Mowbray, C.; Estani, S.; Lim, S.C.L.; Siquierra, A.M.; Ogg, G.S.; Fraisse, L. Facing the escalating burden of dengue: Challenges and perspectives. PLoS Glob. Public Health 2023, 3, e0002598. [Google Scholar] [CrossRef]
  20. Kasbergen, L.M.; Nieuwenhuijse, D.F.; de Bruin, E.; Sikkema, R.S.; Koopmans, M.P. The increasing complexity of arbovirus serology: An in-depth systematic review on cross-reactivity. PLoS Neglected Trop. Dis. 2023, 17, e0011651. [Google Scholar] [CrossRef]
  21. Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron. Mark. 2021, 31, 685–695. [Google Scholar] [CrossRef]
  22. Adlung, L.; Cohen, Y.; Mor, U.; Elinav, E. Machine learning in clinical decision making. Med 2021, 2, 642–665. [Google Scholar] [CrossRef] [PubMed]
  23. Sarker, I.H. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2021, 2, 420. [Google Scholar] [CrossRef] [PubMed]
  24. Vishnumon, S.; Jose, S.A.; Jirawattanapanit, A. Integrating machine learning techniques for enhanced prognostic modeling of heart failure risk in the american population. Phys. Scr. 2025, 100, 076011. [Google Scholar] [CrossRef]
  25. Attai, K.; Amannejad, Y.; Vahdat Pour, M.; Obot, O.; Uzoka, F.-M. A systematic review of applications of machine learning and other soft computing techniques for the diagnosis of tropical diseases. Trop. Med. Infect. Dis. 2022, 7, 398. [Google Scholar] [CrossRef]
  26. Kaur, I.; Sandhu, A.K.; Kumar, Y. Artificial intelligence techniques for predictive modeling of vector-borne diseases and its pathogens: A systematic review. Arch. Comput. Methods Eng. 2022, 29, 3741–3771. [Google Scholar] [CrossRef]
  27. Hoyos, W.; Aguilar, J.; Toro, M. Dengue models based on machine learning techniques: A systematic literature review. Artif. Intell. Med. 2021, 119, 102157. [Google Scholar] [CrossRef]
  28. Das, S.; Roy, S.; Bir, A.; Ghosh, A.; Bhattacharyya, T.K.; Lahiri, P.; Lahiri, B. FTIR-based molecular fingerprinting for the rapid classification of dengue and chikungunya from human sera using machine learning: An observational study. Lancet Reg. Health-Southeast Asia 2025, 40, 100630. [Google Scholar] [CrossRef]
  29. Ho, T.-S.; Weng, T.-C.; Wang, J.-D.; Han, H.-C.; Cheng, H.-C.; Yang, C.-C.; Yu, C.-H.; Liu, Y.-J.; Hu, C.H.; Huang, C.-Y. Comparing machine learning with case-control models to identify confirmed dengue cases. PLoS Neglected Trop. Dis. 2020, 14, e0008843. [Google Scholar] [CrossRef]
  30. Abdualgalil, B.; Abraham, S.; Ismael, W.M. Early diagnosis for dengue disease prediction using efficient machine learning techniques based on clinical data. J. Robot. Control 2022, 3, 257–268. [Google Scholar] [CrossRef]
  31. Phakhounthong, K.; Chaovalit, P.; Jittamala, P.; Blacksell, S.D.; Carter, M.J.; Turner, P.; Chheng, K.; Sona, S.; Kumar, V.; Day, N.P. Predicting the severity of dengue fever in children on admission based on clinical features and laboratory indicators: Application of classification tree analysis. BMC Pediatr. 2018, 18, 109. [Google Scholar] [CrossRef]
  32. Huang, S.-W.; Tsai, H.-P.; Hung, S.-J.; Ko, W.-C.; Wang, J.-R. Assessing the risk of dengue severity using demographic information and laboratory test results with machine learning. PLoS Neglected Trop. Dis. 2020, 14, e0008960. [Google Scholar] [CrossRef] [PubMed]
  33. Lee, V.J.; Chow, A.; Zheng, X.; Carrasco, L.R.; Cook, A.R.; Lye, D.C.; Ng, L.-C.; Leo, Y.-S. Simple clinical and laboratory predictors of Chikungunya versus dengue infections in adults. PLoS Neglected Trop. Dis. 2012, 6, e1786. [Google Scholar] [CrossRef] [PubMed]
  34. Tabosa de Oliveira, T.; da Silva Neto, S.R.; Teixeira, I.V.; Aguiar de Oliveira, S.B.; de Almeida Rodrigues, M.G.; Sampaio, V.S.; Endo, P.T. A comparative study of machine learning techniques for multi-class classification of arboviral diseases. Front. Trop. Dis. 2022, 2, 769968. [Google Scholar] [CrossRef]
  35. da Silva Neto, S.R.; Tabosa Oliveira, T.; Teixeira, I.V.; Aguiar de Oliveira, S.B.; Souza Sampaio, V.; Lynn, T.; Endo, P.T. Machine learning and deep learning techniques to support clinical diagnosis of arboviral diseases: A systematic review. PLoS Neglected Trop. Dis. 2022, 16, e0010061. [Google Scholar] [CrossRef]
  36. Shenoy, S.; Rajan, A.K.; Rashid, M.; Chandran, V.P.; Poojari, P.G.; Kunhikatta, V.; Acharya, D.; Nair, S.; Varma, M.; Thunga, G. Artificial intelligence in differentiating tropical infections: A step ahead. PLoS Neglected Trop. Dis. 2022, 16, e0010455. [Google Scholar] [CrossRef]
  37. da Silva Neto, S.R.; Tabosa de Oliveira, T.; Teixiera, I.V.; Medeiros Neto, L.; Souza Sampaio, V.; Lynn, T.; Endo, P.T. Arboviral disease record data-dengue and chikungunya, brazil, 2013–2020. Sci. Data 2022, 9, 198. [Google Scholar] [CrossRef]
  38. Liu, J. Importance-SMOTE: A synthetic minority oversampling method for noisy imbalanced data. Soft Comput. 2022, 26, 1141–1163. [Google Scholar] [CrossRef]
  39. Dargan, S.; Kumar, M.; Ayyagari, M.R.; Kumar, G. A survey of deep learning and its applications: A new paradigm to machine learning. Arch. Comput. Methods Eng. 2020, 27, 1071–1092. [Google Scholar] [CrossRef]
  40. Thölke, P.; Mantilla-Ramos, Y.-J.; Abdelhedi, H.; Maschke, C.; Dehgan, A.; Harel, Y.; Kemtur, A.; Berrada, L.M.; Sahraoui, M.; Young, T. Class imbalance should not throw you off balance: Choosing the right classifiers and performance metrics for brain decoding with imbalanced data. NeuroImage 2023, 277, 120253. [Google Scholar] [CrossRef]
  41. Bohm, B.C.; Borges, F.E.d.M.; Silva, S.C.M.; Soares, A.T.; Ferreira, D.D.; Belo, V.S.; Lignon, J.S.; Bruhn, F.R.P. Utilization of machine learning for dengue case screening. BMC Public Health 2024, 24, 1573. [Google Scholar] [CrossRef]
  42. Tsheten, T.; Clements, A.C.; Gray, D.J.; Adhikary, R.K.; Furuya-Kanamori, L.; Wangdi, K. Clinical predictors of severe dengue: A systematic review and meta-analysis. Infect. Dis. Poverty 2021, 10, 123. [Google Scholar] [CrossRef] [PubMed]
  43. Lu, H.-Z.; Xie, Y.-Z.; Gao, C.; Wang, Y.; Liu, T.-T.; Wu, X.-Z.; Dai, F.; Wang, D.-Q.; Deng, S.-Q. Diabetes mellitus as a risk factor for severe dengue fever and West Nile fever: A meta-analysis. PLoS Neglected Trop. Dis. 2024, 18, e0012217. [Google Scholar] [CrossRef] [PubMed]
  44. Vidal, O.M.; Acosta-Reyes, J.; Padilla, J.; Navarro-Lechuga, E.; Bravo, E.; Viasus, D.; Arcos-Burgos, M.; Vélez, J.I. Chikungunya outbreak (2015) in the Colombian Caribbean: Latent classes and gender differences in virus infection. PLoS Neglected Trop. Dis. 2020, 14, e0008281. [Google Scholar] [CrossRef] [PubMed]
  45. Prattay, K.M.R.; Sarkar, M.R.; Shafiullah, A.Z.M.; Islam, M.S.; Raihan, S.Z.; Sharmin, N. A retrospective study on the socio-demographic factors and clinical parameters of dengue disease and their effects on the clinical course and recovery of the patients in a tertiary care hospital of Bangladesh. PLoS Neglected Trop. Dis. 2022, 16, e0010297. [Google Scholar] [CrossRef]
  46. Rácz, A.; Bajusz, D.; Héberger, K. Effect of dataset size and train/test split ratios in QSAR/QSPR multiclass classification. Molecules 2021, 26, 1111. [Google Scholar] [CrossRef]
  47. Arrubla-Hoyos, W.; Gómez, J.G.; De-La-Hoz-Franco, E. Methodology for the Differential Classification of Dengue and Chikungunya According to the PAHO 2022 Diagnostic Guide. Viruses 2024, 16, 1088. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
