Systematic Review

Artificial Intelligence Models for Forecasting Mosquito-Borne Viral Diseases in Human Populations: A Global Systematic Review and Comparative Performance Analysis

1 Faculty of Medicine, University Vita-Salute San Raffaele, 20132 Milan, Italy
2 PhD National Program in One Health Approaches to Infectious Diseases and Life Science Research, Department of Public Health, Experimental and Forensic Medicine, University of Pavia, 27100 Pavia, Italy
3 Division of Public Health, Infectious Diseases and Occupational Medicine, Department of Medicine, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN 55905, USA
4 Department of Infectious Diseases, “Luigi Sacco” University Hospital, Azienda Socio-Sanitaria Territoriale (ASST) Fatebenefratelli FBF Sacco, 20157 Milan, Italy
5 Regional Health Care and Social Agency of Lodi, Azienda Socio-Sanitaria Territoriale (ASST) Lodi, 26900 Lodi, Italy
6 Local Health Unit of Trapani, ASP Trapani, 91100 Trapani, Italy
7 Department of Biomedical and Clinical Sciences “L. Sacco”, University of Milan, 20157 Milan, Italy
8 Centre for Multidisciplinary Research in Health Science (MACH), University of Milan, 20122 Milan, Italy
9 Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padua, 35128 Padova, Italy
* Authors to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(1), 15; https://doi.org/10.3390/make8010015
Submission received: 29 November 2025 / Revised: 22 December 2025 / Accepted: 24 December 2025 / Published: 7 January 2026
(This article belongs to the Section Thematic Reviews)

Abstract

Background: Mosquito-borne viral diseases are a growing global health threat, and artificial intelligence (AI) and machine learning (ML) are increasingly proposed as forecasting tools to support early-warning and response. However, the available evidence is fragmented across pathogens, settings and modelling approaches. This review provides, to the best of our knowledge, the first comprehensive comparative assessment of AI/ML models forecasting mosquito-borne viral diseases in human populations, jointly synthesising predictive performance across model families and appraising both methodological quality and operational readiness. Methods: Following PRISMA 2020, we searched PubMed, Embase and Scopus up to August 2025. We included studies applying AI/ML or statistical models to predict arboviral incidence, outbreaks or temporal trends and reporting at least one quantitative performance metric. Given the substantial heterogeneity in outcomes, predictors and time–space scales, we conducted a descriptive synthesis. Risk of bias and applicability were evaluated using PROBAST. Results: Ninety-eight studies met the inclusion criteria, of which 91 focused on dengue. The forecasts spanned national to city-level settings and annual-to-weekly resolutions. Across classification tasks, tree-ensemble models showed the most consistent performance, with accuracies typically above 0.85, while classical ML and deep-learning models showed wider variability. For regression tasks, errors increased with temporal horizon and spatial aggregation: short-term, fine-scale forecasts (e.g., weekly city level) often achieved low absolute errors, whereas long-horizon national models frequently exhibited very large errors and unstable performance. PROBAST assessment indicated that most studies (63/98) were at high risk of bias, with only 24 judged at low risk and limited external validation. Conclusions: AI/ML models, especially tree-ensemble approaches, show strong potential for short-term, fine-scale forecasting, but their reliability drops substantially at broader spatial and temporal scales. Most remain research-stage, with limited external validation and minimal operational deployment. This review clarifies current capabilities and highlights three priorities for real-world use: standardised reporting, rigorous external validation, and context-specific calibration.


1. Introduction

In recent decades, arboviral diseases such as dengue, Zika, chikungunya and yellow fever have posed an increasingly difficult challenge to global public health officials [1]; urbanization, climate change and globalization have facilitated the expansion and survival of competent mosquito vectors such as Aedes aegypti and Aedes albopictus in new ecological niches [2,3,4]. While these infections have always been responsible for significant mortality, morbidity and economic burden in tropical and subtropical regions [5,6], an increasing number of imported and autochthonous outbreaks has recently been described at higher latitudes [7]. Early detection of outbreaks and accurate prediction of transmission are essential for effective containment and mitigation, supporting vector surveillance, clinical practice, and efficient resource allocation [8]. Conventional surveillance and early-warning methods, while effective, are usually limited by their dependence on delayed case reporting or limited climatic proxies [9,10,11].

In this context, artificial intelligence (AI) and machine learning (ML) techniques have emerged as powerful tools for infectious disease modelling [12,13]; the possibility of integrating high-dimensional and heterogeneous data, such as meteorological, environmental and demographic variables, allows for a deeper and more accurate understanding of arboviral disease transmission [14,15,16]. Despite this promise, the literature remains largely devoid of systematic, comparative evaluations that can clarify when and where AI-driven models for arboviral forecasting are truly useful. Existing reviews typically focus on single pathogens, narrow modelling families, or descriptive overviews, leaving unresolved questions about real-world applicability. The marked heterogeneity in data sources, algorithmic approaches and temporal and spatial forecasting scales further hampers meaningful comparison across studies and limits the ability to identify which modelling strategies consistently deliver reliable predictions [17]. In addition, direct evaluations of AI/ML models against traditional statistical methods are uncommon, and critical methodological aspects, such as validation strategies, risk of bias, and generalisability across spatial and temporal scales, are often insufficiently addressed.

To address these limitations, this systematic review aims to comparatively evaluate the performance of AI and ML models developed to forecast mosquito-borne viral infections in human populations. Specifically, we synthesise predictive performance across different modelling families and epidemiological contexts, compare AI/ML approaches with traditional statistical models when available, and assess methodological quality and implementation readiness. By doing so, this review seeks to clarify the current capabilities and limitations of AI-driven forecasting models and to inform their appropriate use in public health surveillance and early-warning systems.

2. Materials and Methods

2.1. Search Strategy

This systematic review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) guidelines. The protocol was prospectively registered in PROSPERO (CRD420251124153). The review addressed the following question: “What is the performance of AI and ML models in predicting the temporal trend and incidence peaks of mosquito-borne diseases, such as dengue, Zika, chikungunya, West Nile virus, yellow fever, and Rift Valley fever, compared with traditional statistical approaches?”. A comprehensive literature search was conducted in PubMed, Embase, and Scopus up to 11 August 2025. The search combined controlled vocabulary (MeSH terms) and free-text terms related to artificial intelligence, machine learning, deep learning, and mosquito-borne diseases (including dengue, Zika, chikungunya, West Nile, yellow fever, and Rift Valley fever). Reference lists of included studies and relevant reviews were screened manually, and field experts were contacted to identify additional publications or unpublished data. The full search strings for each database are available in Supplementary Table S1.

2.2. Eligibility Criteria

Studies were included if they (i) involved human populations at risk of, or with confirmed cases of, mosquito-borne viral diseases; (ii) developed, validated, or applied AI/ML-based models to predict disease incidence trends or peaks; and (iii) reported at least one quantitative performance metric. Eligible outcomes comprised classification metrics such as Area Under the Curve (AUC), sensitivity, specificity, positive predictive value (PPV)/precision, negative predictive value (NPV), accuracy, and F1-score, as well as regression metrics, including mean absolute error (MAE), root mean squared error (RMSE), mean squared error (MSE), mean absolute percentage error (MAPE), symmetric mean absolute percentage error (SMAPE), coefficient of determination (R2), and correlation coefficient (r). Comparator models, when reported, consisted of alternative AI/ML techniques or classical statistical models applied to the same dataset. Eligible study designs included retrospective and prospective cohort studies, case–control studies, cross-sectional studies, and diagnostic accuracy studies. In addition, modelling studies based on secondary data sources—such as surveillance databases, electronic health records, or environmental and climatic datasets—were included, provided that they reported at least one quantitative performance metric related to the predictive or diagnostic task. Lastly, only original research articles published in peer-reviewed journals were retained, and only articles written in English were considered, without date restrictions. Conversely, studies focusing exclusively on non-human or vector-only data, laboratory research without human health outcomes, or environmental data without direct linkage to human case prediction or diagnosis were excluded. Studies available only as abstracts, conference proceedings, book chapters, letters to the editor, commentaries, reviews, or non–peer-reviewed reports were excluded, as were full texts that could not be retrieved and duplicated datasets published in multiple papers.
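For reference, these regression metrics were interpreted according to their standard definitions (a generic sketch, not study-specific formulas). Writing $y_t$ for the observed value, $\hat{y}_t$ for the prediction, and $\bar{y}$ for the mean observation over $n$ time points:

$$\mathrm{MAE}=\frac{1}{n}\sum_{t=1}^{n}\lvert y_t-\hat{y}_t\rvert,\qquad \mathrm{RMSE}=\sqrt{\mathrm{MSE}}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t-\hat{y}_t\right)^2},\qquad \mathrm{MAPE}=\frac{100}{n}\sum_{t=1}^{n}\left\lvert\frac{y_t-\hat{y}_t}{y_t}\right\rvert,$$

$$\mathrm{SMAPE}=\frac{100}{n}\sum_{t=1}^{n}\frac{2\,\lvert y_t-\hat{y}_t\rvert}{\lvert y_t\rvert+\lvert\hat{y}_t\rvert},\qquad R^2=1-\frac{\sum_{t=1}^{n}\left(y_t-\hat{y}_t\right)^2}{\sum_{t=1}^{n}\left(y_t-\bar{y}\right)^2}.$$

MAE, MSE, and RMSE inherit the unit and scale of the outcome, whereas MAPE, SMAPE, R2, and r (Pearson’s correlation) are scale-free; this distinction motivates the scale-extraction step described in Section 2.4.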

2.3. Study Selection

All retrieved records were imported into reference management software, and duplicates were removed. Two reviewers (FP and AP) independently screened titles and abstracts to identify potentially eligible studies. Any discrepancies between the two reviewers were resolved by discussion, with arbitration by a third reviewer (FB) when necessary. The full texts of the selected articles were then assessed independently for inclusion against the predefined criteria.

2.4. Data Extraction

Data were extracted using a pre-tested form developed in Microsoft Excel. The extracted information included bibliographic details (first author, year of publication, continent, and country), study characteristics (design, period, setting, and population), disease features (type, case definition, prediction horizon, and disease definition), and methodological aspects such as handling of missing or imbalanced data, calibration procedures, data splitting strategy, and implementation readiness.
Information on the data sources used for model development was recorded and categorized as follows: epidemiological (e.g., surveillance data, incidence, outbreak reports); clinical (e.g., laboratory-confirmed cases, hospital records); climatic or environmental (e.g., temperature, rainfall, vegetation index); socio-demographic (e.g., population density, age distribution, poverty, urbanization); mobility or transport (e.g., travel and migration flows, mobile phone data); big data (e.g., social media, web search queries); genomic or virological (e.g., viral sequences, genotypes, serotypes); health system (e.g., access to care, diagnostic capacity); policy or intervention (e.g., vaccination coverage, vector control campaigns); landscape (e.g., agricultural or forested land); and entomological (e.g., mosquito density, breeding sites, vector indices).
For each AI/ML model, the principal algorithm, the number of variables included versus considered, dataset structure and split, and performance metrics were extracted. When available, classification and regression metrics were recorded.
If a comparator model was present, the same performance metrics and details on dataset structure and validation type (internal or external) were also collected. When necessary, authors were contacted to obtain missing or unclear information. Data extraction was conducted by one reviewer and verified by another to ensure accuracy.
To ensure comparability across studies, all predictive models—whether primary models or comparators—were classified into seven predefined categories based on their underlying methodological structure:
(i) Classical machine learning (NARX neural networks, decision trees, AutoTiC-NN, feed-forward neural networks, ANN, backpropagation NN, SVR, SVM, LASSO, Naive Bayesian Network, Bayesian Network, logistic regression, multiple linear regression, generalized linear models, Gaussian processes, regression models);
(ii) Tree-ensemble methods (Random Forest, Extra Trees Classifier, Gradient Boosting, AdaBoost, GBM, BRT, XGBoost, LightGBM, CatBoost, CART);
(iii) Deep learning (BiLSTM, CNN, LSTM, GRU, RNN, DFFN, MobileNetV3, ResNet50, CNN-BiLSTM, CNN-BiGRU with Attention, ConvLSTM, stacked LSTM/BiLSTM, hybrid CNN–LSTM architectures, XEWNet, EWNet, transformer-based models, NBeatsX);
(iv) Time-series and statistical models (NNAR, SARIMA, ARIMA, VAR, naïve or moving-average baselines, temporal-average baselines, Poisson regression, SARIMAX, Prophet);
(v) Mechanistic models (WRF, SIR + EAKF, SI–SIR);
(vi) Other or heuristic approaches (e.g., GANN, ANFIS, Differential Evolution, fuzzy systems, DIR);
(vii) Hybrid or superensemble models, defined as models integrating two or more techniques from different categories.
The models combining multiple techniques within the same category were classified according to that category and not considered hybrid. This grouping was used consistently across all metric-specific visual summaries.
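As a minimal illustration of this categorisation step (the mapping below is a truncated, hypothetical sketch rather than the full extraction form), the assignment logic, including the within-category rule for combined techniques, can be expressed as:

```python
# Illustrative sketch of the model-family assignment used during extraction.
# The membership sets are truncated examples, not the complete lists above.
MODEL_FAMILIES = {
    "classical_ml": {"ANN", "SVM", "SVR", "LASSO", "decision tree",
                     "logistic regression", "multiple linear regression"},
    "tree_ensemble": {"Random Forest", "XGBoost", "LightGBM", "CatBoost",
                      "AdaBoost", "Gradient Boosting", "Extra Trees", "CART"},
    "deep_learning": {"LSTM", "BiLSTM", "CNN", "GRU", "ConvLSTM", "transformer"},
    "time_series_statistical": {"ARIMA", "SARIMA", "SARIMAX", "Prophet",
                                "Poisson regression", "naive baseline"},
    "mechanistic": {"WRF", "SIR+EAKF", "SI-SIR"},
    "other_heuristic": {"GANN", "ANFIS", "fuzzy system", "DIR"},
}

def classify_model(components: list[str]) -> str:
    """Assign a model to one of the seven predefined families.

    Combinations within a single family keep that family's label;
    combinations spanning families are labelled hybrid/superensemble.
    """
    families = {family for name in components
                for family, members in MODEL_FAMILIES.items() if name in members}
    if len(families) == 1:
        return families.pop()
    return "hybrid_superensemble" if families else "unclassified"

print(classify_model(["CNN", "LSTM"]))  # deep_learning (same family, not hybrid)
print(classify_model(["CNN", "SVM"]))   # hybrid_superensemble (two families)
```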
Because regression-based performance metrics such as MAE, RMSE, and MSE are scale-dependent and do not represent percentage errors, an additional extraction step was performed to ensure comparability across studies. For each model reporting MAE, RMSE, or MSE, the scale of the target variable was extracted along three dimensions:
(i) Unit of measurement (e.g., absolute case counts, cases per 100,000 population, log-transformed cases);
(ii) Temporal resolution (e.g., weekly, 10-day, monthly);
(iii) Spatial resolution, classified as national (entire country), regional (multiple provinces or states), provincial (single administrative region), district level (municipalities or sub-city areas), or city level (single city).
This information was taken directly from each study, without derivation or transformation, and was used to facilitate valid cross-study comparisons of scale-dependent regression metrics.

2.5. Data Synthesis and Statistical Analysis

The results are reported in accordance with PRISMA 2020, including a PRISMA flow diagram and detailed tables summarizing study characteristics, methodological quality, and performance metrics. Specifically, a narrative synthesis was first conducted to summarize study characteristics, AI/ML model types, and outcomes. The result ranges reported in the subsections for each AI category refer to both the principal and comparator models.
Given the substantial heterogeneity in study design, temporal and spatial resolution, incidence scale, and reporting practices, no meta-analytic pooling was performed. Instead, all analyses were descriptive and exploratory. To characterise model performance across studies, unweighted distributions were generated for each reported metric of the principal models, without applying transformations, normalisation procedures, or weighting by sample size or study quality. Outliers were retained to preserve the original variability of the source data. All metrics were summarised using multi-panel visualisations to describe performance dispersion within and across modelling families. For classification metrics (AUC, sensitivity, specificity, PPV, NPV, accuracy, and F1-score), study-level estimates were displayed using horizontal boxplots stratified by modelling family. When ≥3 observations were available for a given family, full boxplots were produced. For numerical regression metrics with strong dependence on outcome scale (RMSE and MAE), multi-panel figures were constructed by stratifying results into predefined magnitude ranges (very small, small, medium, large). This approach prevented misleading cross-scale comparisons and highlighted context-dependent error behaviour across modelling families. Additional regression metrics (MAPE, MSE, R2, and r) were summarised using single-panel unweighted distributions. Visual synthesis focused on describing central tendency, variability, dispersion patterns, and systematic differences across modelling groups. No cross-family statistical comparisons or inferential tests were conducted, as heterogeneity in study frameworks, reporting conventions, and outcome units precluded harmonisation.
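A minimal sketch of this descriptive synthesis, assuming the extracted estimates sit in a flat table with one row per model-metric pair (the file name and column names are hypothetical):

```python
# Sketch of the magnitude-stratified, unweighted visual summary for RMSE:
# one panel per predefined band, horizontal boxplots by modelling family.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("extracted_metrics.csv")  # columns: family, metric, value (assumed)

rmse = df[df["metric"] == "RMSE"].copy()
rmse["band"] = pd.cut(
    rmse["value"],
    bins=[0, 1, 10, 1000, float("inf")],
    labels=["very small (<=1)", "small (1-10)", "medium (10-1000)", "large (>1000)"],
)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (band, grp) in zip(axes.ravel(), rmse.groupby("band", observed=True)):
    # Draw full boxplots only for families with >=3 observations in this band.
    counts = grp["family"].value_counts()
    keep = grp[grp["family"].isin(counts[counts >= 3].index)]
    if not keep.empty:
        keep.boxplot(column="value", by="family", vert=False, ax=ax)
    ax.set_title(str(band))
fig.suptitle("RMSE by modelling family, stratified by magnitude band")
plt.tight_layout()
plt.show()
```

No transformations, weights, or outlier removal are applied, mirroring the unweighted design described above.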

2.6. Risk of Bias Assessment

The methodological quality and risk of bias of the included studies were evaluated using the Prediction model Risk of Bias Assessment Tool (PROBAST). Each study was assessed across four domains: participants, predictors, outcome, and analysis. Assessments were performed independently by two reviewers, with disagreements resolved through consensus.
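For transparency, overall judgements follow the tool’s standard roll-up logic, sketched below in simplified form (a minimal illustration that omits PROBAST’s additional downgrade rules, e.g., for models lacking external validation):

```python
# Simplified sketch of the PROBAST overall-risk roll-up across the four domains.
DOMAINS = ("participants", "predictors", "outcome", "analysis")

def overall_probast(judgements: dict[str, str]) -> str:
    """Roll domain-level ratings ("low"/"high"/"unclear") into an overall rating."""
    ratings = [judgements[d] for d in DOMAINS]
    if "high" in ratings:          # any high-risk domain -> overall high risk
        return "high"
    if all(r == "low" for r in ratings):
        return "low"
    return "unclear"               # otherwise at least one unclear domain

example = {"participants": "low", "predictors": "low",
           "outcome": "unclear", "analysis": "low"}
print(overall_probast(example))  # unclear
```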

3. Results

3.1. Literature Search

A total of 4531 records were identified through database searches in PubMed/MEDLINE (n = 801), Scopus (n = 1768), and Embase (n = 1962). After removing duplicates (n = 2555), 1976 unique records were screened by title and abstract. Of these, 1790 were excluded as non-original or focused on unrelated topics, leaving 186 records eligible for full-text review. Full texts were unavailable for 22 articles. After full-text assessment of the remaining 164 articles, 66 records were excluded for specific reasons, leaving 98 studies for inclusion. The overall study selection process is illustrated in Figure 1.

3.2. Geographical Distribution

Most of the included studies originated from Asia and South America, reflecting the higher burden of mosquito-borne diseases in these regions. At the continental level (Figure 2), the majority of studies were conducted in Asia (n = 62) [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79], followed by South America (n = 23) [80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101] and North America (n = 6) [102,103,104,105,106,107]. Fewer studies were reported from Europe (n = 2) [91,108], Africa (n = 1) [109], and Oceania (n = 1) [110]. Three [111,112,113,114] studies covered multiple continents and were therefore not represented on the map.
At the country level (Figure 3), the highest number of studies was identified in Brazil (n = 15) [80,82,84,85,86,88,89,90,91,93,94,96,98,99,115], Malaysia (n = 12) [36,38,51,52,53,54,57,58,62,67,76,78], and Bangladesh (n = 8) [18,27,37,47,55,56,65,71]. Several other countries contributed a smaller number of studies, mainly located in the Americas and Southeast Asia. Countries highlighted but without numerical labels correspond to those represented by a single study. Detailed information on the geographical distribution of studies is provided in Supplementary Table S2.

3.3. Temporal Distribution and Evolution of AI Model Types

The 98 studies that met the inclusion criteria had the following yearly distribution: 3 [43,83,110] in 2015, 4 [36,49,70,107] in 2016, 2 [34,100] in 2017, 6 [24,25,45,61,80,106] in 2018, 3 [21,60,102] in 2019, 12 [31,41,50,64,82,93,98,101,103,108,111,116] in 2020, 11 [19,29,32,35,44,59,67,75,76,88,89] in 2021, 11 [38,47,57,72,86,90,91,96,99,115,117] in 2022, 12 [39,52,53,62,69,78,79,81,94,95,105,114] in 2023, 18 [18,20,22,23,33,37,40,46,48,55,58,66,73,74,84,97,104,109] in 2024, and 16 [26,27,30,51,54,56,63,65,68,71,74,77,85,87,92,113] in 2025.
Considering both principal and comparator models, the use of AI methods showed a progressive diversification over time (Figure 4). Classical ML algorithms represented the predominant approach in the initial years and remained consistently applied throughout the decade, although with a reduction from 2022 onward. Tree-ensemble and time-series statistical models did not show a clear increasing trend but were intermittently used during the study period, particularly from 2018 onward. A similar pattern was observed for deep learning models, which were absent before 2018 and showed a marked increase from 2020, becoming one of the most frequently applied categories in recent years. Mechanistic and heuristic models were rarely employed. Models classified as hybrid/superensemble included studies that combined multiple AI algorithms or where the model type could not be unambiguously categorized.

3.4. Characteristics of the Features of the Included Studies

As shown in Figure 5, across the 98 included studies, epidemiological surveillance data were the most frequently used predictors (n = 90) [18,19,20,21,22,23,24,25,26,27,29,30,31,32,33,34,35,36,37,38,39,40,43,44,45,47,48,50,51,52,53,54,55,56,58,59,62,63,64,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,100,101,102,104,105,106,107,108,110,111,113,114,115,116], followed by climatic and environmental variables (n = 87) [18,19,20,21,22,23,24,25,26,27,29,30,31,32,33,34,35,36,37,38,39,40,43,44,45,47,48,50,51,52,53,54,55,56,57,60,62,64,66,69,70,71,72,73,74,75,76,77,78,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,97,98,99,100,101,102,103,104,105,106,108,109,110,111,113,114,115,116]. Socio-demographic factors were incorporated in one third of the models (n = 33) [23,25,31,38,39,43,47,50,52,53,54,65,66,70,71,78,81,89,91,92,93,97,98,100,101,102,105,106,108,110,111,113], whereas landscape variables (n = 18) [25,33,46,52,53,65,66,69,70,71,78,89,91,100,101,105,110,113] and entomological indicators (n = 14) [25,38,43,46,48,51,62,70,74,78,96,102,113,116] were less commonly included. Mobility and transport data (n = 13) [25,44,61,82,84,85,91,100,101,102,108,112,118], big data sources such as internet or social media queries (n = 7) [19,49,64,77,78,90,112], clinical information (n = 5) [46,54,67,68,116], and health system indicators (n = 5) [47,68,102,112,113] were rarely used. Genomic/virological [112,113] and policy or intervention-related [33,112] variables were only sporadically considered (both n = 2).

3.5. Included Studies Characteristics

Table 1 summarizes the main characteristics of the 98 included studies. Most contributions were forecasting or modelling studies using routine surveillance data, while a smaller subset adopted ecological or spatiotemporal designs, and only a few were based on hospital or clinical datasets. The majority of models were developed at national or sub-national level in the general population, whereas studies focusing specifically on travellers or hospitalized patients were rare. Almost all studies targeted dengue (n = 90) [18,19,20,21,22,23,24,25,26,27,29,30,31,32,33,34,35,36,37,38,39,40,41,43,44,45,46,47,48,49,51,52,53,54,55,56,57,58,59,60,62,63,64,65,66,67,68,69,70,71,72,73,74,76,77,78,79,80,81,82,83,84,85,87,88,90,91,92,93,94,95,96,97,99,100,101,103,104,106,107,108,110,111,112,113,114,115,116], with a limited number addressing other mosquito-borne infections such as Zika (n = 2) [98,102], West Nile virus (n = 2) [91,105], yellow fever (n = 1) [89], or Rift Valley fever (n = 1) [109], while 2 studies [75,86] modelled multiple arboviral diseases simultaneously. Outcomes were most commonly defined as weekly or monthly incidence or case counts, generally based on suspected or laboratory-confirmed cases recorded in routine surveillance systems. Regarding the prediction task, most models focused on short-term forecasts at weekly time scales, with fewer studies addressing medium-term (up to several months) or long-term horizons of one year or more. Handling of missing values and data imbalance was heterogeneously reported: many studies did not explicitly describe any procedure, while others applied simple imputation or case-exclusion strategies [18,26,35,38,39,47,56,60,67,73,74,80,83,87,109,113], and only a minority used more advanced methods such as resampling or specialized imputation algorithms. Internal validation was typically performed using temporal train–test splits or k-fold cross-validation. In terms of implementation readiness, most studies were classified as “research only” or “proof-of-concept”, with only a small number explicitly designed as decision-support tools or described as being used operationally within routine public-health surveillance.
Data sources varied widely across studies, ranging from national surveillance systems and meteorological stations to remote sensing, mobility, and socioeconomic databases, with some studies integrating multiple heterogeneous datasets. A detailed list of data sources used in each study is provided in Supplementary Table S2.

3.6. Model Performance by AI Category

Overall, the reported performance varied substantially across model categories, reflecting differences in outcome type, data availability, and prediction tasks. Classification metrics (Table 2 and Supplementary Table S3), including AUC, sensitivity, specificity, PPV/precision, NPV, accuracy, and F1-score, were mainly used in case-based or alert-level models. Regression metrics (Table 3 and Supplementary Table S4) instead comprised error and goodness-of-fit measures such as RMSE, MAE, MAPE, MSE, coefficient of determination (R2), Pearson’s correlation coefficient (r), and SMAPE.

Model Validation Approaches

Among the 98 included studies, the vast majority implemented internal validation strategies (n = 89) [18,19,20,21,22,25,26,27,30,31,32,33,34,35,37,38,39,40,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,62,64,65,66,67,68,69,70,71,72,73,74,75,76,77,79,80,81,82,83,84,86,88,89,90,91,92,93,94,95,96,97,99,100,101,102,104,105,106,108,109,110,111,113,114,115,116,119], while only 5 studies [41,61,98,103,107] reported an external validation procedure, typically using data from distinct time periods, regions, or populations, and only 1 study [23] integrated both internal and external validation steps (Table 2). The most frequent internal validation methods were simple hold-out or random train/test splits (n = 33) [20,22,27,30,31,37,39,40,44,45,47,51,54,55,58,62,63,66,68,69,71,73,75,77,85,87,88,93,102,105,108], with training proportions ranging between 50% and 80%, and k-fold cross-validation (n = 19) [18,20,21,23,27,38,39,43,46,47,48,49,66,71,76,108,109,113,116], most commonly 5- or 10-fold. Temporal or time-series validation approaches were applied in 22 studies [25,32,33,34,37,51,56,66,68,71,72,74,75,82,90,91,93,94,97,105,111,115], mainly in incidence forecasting models, often using rolling or expanding-window designs. A few investigations (n = 4) [23,66,71,109] employed nested resampling or combined spatiotemporal validation schemes, whereas 3 studies [36,78,112] did not clearly report the validation type. Overall, external validation remained uncommon, and independent test sets were often limited in size, potentially leading to optimistic performance estimates.
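To make the temporal schemes concrete, the sketch below illustrates an expanding-window split of the kind described above, using scikit-learn's TimeSeriesSplit on synthetic weekly data (the included studies implemented their own variants; the model and data here are placeholders):

```python
# Expanding-window temporal validation: each fold trains on all weeks before
# the test block, so information never leaks backwards in time.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(156, 4))        # e.g., 3 years of weekly climate features
y = rng.poisson(lam=20, size=156)    # e.g., weekly reported case counts

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    print(f"fold {fold}: train={len(train_idx):3d} weeks, MAE={mae:.2f}")
```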

3.7. Classical Machine Learning

3.7.1. Classification Metrics

Among the classical ML models, AUC values ranged from 0.70 [102] (NARX NN) to 0.96 [95] (SVM). In the time-series applications (NARX NN), performance declined with longer forecast horizons, from 0.91–0.95 at 1 week to 0.70–0.74 at 12 weeks [102]. Sensitivity varied between 0.87 [49] (CART) and 1.00 [43] (SVM-L), while specificity ranged from 0.01 [57] (ANN) to 0.95 [83] (DT). Accuracy values spanned 0.47 [106]–1.00 [99], with the best performance in Diffusion Maps + SVM (RBF), and the lowest in the ANN models. PPV was reported only in 2 studies [82,116] (0.92 in both), and NPV was not provided in any study. Reported F1-scores were generally high (0.73 [106]–0.97 [95]), confirming balanced predictive ability across most classical ML algorithms.

3.7.2. Regression Metrics

In classical ML studies, MAE values ranged from 4.57 [47] (MLR, monthly, city level) to 4759.06 [18] (DT + Sequential Squeeze FS, monthly, national). Notably, 2 studies conducted on the same scale (cases, monthly, city level) reported markedly different MAE values using MLR: 4.57 [47] in one case and 200.68 [41] in another, highlighting substantial variability even within identical spatiotemporal settings. The reported RMSE values spanned from 0.04 [104] for an ANN model predicting severe dengue (weekly, regional) to 9296.35 [18] for DT + Sequential Squeeze FS (monthly, national). Of note, within the same monthly national scale and with comparable DT-based approaches, another study reported a substantially lower RMSE of 5.43 ± 0.43 [52], likely reflecting model overfitting on a very small dataset. One study [36] reported MSE within the classical ML category (NN Regression, weekly, district level), with values ranging from 0.06–0.08 in Hulu Selangor (Malaysia) to 98.55 in Hulu Langat. Reported MAPE ranged from 0.94 [18] (DT + Sequential Squeeze FS, monthly, national) to 17–24 [70] for LASSO (weekly forecasts). SMAPE was not reported by any study. For R2, values ranged from 0.18 [60] (MLR) to 0.99 [34] (SVR across provincial settings). Finally, r ranged from 0.50 [41] (MLR, monthly, city level) to 0.91 [90] for LASSO at a 1-week horizon, with progressively lower correlations at longer horizons (0.76 at 3 weeks, 0.61 at 6 weeks, and 0.56 at 8 weeks).

3.8. Tree-Ensemble Models

3.8.1. Classification Metrics

Among tree-ensemble models, AUC values were consistently high, ranging from 0.84 [71] (LightGBM) to 0.99 [40] (AdaBoost and XGB). Reported sensitivity ranged from 0.64 [40] (RF, Bangkok) to 0.99 [109] (XGB), with some XGB configurations [91] displaying marked variability and LightGBM reaching 0.98 [71]. For PPV, values ranged from 0.88 [40] (RF) to 0.99 (GB in Bangladesh and XGB [109]), with consistently strong precision across ensemble approaches. NPV was reported in only one study [71]; it differed across years (0.86 in 2018 vs. 0.69 in 2019), reflecting changes in feature windows or data availability. Specificity values varied between 0.73 [48] (RF) and 0.98 (LightGBM), with XGBoost reaching 0.96. Accuracy values spanned from 0.79 [40] (RF) to 1.00 [109], with the perfect score achieved by XGB in one study, while lower values reflected imbalanced datasets. Finally, F1-scores ranged from 0.72 [20] (Extra Trees) to 0.98 [40] (GB), with most ensemble models showing F1 values ≥ 0.90, confirming a strong balance between sensitivity and precision across settings.

3.8.2. Regression Metrics

In tree-ensemble studies, MAE values ranged from 0.15 [24] (RF, weekly, city level, per 1000 population) to 97.9 [87] in categorical dengue settings (RF, weekly, city level). Several studies operating on the same monthly national scale with RF, XGB, and LightGBM reported MAEs between 0.24 and 0.87 [71], showing relatively tight clustering across different feature groups. Reported RMSE values extended from 0.21 [24] (RF, weekly, city level, per 1000 population) up to 23.36 [71] (RF, weekly, city level, 8-week horizon). Tree-ensemble models applied at monthly national scale [71] (RF, XGB, LightGBM with SHAP) consistently reported RMSEs below 1.0, with LightGBM yielding the lowest values (0.32–0.57). In contrast, weekly city-level [90] forecasting displayed clear horizon-dependent increases (11.03 at 1 week vs. 23.36 at 8 weeks). For MSE, values ranged from 1.20 [20] (ETC, weekly, district level) to over 5000 [23] in 10-day regional incidence forecasts (ranger and ensemble configurations). Reported MAPE values ranged from 0.05–0.17 [71] for LightGBM, RF, and XGB at monthly national scale (training and test), up to 8.32 [38] in RF models when entomological covariates were removed, indicating substantial sensitivity of tree-based predictors to domain-specific input features. SMAPE was not reported by any tree-ensemble study. For R2, estimates ranged from 0.09 [71] (LightGBM) to 0.85 [90] at short horizons in weekly city-level RF models, with values declining systematically as forecast length increased (0.62 at 3 weeks, 0.40 at 6 weeks, 0.34 at 8 weeks). Finally, r values ranged from 0.39 [98] (RF, monthly, national) to 0.95 [87] across several weekly city-level districts (e.g., Natal and Barranquilla).

3.9. Deep Learning Models

3.9.1. Classification Metrics

Among deep learning approaches, AUC was reported in a single study [55], with the principal model (MobileNetV3Small) and its comparators (ResNet50 and MobileNetV3Large) all achieving 0.98 ± 0.01.
Reported sensitivity showed substantial variability across architectures and locations, ranging from values near 0.00 [85] in city-specific LSTM forecasts (e.g., Belém, Fortaleza) to 0.97 ± 0.03 [55] in MobileNetV3Small. Specificity was generally higher, spanning 0.33–1.00 [85] across LSTM thresholds and cities, and reaching 0.99 ± 0.01 [55] in MobileNetV3Small and ResNet50/MobileNetV3Large. For precision, values ranged from 0.88 [68] (CNN-BiGRU with attention) to 0.99 ± 0.01 [55] (MobileNetV3Small), while NPV was not provided in any study. Reported accuracy varied widely depending on the model architecture and forecast horizon, from 0.26–1.00 [21] in CNN spatiotemporal experiments to 0.98 ± 0.01 [55] in MobileNetV3Small and ResNet50/MobileNetV3Large. LSTM models exhibited broad cross-city variability (approximately 0.63–0.98 [85], depending on threshold and location), while horizon-dependent CNN-BiLSTM [79] predictions declined from 0.88 at 1 week to 0.78 at 4 weeks. Finally, F1-scores ranged from 0.00 [85] to 0.98 [55], with the lowest values again observed in city-level LSTM forecasts with extreme class imbalance and the highest in MobileNetV3Small and ResNet50/MobileNetV3Large.

3.9.2. Regression Metrics

Among deep learning models, MAE values varied substantially across architectures and spatiotemporal scales, ranging from 0.20–0.53 [91] in log-weekly-regional LSTM configurations to values exceeding 1000 [85] in several weekly city-level LSTM applications (e.g., Belo Horizonte 1483; Brasília 1067). Monthly national LSTM forecasts showed MAEs of 301.64 [37], while BiLSTM and 1D-CNN models at monthly city scale reported values between 19.11 and 31.49 [19]. Reported RMSE values ranged from 0.22–0.40 [91] in log-weekly-regional LSTM models to >800 in monthly national forecasts under high-incidence conditions. City-level weekly LSTM outputs showed RMSEs between 4.79 and 10.13 [35], while CNN–BiLSTM and 1D-CNN models at monthly and weekly city scale returned a higher value, at 106.96 [63]. For MSE, only one study [30] provided data, reporting 3187.43 for a monthly 1D-CNN at city level. Reported MAPE varied from approximately 21–30% [85] in individual LSTM predictions to >40% in several LSTM baseline configurations, with notable cross-city variability. SMAPE was reported by only 1 study [19], with BiLSTM values of 0.18–0.31 at monthly city scale. For R2, estimates ranged from 0.91–0.94 [63] (CNN-LSTM hybrid and ConvLSTM models) to 1.00 [72] in univariate LSTM settings. Finally, r values ranged from 0.42 [98] (deep feed-forward networks) to 0.92 [59] in LSTM models, with year-to-year improvements observed in multi-year evaluations (e.g., from 0.58 in 2016 to 0.92 in 2018).

3.10. Hybrid/Superensemble Models

3.10.1. Classification Metrics

Across hybrid and super-ensemble approaches, AUC values ranged widely depending on the underlying base learners, from 0.62 to 0.82 [62] in lower-performing models within multi-algorithm frameworks (e.g., ANN, DT, AdaBoost) to 0.93–0.97 [108] in stronger ensembles based on RF, XGB, glmnet, and PLS. Reported sensitivity extended from 0.42 to 0.49 [46] in weaker decision-tree or SVM configurations to 0.99 [95] in ensemble-optimized RF, DT, and AdaBoost models. Specificity showed a similar spread, ranging from 0.67 [33] in basic GAM/ANN/SVM configurations to 0.95 [108] in glmnet, RF, and XGB. For precision, values ranged from 0.41 to 0.48 [46] in NB/DT/SVM models to 0.90–0.93 [108] in RF and XGB. NPV was reported less frequently but ranged from 0.87 [65] to 0.96 [108] in GLM and XGB. Reported accuracy varied substantially, from 0.29 to 0.42 [18] in simple KNN/GB/SVR settings to 0.99 [95] in enriched hybrid pipelines combining feature selection with multiple classifiers. Finally, F1-scores ranged from 0.41 to 0.51 [46] in weaker NB/SVM/DT models to 0.95–0.99 [95] in sophisticated hybrid frameworks (e.g., PCA/GOOSE/PSO, AdaBoost, RF).

3.10.2. Regression Metrics

MAE values spanned a very broad range across scales and settings, from 0.17–0.27 [24] per 1000 population at weekly city level (GAM/GB) and from 0.43–0.52 [92] per 100,000 population at monthly provincial level (GLM, RF, XGB, LSTM), up to >50,000 [84] cases in the worst long-horizon state-level forecasts of climate-/case-based LSTM and Bayesian models in Brazil. Reported RMSE ranged from 0.02 [33] in weekly city-level settings (GAM, RF, CIF, SVM, ANN, XGB) to >20,000 [18] cases for some monthly national SVR and XGBoost models. Where reported, MSE values ranged from 0.18–0.37 [113] at the annual national scale (RF, XGB, MLP, SVR) to >80,000 [30] for monthly city-level ANN in high-incidence settings. MAPE values ranged from <1 to 6% [18] for several national monthly ensembles (RF, XGB, GB, SVR, KNN) to 30–70% [77] in ARDL-based city-level hybrids and exceeded 900–1400% [84] in the worst-performing state–horizon combinations of Brazilian LSTM and Bayesian RE models. SMAPE was rarely reported but, where available, was low (0.04–0.08) [69] for optimized weekly city-level ensembles (e.g., CNN + ANN + SVM, LSTM-RF). For R2, estimates ranged from 0.00 to 0.98 [34], while no study reported r for hybrid/superensemble regression models.

3.11. Time-Series/Statistical Models

3.11.1. Classification Metrics

Regarding time-series and statistical baselines, AUC was reported in a single study, with a temporal-average baseline achieving a value of 0.78 [25]. Accuracy was likewise available from only one comparative analysis [37], in which ARIMA and Prophet models reached 0.58 and 0.60, respectively. Sensitivity, specificity, PPV, NPV, and the F1-score were not reported by any study in this category.

3.11.2. Regression Metrics

MAE values ranged from 2.80 [64] (moving-average baseline, weekly, provincial) to 433.21 [37] (ARIMA, monthly, national). Reported RMSE spanned from 293.9 [94] in a statistical baseline (seasonal naïve) with monthly cases at the district level to 6806 [80] (naïve monthly city-level model). MSE was reported in a single study [64], with values ranging widely across provinces and forecast horizons, from 6.47–32.86 for baseline and moving-average models (Mukdahan, Pattani) up to 1729.00 in naïve forecasts for Chiang Rai at longer horizons. MAPE was seldom reported and varied from 39.66–42.18 [37] for Prophet and ARIMA up to 94.84 [58] for NNAR (weekly, national). SMAPE was not provided by any time-series/statistical study. For R2, only naïve and moving-average baselines reported values, ranging from 0.07 (weekly, provincial, long horizon) to 0.97 [64] (weekly, provincial, short horizon), while r was never reported.

3.12. Mechanistic and Heuristic Models

Because only a small number of studies relied on mechanistic or heuristic approaches, their performance metrics are reported together.

3.12.1. Classification Metrics

Classification outcomes were seldom reported. Among mechanistic approaches, only one study [46] (WRF) provided evaluation metrics, reporting a sensitivity of 0.88, a specificity of 0.95, a precision of 0.85, an accuracy of 0.94, and an F1-score of 0.86. For heuristic and rule-based systems, only one study [116] reported classification results, comparing Bayesian Belief Networks, neural networks, and fuzzy systems; these models showed highly consistent performance, with sensitivity, specificity, accuracy, and F1-scores all ranging between 0.88 and 0.89.

3.12.2. Regression Metrics

Regression metrics were reported infrequently across mechanistic and heuristic studies. Among the heuristic approaches, ANFIS [41] showed an MAE of 151.51, an RMSE of 216.54, and an r of 0.83 at the monthly city level, while differential evolution models [100] yielded an MAE of 40.18–308.68, an RMSE of 40.04–106.30, and an MSE of 1627.11–11,869.5 across national monthly settings. One heuristic GANN model [36] reported only case-specific deviations (0.06–0.07) at the weekly district level. Mechanistic studies provided limited numerical outputs: the SIR + EAKF ensemble [107] reported timing, peak, and total-case errors (e.g., 4.8, 25, and 519 for its primary configuration), without standard regression metrics. Overall, regression performance in this category remains sparsely documented and highly heterogeneous.

3.13. Descriptive Performance Patterns Based on Unweighted Comparative Analyses

3.13.1. Classification Performance

Across studies, substantial heterogeneity was observed in the performance of the evaluated modelling approaches. Overall, classification metrics demonstrated systematic differences in performance distributions, with tree-ensemble and classical machine-learning models generally showing higher stability and deep-learning models exhibiting greater variability (Figure 6).

The distribution of AUC values showed consistently high discrimination for tree-ensemble models, with most estimates exceeding 0.85, while classical ML approaches spanned a wider range (0.65–0.96). Hybrid and superensemble methods also showed high AUC values but were represented by few observations, whereas single estimates from deep-learning and time-series/statistical models limited interpretability.

For sensitivity, tree-ensemble models demonstrated the highest and most stable performance, with most values above 0.85. Classical ML models displayed broader dispersion (0.73–0.99), while deep-learning models showed the widest range overall, with sensitivities extending from 0.00 to 1.00 across different LSTM configurations. Few observations represented hybrid and mechanistic approaches.

The distribution of specificity values also varied considerably. Tree-ensemble methods showed strong and concentrated performance (typically 0.90–0.98). Classical machine-learning estimates ranged more widely (0.70–0.95), and deep-learning models again showed substantial dispersion (0.33–1.00). Hybrid and mechanistic approaches contributed limited additional observations.

Regarding PPV, tree-ensemble models yielded consistently high precision, with estimates typically between 0.89 and 0.99. Classical machine-learning estimates were more variable (0.70–0.92). Deep-learning models generally performed well, with most values clustering between 0.88 and 0.99. Mechanistic modelling contributed a single data point. For NPV, only tree-ensemble and classical machine-learning models reported estimates. Tree-ensemble approaches demonstrated high NPV (0.94–0.98), whereas classical machine-learning values were slightly lower (0.88–0.91). No other modelling families contributed NPV metrics.

The distribution of accuracy values showed that tree-ensemble models achieved the most stable and highest performance (0.87–1.00). Classical machine-learning models spanned a broad range (0.20–0.97). Deep-learning models exhibited the widest variability, extending from very low scores (0.26) to perfect accuracy (1.00), reflecting strong dependence on architecture and study conditions. Hybrid and mechanistic approaches provided fewer observations.

Finally, F1-scores indicated that tree-ensemble models achieved the highest and most consistent balance between precision and recall (typically 0.90–0.98). Classical machine-learning models showed broader dispersion (0.73–0.97). Deep-learning approaches exhibited the broadest range overall (0.00–0.98), highlighting variability in class-balance handling across datasets and architectures. Mechanistic modelling contributed only a single estimate.

3.13.2. Regression Performance

Regression metrics revealed marked heterogeneity across studies, largely driven by differences in spatial scale, temporal resolution, and underlying case magnitude. Overall, error distributions showed consistent scale-dependence: low errors occurred in fine-resolution forecasts, whereas larger errors were associated with national-level or high-incidence settings.
For RMSE (Figure 7), multi-panel stratification showed four distinct magnitude regimes. In the very small range (≤1), tree-ensemble, hybrid, and deep-learning models showed tightly clustered errors, indicating stable performance under low-incidence, high-resolution conditions. The small-error range (1–10) displayed broader but still moderate variability across modelling families. Medium-range RMSE values (10–1000) showed pronounced heterogeneity, most evident among deep-learning and classical machine-learning models, reflecting mixed spatial/temporal contexts. The largest RMSE values (>1000) originated primarily from national-scale predictions using classical ML and time-series/statistical models.
Magnitude-stratified MAE distributions confirmed these scale effects (Figure 8). Very small errors (≤1) were mostly produced by tree-ensemble and classical ML models applied to weekly district/provincial data. Small errors (1–10) encompassed several modelling families, with variability driven more by data granularity than by algorithm choice. Medium-scale MAE values (10–1000) showed broader dispersion, especially for deep-learning and tree-ensemble methods in monthly national or city-level forecasts. Large errors (>1000) were associated primarily with national-level deep-learning and classical ML models.
MAPE distributions further supported this pattern (Figure 9). Tree-ensemble and classical ML models consistently produced the lowest percentage errors, including several estimates below 1%. Moderate errors (2–15%) were observed across tree-ensemble, hybrid, and some deep-learning approaches. Most deep-learning configurations showed wider dispersion (20–36%). The highest errors (>90%) occurred exclusively in time-series/statistical models.

MSE values also reflected strong dependence on geographic and temporal granularity (Figure 9). Tree-ensemble models spanned from near-zero to several thousand, particularly in regional 10-day forecasts. Deep-learning models ranged from low-error district-level settings to much higher values in monthly city-level predictions. Classical ML models generally occupied the lower range, while hybrid and heuristic approaches appeared at both extremes, depending on scale.

For R2, classical ML demonstrated the most consistently high explanatory power (0.75–0.99) (Figure 9). Tree-ensemble methods showed wider variation, ranging from low (0.09) to high (0.85–0.92), depending on study context. Deep-learning models reported high values, including near-perfect fits. Hybrid/superensemble models exhibited intermediate-to-high values but were represented by few observations. The correlation coefficient r showed similar patterns (Figure 9). Tree-ensemble models achieved the strongest and most stable correlations (0.60–0.95). Deep-learning models were more heterogeneous (0.42–0.92), though several achieved strong correlations (≥0.80). The heuristic model contributed a single mid-high value (0.83).

3.14. Assessment of Risk of Bias Using PROBAST

Across the 98 included studies, the overall methodological quality was highly variable (Figure 10). Using PROBAST, 63 studies were judged at high overall risk of bias, 23 at low risk, and 12 at unclear risk. At the domain level, the participant and predictor domains showed the most favourable profiles, with >78% of studies rated as low risk, reflecting the reliance on routinely collected surveillance and environmental datasets. In contrast, the analysis domain represented the most critical source of bias: over 60% of studies received a high-risk judgement. Common limitations included lack of transparent reporting of preprocessing steps, inconsistent handling of missing or imbalanced data, absence of hyperparameter tuning procedures, and insufficient safeguards against overfitting.

4. Discussion

4.1. Interpretation of Main Findings

This review provides the most extensive multi-metric, cross-model synthesis to date of forecasting performance for mosquito-borne viral diseases across classical machine-learning, tree-ensemble, deep-learning, hybrid/superensemble, mechanistic and time-series/statistical approaches. Substantial heterogeneity was evident across studies in terms of geographic setting, temporal resolution, incidence scale, predictor selection, model specification, and performance reporting. Forecasts were produced at multiple spatial units (national, provincial, district, city) and time granularities (annual, monthly, weekly, 10-day and bi-monthly), often with different combinations of climate, demographic, environmental and entomological covariates. Reporting practices were highly inconsistent, with many studies providing incomplete or non-standardised metrics and only a minority clarifying validation procedures or uncertainty quantification.

Across classification tasks, tree-ensemble models consistently showed the highest median performance and the lowest dispersion, whereas classical machine-learning and deep-learning models displayed broader variability, often driven by model architecture, input complexity and study context. Hybrid/superensemble, mechanistic, and time-series models were less frequently reported, limiting inference. Regression metrics exhibited a strong dependence on temporal and spatial scale. Very small error ranges were almost exclusively observed in fine-resolution, district- or provincial-level forecasting, whereas large errors were associated with national-scale, high-incidence settings. Tree-ensemble and classical ML approaches frequently achieved the lowest error values across scale bands, while deep-learning and time-series/statistical methods produced more dispersed and context-dependent results. Overall, the findings demonstrate that forecasting performance reflects the interplay between algorithmic family, spatiotemporal context, incidence magnitude, and study design, underscoring the need for standardised evaluation frameworks.

4.2. Interpretation and Comparison with Existing Literature

In comparing our findings with previously published systematic reviews of dengue and mosquito-borne disease forecasting models, several points of convergence emerge, but our study also provides substantive methodological extensions. The largest study [120] to date, reviewing 98 dengue outbreak prediction models across 64 studies, reported marked inconsistencies in modelling practices, including limited adoption of ML approaches (39.4%), very low rates of external validation (5.2%), and highly heterogeneous reporting of performance metrics. These structural issues are reproduced in our dataset: despite a broader temporal and geographical scope, we likewise observed substantial heterogeneity in modelling choices, input feature sets, temporal horizons, validation strategies, and combinations of reported metrics.

A second review [121] focusing on dengue forecasting in endemic regions found that ML models, particularly tree-ensemble methods such as RF, tended to outperform classical statistical approaches (e.g., ARIMA, Poisson regression), yet emphasised the fragmented evidence base and variability in methodological quality. Our results are consistent with these observations: tree-ensemble methods in our synthesis showed the highest central tendency and smallest dispersion for classification performance, whereas classical ML and DL models showed markedly broader variability. This mirrors prior evidence but also shows, with greater granularity, how performance dispersion manifests differently across each metric, modelling family, and study design.

A third study [122] highlighted that although many ML-based models reported favourable predictive accuracy, the absence of transparent validation procedures and the inconsistent reporting of evaluation metrics severely limited cross-study comparability. Our findings reinforce this concern: high performance was achievable for certain deep-learning or hybrid models, but variability, particularly in regression metrics, was strongly driven by study-specific characteristics such as temporal resolution, spatial aggregation, incidence magnitude, and covariate selection.

However, our study extends the existing evidence base by explicitly quantifying how these sources of heterogeneity propagate across performance distributions. Regression metrics (RMSE, MAE, MAPE, and MSE) demonstrated pronounced scale-dependence in our dataset. By stratifying RMSE into predefined magnitude bands (≤1, 1–10, 10–1000, >1000), we showed that fine-resolution district-level weekly forecasts consistently achieved RMSE ≤ 1, while national-scale or monthly predictions routinely exceeded RMSE > 1000. This behaviour aligns with comparative modelling work such as the Rio de Janeiro study [120], which found that LSTM architectures with climatic covariates outperformed ARIMA only at short horizons, whereas ensemble or hybrid approaches were more competitive for broader spatial scales or longer forecasting windows. Similarly, Liu et al. [123] reported that XGBoost achieved RMSE = 109, MAE = 127 and MAPE = 12.9% for monthly division-level forecasts in Bangladesh, outperforming SARIMA and SVR and reinforcing that scale, horizon, and feature design critically shape regression performance. Our stratified multi-panel visualisations provide direct empirical confirmation of this context-dependence: in the “very small” error band (≤1), tree-ensemble, hybrid, and deep-learning models clustered tightly; in the “small” band (1–10), variability increased across modelling families; in the “medium” band (10–1000), deep-learning and classical ML models exhibited wide interquartile ranges; and in the “large” band (>1000), classical ML and time-series/statistical methods produced the highest absolute errors. Across all metrics, these results indicate that algorithmic family alone does not determine forecasting performance; rather, the interplay between modelling strategy, spatial/temporal scale, data structure, and epidemiological signal magnitude is the dominant determinant of predictive accuracy.

4.3. Implications for Public-Health Practice

From a public-health perspective, the findings of this review provide several actionable insights for the integration of mosquito-borne viral disease forecasting models into operational early-warning systems. The consistently strong performance of tree-ensemble methods across classification metrics suggests that these algorithms are particularly well suited for outbreak detection, alert-level assignment, and other decision-support applications requiring robust categorical predictions at fine spatial or temporal resolution. This observation is consistent with evidence from operational or semi-operational contexts, where Random Forest-based systems have shown stable outbreak detection performance, for example, within Singapore’s national dengue forecasting programme [70] and in ArboMAP [124] for West Nile virus risk mapping in the United States. Such stability across heterogeneous input conditions suggests that tree-ensemble models may offer more reliable early-warning signals than deep-learning or traditional statistical approaches in settings where data completeness varies over time and is broadly consistent with previous work on data-driven surveillance and forecasting for tropical and sub-tropical diseases [125]. The pronounced association between error magnitude and epidemiological scale observed in this review underscores the importance of aligning modelling strategy with intended operational use. District-level, short-term forecasts were characterised by low error and narrow dispersion, making them more compatible with actionable public-health decision-making (e.g., targeted vector control, short-term resource allocation). In contrast, national-scale or monthly forecasts frequently exhibited substantially larger errors and wider variability, indicating that such outputs should be interpreted with caution and used primarily for situational awareness rather than precise operational planning. These findings reinforce the need for public-health agencies to calibrate expectations to scale-specific uncertainty, rather than applying uniform performance thresholds across settings. Importantly, the descriptive and unweighted nature of the available evidence highlights that model performance alone is not sufficient for real-world adoption. The lack of rigorous validation practices remains a critical barrier: prior systematic reviews have shown that fewer than 10% of dengue forecasting models undergo external validation, and our review identified similar gaps [120]. Models developed without proper cross-validation or external testing may be overfitted to their training environment, and real-world experience has demonstrated that forecast accuracy often degrades substantially when models are deployed in new locations or under changing epidemiological conditions [120,126]. Similar challenges have been documented for digital epidemiology systems based on non-conventional data streams (e.g., Google Flu Trends and Google Dengue Trends), where insufficient validation led to substantial misestimation of true disease activity. For this reason, external validation, context-specific calibration, and transparent uncertainty quantification should be considered essential prerequisites before operational deployment. The marked heterogeneity in study design, data sources, and performance-reporting standards further highlights the need for harmonised reporting frameworks. 
The adoption of standardised evaluation protocols, including consistent forecast horizons, uncertainty measures, and scale-stratified error reporting, would substantially improve comparability across modelling approaches and support more evidence-based integration into routine surveillance systems.

Finally, most models included in this review were developed exclusively for research purposes, with only a small minority piloted or implemented within real-world public-health infrastructures. This limited operational uptake is consistent with prior assessments of dengue and arboviral forecasting systems, which have repeatedly noted the gap between methodological innovation and practical deployment. Accelerating translation into practice will require co-development with public-health institutions, emphasis on interpretability and maintainability, and integration into existing surveillance workflows. Moreover, early-warning systems will only translate into effective risk reduction if they are accompanied by risk-communication and community-engagement strategies; for example, the poor awareness and knowledge of Zika virus observed in the Italian general population illustrate how limited understanding of arboviral risks can undermine the impact of preventive measures [127]. Strengthening standardisation, validation, and reporting practices will be essential for transitioning forecasting models from exploratory research tools into reliable components of operational early-warning systems.

Looking ahead, the rapidly growing body of AI-based arboviral forecasting studies originating from low- and middle-income countries, particularly in Asia and Latin America, is encouraging and suggests that context-adapted tools may increasingly be developed where the burden of mosquito-borne diseases is highest. At the same time, long-standing concerns about “algorithmic inequity” remain highly relevant: many AI systems in global health have historically been trained and validated predominantly on data from high-income, well-resourced settings, raising the risk of models that encode and amplify structural biases related to geography, race/ethnicity and socioeconomic status, and that perform suboptimally in underrepresented populations [128,129]. Ensuring that future AI models for mosquito-borne viral diseases are trained on diverse, locally generated data, co-designed with stakeholders in resource-limited settings, and evaluated through rigorous external validation will therefore be essential to avoid reproducing these inequities [130]. If coupled with investments in open data infrastructures, capacity building and transparent governance, the ongoing expansion of AI research in low- and middle-income settings has the potential to shift the field towards more equitable, accessible, and implementable early-warning tools for the communities most affected by arboviral transmission.

4.4. Strengths and Limitations

This study offers several strengths. It represents the first systematic effort to collate and compare the performance of AI and machine-learning models across the spectrum of mosquito-borne viral diseases, providing a unified descriptive overview of a highly fragmented field. The use of unweighted visual distributions offers a transparent representation of performance variability, free of weighting assumptions, and the multi-panel stratification of scale-dependent metrics (RMSE and MAE) provides a clearer understanding of error behaviour across different incidence scales and analytical contexts.

Several limitations must also be acknowledged. The analysis is purely descriptive, without meta-analytic weighting or adjustment for methodological heterogeneity; therefore, no causal inference or formal comparison across modelling families can be drawn. Dependence on published study-level metrics introduces potential publication bias, selective reporting, and uncertainties due to incompletely documented modelling procedures. Residual confounding related to prediction horizon, geographical setting, and underlying incidence patterns is likely, despite stratification efforts. Moreover, the uneven representation of modelling categories, particularly the limited number of deep-learning and mechanistic models, reduces the robustness of comparative insights. Finally, although the review aimed to address all major mosquito-borne viral diseases, the available literature was overwhelmingly focused on dengue, limiting the generalisability of findings to other arboviruses.
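The scale-stratified panels used in this review (Figures 7 and 8) can be reproduced with a few lines of plotting code. The sketch below uses randomly generated, purely hypothetical RMSE values per modelling family, not the extracted study-level data, to illustrate the construction of magnitude-banded boxplot panels.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical study-level RMSE values per modelling family (illustrative
# only; not the data extracted from the included studies).
families = ["Tree-ensemble", "Classical ML", "Deep learning", "Time-series/statistical"]
rmse = {f: rng.lognormal(mean=m, sigma=2.0, size=40)
        for f, m in zip(families, [1.0, 2.0, 2.5, 3.0])}

# The four magnitude bands used for scale-stratified error reporting.
bands = [("very small (<=1)", 0, 1), ("small (1-10)", 1, 10),
         ("medium (10-1000)", 10, 1000), ("large (>1000)", 1000, np.inf)]

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
for ax, (label, lo, hi) in zip(axes, bands):
    # Keep, for each family, only the estimates falling inside this band.
    data = [v[(v > lo) & (v <= hi)] for v in rmse.values()]
    ax.boxplot(data, vert=False, labels=families)
    ax.set_title(label)
    ax.set_xlabel("RMSE")
fig.tight_layout()
plt.show()
```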

5. Conclusions

In conclusion, this review showed substantial variability in performance across AI/ML forecasting models for mosquito-borne viral diseases. Tree-ensemble approaches emerged as the most consistently reliable for classification, while regression accuracy was strongly shaped by spatial and temporal scale, incidence levels, and prediction horizon. By providing the first comparative synthesis that integrates model performance with methodological quality and operational readiness, this review clarifies where current approaches are genuinely fit for purpose. This contribution is particularly timely, given the growing institutional interest in robust early-warning systems for climate-sensitive infectious diseases. Priorities for advancing real-world use include standardised performance reporting, rigorous external validation, and calibration to context-specific settings; these steps are essential for translating modelling advances into dependable public-health decision-support tools.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/make8010015/s1. Table S1: Full search strings used in PubMed, Embase, and Scopus, including all controlled vocabulary terms and free-text keywords, reported exactly as executed; Table S2: Continent and country (or countries, for multi-country studies) of each included study, together with the main data sources used for model development; Table S3: Results of classification metrics for the comparator AI models used in the included studies; Table S4: Results of regression metrics for the comparator AI models used in the included studies.

Author Contributions

Conceptualization, V.G. and F.P.; methodology, A.P.; software, V.G. and F.P.; validation, A.P., V.G. and F.P.; formal analysis, A.P., V.G. and F.P.; investigation, A.P., V.G. and F.P.; resources, A.P., V.G. and F.P.; data curation, A.P., V.G. and F.P.; writing—original draft preparation, all authors; writing—review and editing, all authors; visualization, A.P., V.G. and F.P.; supervision, C.S., V.B., O.E.S. and V.G.; project administration, V.G.; funding acquisition, O.E.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wilder-Smith, A.; Gubler, D.J.; Weaver, S.C.; Monath, T.P.; Heymann, D.L.; Scott, T.W. Epidemic arboviral diseases: Priorities for research and public health. Lancet Infect. Dis. 2017, 17, e101–e106. [Google Scholar] [CrossRef] [PubMed]
  2. Li, Y.; Kamara, F.; Zhou, G.; Puthiyakunnon, S.; Li, C.; Liu, Y.; Zhou, Y.; Yao, L.; Yan, G.; Chen, X.-G. Urbanization Increases Aedes albopictus Larval Habitats and Accelerates Mosquito Development and Survivorship. PLoS Neglected Trop. Dis. 2014, 8, e3301. [Google Scholar] [CrossRef]
  3. Abbasi, E. Global expansion of Aedes mosquitoes and their role in the transboundary spread of emerging arboviral diseases: A comprehensive review. IJID One Health 2025, 6, 100058. [Google Scholar] [CrossRef]
  4. Nucci, D.; Pennisi, F.; Pinto, A.; De Ponti, E.; Ricciardi, G.E.; Signorelli, C.; Veronese, N.; Castagna, A.; Maggi, S.; Cadeddu, C.; et al. Impact of extreme weather events on food security among older people: A systematic review. Aging Clin. Exp. Res. 2025, 37, 137. [Google Scholar] [CrossRef]
  5. Chilakam, N.; Lakshminarayanan, V.; Keremutt, S.; Rajendran, A.; Thunga, G.; Poojari, P.G.; Rashid, M.; Mukherjee, N.; Bhattacharya, P.; John, D. Economic Burden of Mosquito-Borne Diseases in Low- and Middle-Income Countries: Protocol for a Systematic Review. JMIR Res. Protoc. 2023, 12, e50985. [Google Scholar] [CrossRef]
6. Roiz, D.; Pontifes, P.A.; Diagne, C.; Leroy, B.; Vaissi, A. The rising global economic costs of invasive Aedes mosquitoes and Aedes-borne diseases. Sci. Total Environ. 2024, 933, 173054. [Google Scholar] [CrossRef] [PubMed]
  7. Lim, A.; Shearer, F.M.; Sewalk, K.; Pigott, D.M.; Clarke, J.; Ghouse, A.; Judge, C.; Kang, H.; Messina, J.P.; Kraemer, M.U.G.; et al. The overlapping global distribution of dengue, chikungunya, Zika and yellow fever. Nat. Commun. 2025, 16, 3418. [Google Scholar] [CrossRef] [PubMed]
  8. Cintra, A.M.; Noda-Nicolau, N.M.; de Oliveira Soman, M.L.; de Andrade Affonso, P.H.; Valente, G.T.; Grotto, R.M.T. The Main Arboviruses and Virus Detection Methods in Vectors: Current Approaches and Future Perspectives. Pathogens 2025, 14, 416. [Google Scholar] [CrossRef]
  9. Ureña, G.E.; Diaz, Y.; Pascale, J.M.; Lo, S. A framework for the early detection and prediction of dengue outbreaks in the Republic of Panama. Front. Trop. Dis. 2025, 5, 1465856. [Google Scholar] [CrossRef]
  10. Patiño, L.; Benítez, A.D.; Carrazco-Montalvo, A.; Regato-Arrata, M. Genomics for Arbovirus Surveillance: Considerations for Routine Use in Public Health Laboratories. Viruses 2024, 16, 1242. [Google Scholar] [CrossRef]
  11. Pinto, A.; Pennisi, F.; Ricciardi, G.E.; Signorelli, C.; Gianfredi, V. Evaluating the impact of artificial intelligence in antimicrobial stewardship: A comparative meta-analysis with traditional risk scoring systems. Infect. Dis. Now 2025, 55, 105090. [Google Scholar] [CrossRef]
  12. Abdi, Y.H.; Abdullahi, Y.B.; Abdi, M.S.; Bashir, S.G.; Ahmed, N.I. Using Artificial Intelligence in Vector Control: A New Path for Public Health. J. Vector Borne Dis. 2025, 144, 25. [Google Scholar] [CrossRef] [PubMed]
  13. Pennisi, F.; Pinto, A.; Ricciardi, G.E.; Signorelli, C.; Gianfredi, V. Artificial intelligence in antimicrobial stewardship: A systematic review and meta-analysis of predictive performance and diagnostic accuracy. Eur. J. Clin. Microbiol. Infect. Dis. 2025, 44, 463–513. [Google Scholar] [CrossRef]
  14. Brady, O.J.; Bastos, L.S.; Caldwell, J.M.; Cauchemez, S.; Clapham, H.E.; Dorigatti, I.; Gaythorpe, K.A.M.; Hu, W.; Hussain-Alkhateeb, L.; Johansson, M.A.; et al. Why the growth of arboviral diseases necessitates a new generation of global risk maps and future projections. PLoS Comput. Biol. 2025, 21, e1012771. [Google Scholar] [CrossRef]
15. Pinto, A.; Pennisi, F.; Odelli, S.; De Ponti, E.; Veronese, N.; Signorelli, C.; Baldo, V.; Gianfredi, V. Artificial Intelligence in the Management of Infectious Diseases in Older Adults: Diagnostic, Prognostic, and Therapeutic Applications. Biomedicines 2025, 13, 2525. [Google Scholar] [CrossRef]
  16. Velasco, H.; Ortiz, S.; Catano-Lopez, A.; Castro, C.; Martin-Barreiro, C.; Leiva, V. Integrating machine learning and time-to-event models to explain and predict risk of hospitalization due to dengue in Colombia. Sci. Rep. 2025, 15, 38847. [Google Scholar] [CrossRef] [PubMed]
  17. Freitas, L.P.; Ferreira, D.A.d.C.; Lana, R.M.; Câmara, D.C.P.; Portella, T.P.; Carvalho, M.S.; Gouveia, A.S.; de Almeida, I.F.; Araujo, E.C.; Vacaro, L.B.; et al. A statistical model for forecasting probabilistic epidemic bands for dengue cases in Brazil. Infect. Dis. Model. 2025, 10, 1479–1487. [Google Scholar] [CrossRef] [PubMed]
  18. Al Mobin, M. Forecasting dengue in Bangladesh using meteorological variables with a novel feature selection approach. Sci. Rep. 2024, 14, 32073. [Google Scholar] [CrossRef]
19. Anggraeni, W.; Yuniarno, E.M.; Rachmadi, R.F.; Purnomo, M.H. A Sparse Representation of Social Media, Internet Query, and Surveillance Data to Forecast Dengue Case Number using Hybrid Decomposition-Bidirectional LSTM. Int. J. Intell. Eng. Syst. 2021, 14, 209–225. [Google Scholar] [CrossRef]
  20. Nur, D.; Ningrum, A.; Li, Y.J.; Hsu, C. Artificial Intelligence Approach for Severe Dengue Early Warning System. Stud. Health Technol. Inform. 2024, 310, 881–885. [Google Scholar] [CrossRef]
  21. Anno, S.; Hara, T.; Kai, H.; Lee, M.; Chang, Y.; Oyoshi, K.; Mizukami, Y.; Tadono, T. Spatiotemporal dengue fever hotspots associated with climatic factors in Taiwan including outbreak predictions based on machine-learning. Geospat. Health 2019, 14, 183–194. [Google Scholar] [CrossRef]
22. Anno, S.; Tsubasa, H.; Sugita, S.; Yasumoto, S.; Sasaki, Y.; Oyoshi, K. Challenges and implications of predicting the spatiotemporal distribution of dengue fever outbreak in Chinese Taiwan using remote sensing data and deep learning. Geo-Spat. Inf. Sci. 2024, 27, 1155–1161. [Google Scholar] [CrossRef]
23. Buebos-Esteve, D.E.; Dagamac, N.H.A. Spatiotemporal models of dengue epidemiology in the Philippines: Integrating remote sensing and interpretable machine learning. Acta Trop. 2024, 255, 107225. [Google Scholar] [CrossRef]
  24. Carvajal, T.M.; Viacrusis, K.M.; Hernandez, L.F.T.; Ho, H.T.; Amalin, D.M.; Watanabe, K. Machine learning methods reveal the temporal pattern of dengue incidence using meteorological factors in metropolitan Manila, Philippines. BMC Infect. Dis. 2018, 18, 183. [Google Scholar] [CrossRef] [PubMed]
  25. Chen, Y.; Hui, J.; Ong, Y.; Rajarethinam, J.; Yap, G.; Ng, L.C.; Cook, A.R. Neighbourhood level real-time forecasting of dengue cases in tropical urban Singapore. BMC Med. 2018, 16, 129. [Google Scholar] [CrossRef]
  26. Cheng, Y.; Cheng, R.; Xu, T.; Tan, X.; Bai, Y.; Yang, J. Integrating meteorological data and hybrid intelligent models for dengue fever prediction. BMC Public Health 2025, 25, 1516. [Google Scholar] [CrossRef]
  27. Chowdhury, A.H. Comparison of Deep Learning and Gradient Boosting: ANN Versus XGBoost for Climate—Based Dengue Prediction in Bangladesh. Health Sci. Rep. 2025, 8, e70714. [Google Scholar] [CrossRef]
28. Zuanna, T.D.; Del Manso, M.; Giambi, C.; Riccardo, F.; Bella, A.; Caporali, M.G.; Dente, M.G.; Declich, S.; The Italian Survey CARE Working Group. Immunization offer targeting migrants: Policies and practices in Italy. Int. J. Environ. Res. Public Health 2018, 15, 968. [Google Scholar] [CrossRef]
  29. Arya Dala, I.M.Y.; Darma Putra, I.K.G.; Buana, P.W. Forecasting Cases of Dengue Hemorrhagic Fever Using the Backpropagation, Gaussians and Support-Vector Machine Methods. J. RESTI 2021, 5, 335–341. [Google Scholar] [CrossRef]
  30. Kumar, D.; Omveer, D.; Yatindra, S.; Ram, G. Prediction of dengue patients using deep learning methods amid complex weather conditions in Jaipur, India. Discov. Public Health 2025, 22, 58. [Google Scholar] [CrossRef]
31. Doni, A.R.; Sasipraba, T. LSTM-RNN Based Approach for Prediction of Dengue Cases in India. Ingénierie des Systèmes d’Information 2020, 25, 327–335. [Google Scholar]
32. Edussuriya, C.; Deegalla, S.; Gawarammana, I. An accurate mathematical model predicting number of dengue cases in tropics. PLoS Negl. Trop. Dis. 2021, 15, e0009756. [Google Scholar] [CrossRef]
33. Francisco, M.E.; Carvajal, T.M.; Watanabe, K. Hybrid Machine Learning Approach to Zero-Inflated Data Improves Accuracy of Dengue Prediction. PLoS Negl. Trop. Dis. 2024, 18, e0012599. [Google Scholar] [CrossRef]
  34. Guo, P.; Liu, T.; Zhang, Q.; Wang, L.; Xiao, J.; Zhang, Q.; Luo, G.; Li, Z.; He, J.; Zhang, Y.; et al. Developing a dengue forecast model using machine learning: A case study in China. PLoS Neglected Trop. Dis. 2017, 11, e0005973. [Google Scholar] [CrossRef]
35. Handari, B.D.; Niman, I.M.S.; Hasan, A.; Purba, J.R.P.; Hertono, G.F. Comparation of Elman Neural Network, Long Short-Term Memory, and Gated Recurrent Unit in Predicting Dengue Hemorrhagic Fever at DKI Jakarta. Commun. Math. Biol. Neurosci. 2021, 2021, 87. [Google Scholar] [CrossRef]
  36. Husin, N.A.; Mustapha, N.; Sulaiman, N. Performance of Hybrid GANN in Comparison with Other Standalone Models on Dengue Outbreak Prediction. J. Comput. Sci. 2016, 12, 300–306. [Google Scholar] [CrossRef]
  37. Islam, S.; Shahrear, P.; Saha, G. Mathematical analysis and prediction of future outbreak of dengue on time-varying contact rate using machine learning approach. Comput. Biol. Med. 2024, 178, 108707. [Google Scholar] [CrossRef]
  38. Ismail, S.; Fildes, R.; Ahmad, R.; Najdah, W.; Mohamad, W.; Omar, T. The practicality of Malaysia dengue outbreak forecasting model as an early warning system. Infect. Dis. Model. 2022, 7, 510–525. [Google Scholar] [CrossRef] [PubMed]
  39. Javaid, M.; Sarfraz, M.S.; Aftab, M.U.; Zaman, Q.; Rauf, H.T.; Alnowibet, K.A. WebGIS-Based Real-Time Surveillance and Response System for Vector-Borne Infectious Diseases. Int. J. Environ. Res. Public Health 2023, 20, 3740. [Google Scholar] [CrossRef] [PubMed]
  40. Jayabalan, D.; Elango, S. ICE-VDOP: An integrated clustering and ensemble machine learning methods for an enhanced vector-borne disease outbreak prediction using climatic variables. Int. J. Inf. Technol. 2024, 16, 2077–2088. [Google Scholar] [CrossRef]
  41. Kerdprasop, N.; Kerdprasop, K.; Chuaybamroong, P. Computational Intelligence and Statistical Learning Performances on Predicting Dengue Incidence using Remote Sensing Data. Adv. Sci. Technol. Eng. Syst. J. 2020, 5, 344–350. [Google Scholar] [CrossRef]
  42. Cuomo, G.; Franconi, I.; Riva, N.; Bianchi, A.; Digaetano, M.; Santoro, A.; Codeluppi, M.; Bedini, A.; Guaraldi, G.; Mussini, C. Migration and health: A retrospective study about the prevalence of HBV, HIV, HCV, tuberculosis and syphilis infections amongst newly arrived migrants screened at the Infectious Diseases Unit of Modena, Italy. J. Infect. Public Health 2019, 12, 200–204. [Google Scholar] [CrossRef]
  43. Kesorn, K.; Ongruk, P.; Chompoosri, J.; Phumee, A. Morbidity Rate Prediction of Dengue Hemorrhagic Fever (DHF) Using the Support Vector Machine and the Aedes aegypti Infection Rate in Similar Climates and Geographical Areas. PLoS ONE 2015, 10, e0125049. [Google Scholar] [CrossRef]
  44. Kiang, M.V.; Santillana, M.; Chen, J.T.; Onnela, J.P.; Krieger, N.; Monsen, K.E.; Ekapirat, N.; Areechokchai, D.; Prempree, P.; Maude, R.J.; et al. Incorporating human mobility data improves forecasts of Dengue fever in Thailand. Sci. Rep. 2021, 11, 923. [Google Scholar] [CrossRef]
  45. Koh, Y.; Spindler, R.; Sandgren, M.; Jiang, J. A model comparison algorithm for increased forecast accuracy of dengue fever incidence in Singapore and the auxiliary role of total precipitation information. Int. J. Environ. Health Res. 2018, 28, 535–552. [Google Scholar] [CrossRef] [PubMed]
  46. Kukkar, A.; Kumar, Y.; Sandhu, J.K.; Kaur, M.; Walia, T.S. DengueFog: A Fog Computing-Enabled Weighted Random Forest-Based Smart Health Monitoring System for Automatic Dengue Prediction. Diagnostics 2024, 14, 624. [Google Scholar] [CrossRef]
47. Dey, S.K.; Rahman, M.; Howlader, A.; Siddiqi, U.R.; Uddin, K.M.M.; Borhan, R.; Rahman, E.U. Prediction of dengue incidents using hospitalized patients, metrological and socio-economic data in Bangladesh: A machine learning approach. PLoS ONE 2022, 17, e0270933. [Google Scholar] [CrossRef]
  48. Kuo, C.Y.; Yang, W.W.; Chia, E.; Su, Y. Improving dengue fever predictions in Taiwan based on feature selection and random forests. BMC Infect. Dis. 2024, 24, 334. [Google Scholar] [CrossRef]
  49. Liu, K.; Wang, T.; Yang, Z.; Huang, X.; Milinovich, G.J.; Lu, Y.; Jing, Q.; Xia, Y.; Zhao, Z.; Yang, Y.; et al. Using Baidu Search Index to Predict Dengue Outbreak in China. Sci. Rep. 2016, 6, 38040. [Google Scholar] [CrossRef]
50. Liu, K.; Zhang, M.; Xi, G.; Deng, A.; Song, T.; Li, Q. Enhancing fine-grained intra-urban dengue forecasting by integrating spatial interactions of human movements between urban regions. PLoS Negl. Trop. Dis. 2020, 14, e0008924. [Google Scholar] [CrossRef]
  51. Lu, X.; Teh, S.Y.; Tay, C.J.; Abu Kassim, N.F.; Fam, P.S.; Soewono, E. Application of multiple linear regression model and long short-term memory with compartmental model to forecast dengue cases in Selangor, Malaysia based on climate variables. Infect. Dis. Model. 2025, 10, 240–256. [Google Scholar] [CrossRef]
  52. Majeed, M.A.; Shafri, H.Z.M.; Wayayok, A.; Zulkafli, Z. Prediction of dengue cases using the attention-based long short-term memory (LSTM) approach. Geospat. Health 2023, 18, 1. [Google Scholar] [CrossRef]
  53. Majeed, M.A.; Zulhaidi, H.; Shafri, M.; Zulkafli, Z. A Deep Learning Approach for Dengue Fever Prediction in Malaysia Using LSTM with Spatial Attention. Int. J. Environ. Res. Public Health 2023, 20, 4130. [Google Scholar] [CrossRef]
  54. Majeed, M.A.; Shafri, H.Z.M.; Zulkafli, Z.; Wayayok, A. Dengue fever prediction using LSTM and integrated temporal—Spatial attention: A case study of Malaysia. Spat. Inf. Res. 2025, 33, 5. [Google Scholar] [CrossRef]
  55. Mayrose, H.; Sampathila, N.; Bairy, G.M.; Nayak, T.; Saravu, K. An explainable artificial intelligence integrated system for automatic detection of dengue from images of blood smears using transfer learning. IEEE Access 2024, 12, 41750–41762. [Google Scholar] [CrossRef]
  56. Al Mobin, M. Multivariate forecasting of dengue infection in Bangladesh: Evaluating the influence of data downscaling on machine learning predictive accuracy. BMC Infect. Dis. 2025, 25, 761. [Google Scholar] [CrossRef]
  57. Farisha, N.; Krishnan, M.; Ahmad, Z.; Ahmad, A.; Jamaludin, M. Predicting Dengue Outbreak based on Meteorological Data Using Artificial Neural Network and Decision Tree Models. Int. J. Inform. Vis. 2022, 6, 597–603. [Google Scholar]
  58. Mustaffa, N.A.; Zahari, S.M.; Farhana, N.A.; Nasir, N.; Azil, A.H. Forecasting the incidence of dengue fever in Malaysia: A comparative analysis of seasonal ARIMA, dynamic harmonic regression, and neural network models. Int. J. Adv. Appl. Sci. 2024, 11, 20–31. [Google Scholar] [CrossRef]
  59. Necesito, I.V.; Velasco, J.M.; Kwak, J.; Lee, J.H.; Lee, M.J.; Kim, J.S.; Kim, H.S. Combination of Univariate Long-Short Term Memory Network And Wavelet Transform For Predicting Dengue Case Density In The National Capital Region, The Philippines. Southeast Asian J. Trop. Med. Public Health 2021, 52, 479–494. [Google Scholar]
  60. Olmoguez, I.L.G.; Catindig, M.A.C.; Fel, M.; Amongos, L.; Lazan, F.G. Developing a Dengue Forecasting Model: A Case Study in Iligan City. Int. J. Adv. Comput. Sci. Appl. 2020, 10, 9. [Google Scholar] [CrossRef]
  61. Ong, J.; Liu, X.; Rajarethinam, J.; Kok, S.Y.; Liang, S.; Tang, S.; Cook, A.R.; Ng, L.C.; Yap, G. Mapping dengue risk in Singapore using Random Forest. PLoS Neglected Trop. Dis. 2018, 12, e0006587. [Google Scholar] [CrossRef]
  62. Ong, S.Q.; Isawasan, P.; Mohiddin, A.; Ngesom, M.; Shahar, H.; Lasim, A.; Nair, G. Predicting dengue transmission rates by comparing different machine learning models with vector indices and meteorological data. Sci. Rep. 2023, 13, 19129. [Google Scholar] [CrossRef]
63. Patra, S.; Jana, S.; Adak, S.; Kar, T.K. A deep learning architecture using hybrid and stacks to forecast weekly dengue cases in Laos. Eur. Phys. J. B 2024, 97, 110. [Google Scholar] [CrossRef]
  64. Puengpreeda, A.; Yhusumrarn, S.; Sirikulvadhana, S. Weekly Forecasting Model for Dengue Hemorrhagic Fever Outbreak in Thailand. Eng. J. 2020, 24, 71–87. [Google Scholar] [CrossRef]
  65. Rahman, S.; Shiddik, A.B. Explainable artificial intelligence for predicting dengue outbreaks in Bangladesh using eco-climatic triggers. Glob. Epidemiol. 2025, 10, 100210. [Google Scholar] [CrossRef]
  66. Ren, H. Forecasting and mapping dengue fever epidemics in China: A spatiotemporal analysis. Infect. Dis. Poverty 2024, 13, 14–28. [Google Scholar] [CrossRef] [PubMed]
  67. Salim, N.A.M.; Wah, Y.B.; Reeves, C.; Smith, M.; Yaacob, W.F.W.; Mudin, R.N.; Dapari, R.; Sapri, N.N.F.F.; Haque, U. Prediction of dengue outbreak in Selangor Malaysia using machine learning techniques. Sci. Rep. 2021, 11, 939. [Google Scholar] [CrossRef]
  68. Salsabiila, N.J. CNN + BiGRU-Attention Classification and TiDE-PSO Forecasting Approach for Social Media-based Predictive Analysis of Dengue. Int. J. Intell. Eng. Syst. 2025, 18, 716–735. [Google Scholar] [CrossRef]
69. Shaikh, S.G.; Sureshkumar, B.; Narang, G. Development of optimized ensemble classifier for dengue fever prediction and recommendation system. Biomed. Signal Process. Control 2023, 85, 104809. [Google Scholar] [CrossRef]
  70. Shi, Y.; Liu, X.; Kok, S.; Rajarethinam, J.; Liang, S.; Yap, G.; Chong, C.-S.; Lee, K.-S.; Tan, S.S.; Chin, C.K.Y.; et al. Three-Month Real-Time Dengue Forecast Models: An Early Warning System for Outbreak Alerts and Policy Decision Support in Singapore. Environ. Health Perspect. 2016, 124, 1369–1375. [Google Scholar] [CrossRef] [PubMed]
  71. Rahman, S. Dengue Early Warning System and Outbreak Prediction Tool in Bangladesh Using Interpretable Tree—Based Machine Learning Model. Health Sci. Rep. 2025, 8, e70726. [Google Scholar] [CrossRef] [PubMed]
  72. Stavelin Abhinandithe, K.; Madhu, B.; Balasubramanian, S.; Ramachandran, S. Forecasting Multivariate time-series data using LSTM Neural Network in Mysore district, Karnataka. Indian J. Public Health Res. Dev. 2022, 13, 2–7. [Google Scholar] [CrossRef]
  73. Tian, N.; Zheng, J.; Li, L.; Xue, J.; Xia, S.; Lv, S.; Zhou, X.-N. Precision Prediction for Dengue Fever in Singapore: A Machine Learning Approach Incorporating Meteorological Data. Trop. Med. Infect. Dis. 2024, 9, 72. [Google Scholar] [CrossRef]
  74. Tuan, D.A. Leveraging Climate Data for Dengue Forecasting in Ba Ria Vung Tau Province, Vietnam: An Advanced Machine Learning Approach. Trop. Med. Infect. Dis. 2024, 9, 250. [Google Scholar] [CrossRef]
  75. Wu, C.; Kao, S. Knowledge discovery in open data for epidemic disease prediction. Health Policy Technol. 2021, 10, 126–134. [Google Scholar] [CrossRef]
  76. Nejad, F.Y.; Varathan, K.D. Identification of significant climatic risk factors and machine learning models in dengue outbreak prediction. BMC Med. Inform. Decis. Mak. 2021, 21, 141. [Google Scholar] [CrossRef]
  77. Yeh, D.Y.; Leu, J.H.; Ye, S.; Cheng, C.H. An intelligent autoregressive-distributed lag model: A climate-driven approach for predicting dengue fever incidence in Taiwan cities. Acta Trop. 2025, 269, 107761. [Google Scholar] [CrossRef]
  78. Yi, C.; Vajdi, A.; Ferdousi, T.; Cohnstaedt, L.W.; Scoglio, C. PICTUREE—Aedes: A Web Application for Dengue Data Visualization and Case Prediction. Pathogens 2023, 12, 771. [Google Scholar] [CrossRef]
79. Zhao, X.; Li, K.; Ke, C.; Ang, E.; Hao, K. A deep learning based hybrid architecture for weekly dengue incidences forecasting. Chaos Solitons Fractals 2023, 168, 113170. [Google Scholar] [CrossRef]
  80. Baquero, O.S.; Maria, L.; Santana, R.; Chiaravalloti-Neto, F. Dengue forecasting in São Paulo city with generalized additive models, artificial neural networks and seasonal autoregressive integrated moving average models. PLoS ONE 2018, 13, e0195065. [Google Scholar] [CrossRef]
  81. Bogado, J.V.; Schaerer, C.E.; Stalder, D.H.; Mart, G. Cluster-based LSTM models to improve Dengue cases forecast. CLEI Electron. J. 2023, 26, 1–14. [Google Scholar] [CrossRef]
  82. Bomfim, R.; Pei, S.; Shaman, J.; Yamana, T.; Makse, A.; Andrade, J.S.; Neto, A.S.L.; Furtado, V. Predicting dengue outbreaks at neighbourhood level using human mobility in urban areas. J. R. Soc. Interface 2020, 17, 20200691. [Google Scholar] [CrossRef]
  83. Campbell, K.M.; Haldeman, K.; Lehnig, C.; Munayco, C.V.; Halsey, S.; Laguna-torres, V.A.; Yagui, M.; Morrison, A.C.; Lin, C.-D.; Scott, T.W. Weather Regulates Location, Timing, and Intensity of Dengue Virus Transmission between Humans and Mosquitoes. PLoS Neglected Trop. Dis. 2015, 9, e0003957. [Google Scholar] [CrossRef]
  84. Chen, X.; Moraga, P. Forecasting Dengue across Brazil with LSTM Neural Networks and SHAP-Driven Lagged Climate and Spatial Effects. BMC Public Health 2024, 25, 973. [Google Scholar] [CrossRef] [PubMed]
  85. Chen, X.; Moraga, P. Dengue forecasting and outbreak detection in Brazil using LSTM: Integrating human mobility and climate factors. Infect. Dis. Model. 2025, 11, 338–354. [Google Scholar] [CrossRef]
  86. Cordeiro, C.; Lins, C.; Ana, D.L.; Gomes, C.; Machado, G.; Moreno, M.; Musah, A.; Aldosery, A.; Dutra, L.; Ambrizzi, T.; et al. Spatiotemporal forecasting for dengue, chikungunya fever and Zika using machine learning and artificial expert committees based on meta-heuristics. Res. Biomed. Eng. 2022, 38, 499–537. [Google Scholar] [CrossRef]
87. Silva, S.T.; Gabrick, E.C.; Protachevicz, P.R.; Iarosz, K.C. When climate variables improve the dengue forecasting: A machine learning approach. Eur. Phys. J. Spec. Top. 2025, 234, 555–569. [Google Scholar] [CrossRef]
88. Ferdousi, T.; Cohnstaedt, L.W. A Windowed Correlation-Based Feature Selection Method to Improve Time Series Prediction of Dengue Fever Cases. IEEE Access 2021, 9, 141210–141222. [Google Scholar] [CrossRef]
  89. Hamlet, A.; Ramos, D.G.; Gaythorpe, K.A.M.; Pecego, A.; Romano, M.; Garske, T.; Ferguson, N.M. Seasonality of agricultural exposure as an important predictor of seasonal yellow fever spillover in Brazil. Nat. Commun. 2021, 12, 3647. [Google Scholar] [CrossRef]
90. Koplewitz, G.; Lu, F.; Clemente, L.; Buckee, C.; Santillana, M. Predicting dengue incidence leveraging internet-based data sources: A case study in 20 cities in Brazil. PLoS Negl. Trop. Dis. 2022, 16, e0010071. [Google Scholar] [CrossRef]
  91. Li, Z.; Gurgel, H.; Xu, L.; Yang, L.; Dong, J. Improving Dengue Forecasts by Using Geospatial Big Data Analysis in Google Earth Engine and the Historical Dengue Information-Aided Long Short Term Memory Modeling. Biology 2022, 11, 169. [Google Scholar] [CrossRef] [PubMed]
92. Mills, C.; Falconi-Agapito, F.; Carrera, J.; Munayco, C.V.; Kraemer, M.U.G. Multi-model approach to understand and predict past and future dengue epidemic dynamics. R. Soc. Open Sci. 2025, 12, 1–32. [Google Scholar] [CrossRef]
93. Mussumeci, E.; Coelho, F.C. Large-scale multivariate forecasting models for Dengue—LSTM versus random forest regression. Spat. Spatiotemporal Epidemiol. 2020, 35, 100372. [Google Scholar] [CrossRef]
94. Roster, K.; Connaughton, C.; Rodrigues, F.A. Machine-Learning-Based Forecasting of Dengue Fever in Brazilian Cities Using Epidemiologic and Meteorological Variables. Am. J. Epidemiol. 2022, 191, 1803–1812. [Google Scholar] [CrossRef] [PubMed]
  95. Sofía, B.; López, S.; Nolberto, D.C.; Antonio, J.; Gutiérrez, T.; López, Y.G. Traditional Machine Learning based on Atmospheric Conditions for Prediction of Dengue Presence. Comput. Sist. 2023, 27, 769–777. [Google Scholar] [CrossRef]
96. Gendriz, I.S.; de Souza, G.F.; de Andrade, I.G.M.; Duarte, A.; Neto, D.; Tavares, A.D.M. Data-Driven computational intelligence applied to dengue outbreak forecasting: A case study at the scale of the city of Natal, RN—Brazil. Sci. Rep. 2022, 12, 6550. [Google Scholar] [CrossRef]
97. Sebastianelli, A.; Spiller, D.; Carmo, R.; Wheeler, J.; Nowakowski, A.; Jacobson, L.V.; Kim, D.; Barlevi, H.; Cordero, Z.E.R.; Colón-González, F.J.; et al. A reproducible ensemble machine learning approach to forecast dengue outbreaks. Sci. Rep. 2024, 14, 3807. [Google Scholar] [CrossRef]
  98. Soliman, M. Ensemble forecasting of the Zika space-time spread with topological data analysis. Environmetrics 2020, 31, e2629. [Google Scholar] [CrossRef]
  99. Souza, C.; Maia, P.; Stolerman, L.M.; Rolla, V.; Velho, L. Predicting dengue outbreaks in Brazil with manifold learning on climate data. Expert Syst. Appl. 2022, 192, 116324. [Google Scholar] [CrossRef]
  100. Theodorakos, K.; Broeckhove, J.; Willem, L. Examination of influencing factors and high-risk regions of dengue in Nicaragua, using spatiotemporal compartmental simulations. Trop. Med. Int. Health 2018, 22, 156–157. [Google Scholar]
101. Zhao, N.; Charland, K.; Carabali, M.; Nsoesie, E.O.; Maheu-Giroux, M.; Rees, E.; Yuan, M.; Balaguera, C.G.; Ramirez, G.J.; Zinszer, K. Machine learning and dengue forecasting: Comparing random forests and artificial neural networks for predicting dengue burden at national and sub-national scales in Colombia. PLoS Negl. Trop. Dis. 2020, 14, e0008056. [Google Scholar] [CrossRef]
  102. Akhtar, M.; Kraemer, M.U.G.; Gardner, L.M. A dynamic neural network model for predicting risk of Zika in real time. BMC Med. 2019, 17, 171. [Google Scholar] [CrossRef]
  103. Appice, A.; Gel, Y.R.; Iliev, I. A Multi-Stage Machine Learning Approach to Predict Dengue Incidence: A Case Study in Mexico. IEEE Access 2020, 8, 52713–52725. [Google Scholar] [CrossRef]
  104. Gutiérrez, R.A.C.; Márquez, D.C.A.; Gonzalez, N.P.B. Parallel prediction of dengue cases with different risks in Mexico using an artificial neural network model considering meteorological data. Int. J. Biometeorol. 2024, 68, 1043–1060. [Google Scholar] [CrossRef]
  105. Holcomb, K.M.; Staples, J.E.; Nett, R.J.; Beard, C.B.; Petersen, L.R. Multi-Model Prediction of West Nile Virus Neuroinvasive Disease with Machine Learning for Identification of Important Regional Climatic Drivers. GeoHealth 2023, 7, e2023GH000906. [Google Scholar] [CrossRef]
106. Laureano-Rosario, A.E.; Duncan, A.P.; Mendez-Lazaro, P.A.; Garcia-Rejon, J.E.; Gomez-Carro, S.; Farfan-Ale, J.; Savic, D.A.; Muller-Karger, F.E. Application of Artificial Neural Networks for Dengue Fever Outbreak Predictions in the Northwest Coast of Yucatan, Mexico and San Juan, Puerto Rico. Trop. Med. Infect. Dis. 2018, 3, 5. [Google Scholar] [CrossRef] [PubMed]
  107. Yamana, T.K.; Kandula, S.; Shaman, J. Superensemble forecasts of dengue outbreaks. J. R. Soc. Interface 2016, 13, 20160410. [Google Scholar] [CrossRef]
  108. Salami, D.; Sousa, C.A.; Oliveira, R. Predicting dengue importation into Europe, using machine learning and model-agnostic methods. Sci. Rep. 2020, 10, 9689. [Google Scholar] [CrossRef] [PubMed]
  109. Mulwa, D.; Kazuzuru, B.; Misinzo, G. An XGBoost Approach to Predictive Modelling of Rift Valley Fever Outbreaks in Kenya Using Climatic Factors. Big Data Cogn. Comput. 2024, 8, 148. [Google Scholar] [CrossRef]
  110. Teurlai, M.; Eug, C.; Cavarero, V.; Degallier, N. Socio-economic and Climate Factors Associated with Dengue Fever Spatial Heterogeneity: A Worked Example in New Caledonia. PLoS Neglected Trop. Dis. 2015, 9, e0004211. [Google Scholar] [CrossRef]
111. Benedum, C.M.; Shea, K.M.; Jenkins, H.E.; Kim, L.Y.; Markuzon, N. Weekly dengue forecasts in Iquitos, Peru; San Juan, Puerto Rico; and Singapore. PLoS Negl. Trop. Dis. 2020, 14, e0008710. [Google Scholar] [CrossRef]
112. Anh, D.; Vu, P.; Uyen, N. Bridging the predictive divide: A hybrid early warning system for scalable and real-time dengue surveillance in LMICs. Acta Trop. 2025, 269, 107765. [Google Scholar] [CrossRef]
  113. Long, H.; Chen, Y.; Feng, J.; Chen, J.; Zhang, X.; Han, W. Annual global dengue dynamics are related to multi-source factors revealed by a machine learning prediction analysis. PLoS Neglected Trop. Dis. 2025, 19, e0013232. [Google Scholar] [CrossRef] [PubMed]
114. Panja, M.; Chakraborty, T.; Shahid, S.; Ghosh, I. An ensemble neural network approach to forecast Dengue outbreak based on climatic condition. Chaos Solitons Fractals 2023, 167, 113124. [Google Scholar] [CrossRef]
  115. Li, Z.; Dong, J. Big Geospatial Data and Data-Driven Methods for Urban Dengue Risk Forecasting: A Review. Remote Sens. 2022, 14, 5052. [Google Scholar] [CrossRef]
  116. Kumar, S.; Vaishali, S.; Isha, S.; Sahil, M. An intelligent healthcare system for predicting and preventing dengue virus infection. Computing 2021, 105, 617–655. [Google Scholar] [CrossRef]
117. Farooq, Z.; Rocklöv, J.; Wallin, J.; Abiri, N.; Sewe, M.O.; Sjödin, H.; Semenza, J.C. Artificial intelligence to predict West Nile virus outbreaks with eco-climatic drivers. Lancet Reg. Health Eur. 2022, 17, 100370. [Google Scholar] [CrossRef]
  118. Coppola, N.; Alessio, L.; De Pascalis, S.; Macera, M.; Di Caprio, G.; Messina, V.; Onorato, L.; Minichini, C.; Stanzione, M.; Stornaiuolo, G.; et al. Effectiveness of test-and-treat model with direct-acting antiviral for hepatitis C virus infection in migrants: A prospective interventional study in Italy. Infect. Dis. Poverty 2024, 13, 39. [Google Scholar] [CrossRef] [PubMed]
  119. Chen, X.; Moraga, P. Assessing dengue forecasting methods: A comparative study of statistical models and machine learning techniques in Rio de Janeiro, Brazil. Trop. Med. Health 2025, 53, 52. [Google Scholar] [CrossRef]
  120. Leung, X.Y.; Islam, R.M.; Adhami, M.; Ilic, D.; McDonald, L.; Palawaththa, S.; Diug, B.; Munshi, S.U.; Karim, N. A systematic review of dengue outbreak prediction models: Current scenario and future directions. PLoS Negl. Trop. Dis. 2023, 17, e0010631. [Google Scholar] [CrossRef]
  121. Sutriyawan, A.; Rahardjo, M.; Martini, M.; Sutiningsih, D.; Rattanapan, C.; Abu Kassim, N.F. Global Forecasting Models for Dengue Outbreaks in Endemic Regions: A Systematic Review. Microbiol. Epidemiol. Immunobiol. 2025, 102, 331–342. [Google Scholar] [CrossRef]
  122. Hoyos, W.; Aguilar, J.; Toro, M. Dengue models based on machine learning techniques: A systematic literature review. Artif. Intell. Med. 2021, 119, 102157. [Google Scholar] [CrossRef] [PubMed]
  123. Liu, B.; Hossain, M.F.; Hossain, S. A comparative evaluation of multiple machine learning approaches for forecasting dengue outbreaks in Bangladesh. Sci. Rep. 2025, 15, 35931. [Google Scholar] [CrossRef]
  124. Nekorchuk, D.M.; Bharadwaja, A.; Simonson, S.; Ortega, E.; França, C.M.B.; Dinh, E.; Reik, R.; Burkholder, R.; Wimberly, M.C. The Arbovirus Mapping and Prediction (ArboMAP) system for West Nile virus forecasting. JAMIA Open 2024, 7, ooad110. [Google Scholar] [CrossRef]
  125. Jaya, I.G.; Andriyana, Y.; Tantular, B.; Pangastuti, S.S.; Kristiani, F. Spatiotemporal Dengue Forecasting for Sustainable Public Health in Bandung, Indonesia: A Comparative Study of Classical, Machine Learning, and Bayesian Models. Sustainability 2025, 17, 6777. [Google Scholar] [CrossRef]
  126. Adde, A.; Roucou, P.; Mangeas, M.; Ardillon, V.; Desenclos, J.-C.; Rousset, D.; Girod, R.; Briolant, S.; Quenel, P.; Flamand, C. Predicting Dengue Fever Outbreaks in French Guiana Using Climate Indicators. PLoS Negl. Trop. Dis. 2016, 10, e0004681. [Google Scholar] [CrossRef] [PubMed]
127. Gianfredi, V.; Nucci, D.; Pennisi, F.; Provenzano, S. Knowledge and attitudes towards Zika virus: An Italian nation-wide cross-sectional study. Ann. Ist. Super. Sanità 2022, 58, 34–41. [Google Scholar] [CrossRef]
  128. Pennisi, F.; Borlini, S.; Cuciniello, R.; D’Amelio, A.C.; Calabretta, R.; Pinto, A.; Signorelli, C. Improving Vaccine Coverage Among Older Adults and High-Risk Patients: A Systematic Review and Meta-Analysis of Hospital-Based Strategies. Healthcare 2025, 13, 1667. [Google Scholar] [CrossRef]
129. Signorelli, C.; Pennisi, F.; Lunetti, C.; Blandi, L.; Pellissero, G.; Fondazione Sanità Futura Working Group. Quality of hospital care and clinical outcomes: A comparison between the Lombardy Region and the Italian national data. Ann. Ig. Med. Prev. Comunità 2024, 36, 234–249. [Google Scholar] [CrossRef]
  130. Pennisi, F.; Pinto, A.; Ricciardi, G.E.; Signorelli, C.; Gianfredi, V. The Role of Artificial Intelligence and Machine Learning Models in Antimicrobial Stewardship in Public Health: A Narrative Review. Antibiotics 2025, 14, 134. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Flow diagram depicting the selection process.
Figure 2. Geographical distribution of studies by continent. The colour gradient represents the number of studies per continent, with darker shades indicating higher counts. Three studies involving multiple continents are not represented in the figure.
Figure 3. Geographical distribution of studies by country. Labels indicate countries with ≥2 studies; shaded countries without labels correspond to those represented by a single study. Nine multicountry studies are not displayed in the figure.
Figure 4. Temporal evolution of artificial intelligence model types used in the included studies (2015–2025), considering both principal model and comparator algorithms.
Figure 5. Frequency of feature types used as model predictors across the included studies. Each bar represents the total number of studies incorporating a given feature category.
Figure 6. Unweighted distributions of classification performance metrics across modelling families. The figure presents the distribution of seven classification metrics (accuracy, AUC, F1-score, NPV, PPV, sensitivity, and specificity) across five modelling families: classical machine learning, deep learning, hybrid/superensemble, mechanistic, and tree-ensemble approaches. Each subplot displays horizontal box-and-scatter representations in which boxplots are shown only when ≥3 observations were available; otherwise, point estimates are plotted individually. Metrics are reported as observed in the original studies without weighting, transformation, or standardisation.
Figure 7. Root mean squared error (RMSE) distributions stratified by error magnitude across modelling families. Panels represent predefined RMSE ranges (≤1, 1–10, 10–1000, >1000). Within each panel, horizontal boxplots and jittered points display unweighted study-level estimates for tree-ensemble, classical machine-learning, deep-learning, hybrid/superensemble, time-series/statistical, and heuristic models.
Figure 8. Multi-panel visualisation of mean absolute error (MAE) values reported across included studies, stratified into four predefined magnitude ranges (very small ≤1, small 1–10, medium 10–1000, large >1000) to account for scale-dependent behaviour of absolute errors. Within each panel, MAE distributions are displayed by modelling family using horizontal boxplots with overlaid jittered points representing individual study estimates.
Figure 9. Comparative visualisation of regression-based performance metrics across modelling families. Multi-panel display summarising mean absolute percentage error (MAPE), mean squared error (MSE), coefficient of determination (R2), and Pearson’s correlation coefficient (r) for all included studies. Each subplot reports the unweighted distribution of study-level estimates across modelling families, illustrating differences in error magnitude, dispersion, and predictive agreement.
Figure 10. Distribution of risk-of-bias assessments across the five PROBAST domains.
Table 1. Descriptive characteristics of included studies.
| First Author (Year) | Study Design | Study Period | Setting | Population | Disease | Case Definition | Prediction Horizon | Missing/Imbalanced Data Handling | Data-Split | Implementation Readiness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Akhtar (2019) [102]Model development/Modelling study2015–2016Multicentre (multi-country or multi-site)GPZikaCase counts—NAShort-term (weekly, ≤4 weeks)Advanced imputationTrain/Test (70/15)Pilot/proof-of-concept
Al Mobin (2024) [18]Forecasting study2010–2024NationalGPDengueMonthly incidence—suspected + labMedium-term (months, >3–12 months)Simple imputationTrain/Val/Test (70/10/20)Research only
Anggraeni (2021) [19]Model development/Modelling study2009–2019Community-based (field/surveillance in population)GPDengueMonthly incidence—NALong-term (≥1 year)NAK-fold CV (K = 5)Research only
Anno (2019) [21]Ecological/Spatiotemporal study1998–2015Sub-national (province/state/municipality)GPDengueMonthly incidence—NANANAK-fold CV (K = 8)Research only
Anno (2024) [22]Ecological/Spatiotemporal study2002–2020NationalGPDengueCase counts—suspected + labShort-term (weekly, ≤4 weeks)Data balancing/ResamplingTemporal (train = 2002–2017; val = 2018; test = 2019–2020)Research only
Appice (2020) [103]Model development/Modelling study1985–2010Sub-national (province/state/municipality)GPDengueMonthly incidence—NALong-term (≥1 year)NATemporal (train = January 1985–December 2009; test = January–December 2010)Research only
Baquero (2018) [80]Forecasting study2000–2016Community-based (field/surveillance in population)GPDengueNANASimple imputationTime-series CV (rolling) [train = January 2000–December 2014; val = with train 165 months + validate next 6 months; test = remainder]Research only
Benedum (2020) [111]Forecasting study2009–2016UrbanGPDengueWeekly incidence—labShort-term (weekly, ≤4 weeks)NATemporal (train = 4 years; test = 1 year)Research only
Bogado (2023) [81]Forecasting study2009–2013Community-based (field/surveillance in population)GPDengueWeekly incidence—NANANATemporal (train = (2009–2012); test = (2013))Research only
Bomfim (2020) [82]Forecasting study2007–2015Sub-national (province/state/municipality)GPDengueWeekly incidence—labShort-term (weekly, ≤4 weeks)NATemporal (train = 2011–2014; test = 2015–2016)Pilot/proof-of-concept
Buebos-Esteve (2024) [23]Ecological/Spatiotemporal study2016–2020Sub-national (province/state/municipality)GPDengueNA—suspected + labLong-term (≥1 year)NATemporalResearch only
Campbell (2015) [83]Ecological/Spatiotemporal study1994–2012Sub-national (province/state/municipality)GPDengueCase counts—suspected + labNASimple imputationTemporal (test = within 2005–2012 by exhaustive classification tree search)Research only
Carvajal (2018) [24]Forecasting study2009–2013Sub-national (province/state/municipality)GPDengueWeekly incidence—suspected + labShort/Medium-term (weeks to ≤3 months)Advanced imputationTemporal (train = 2009–2012; val = 2009–2012; test = 2013)Research only
Chen (2018) [25]Forecasting study2010–2016UrbanGPDengueWeekly incidence—labShort-term (weekly, ≤4 weeks)NATemporal (train = 2010–2015; test = 2016)Public health decision support
Chen (2024) [84]Ecological/Spatiotemporal study2016–2023Sub-national (province/state/municipality)GPDengueCase counts—suspected + labMedium-term (months, >3–12 months)NATemporal (train = January 2016–December 2022; test = January 2023–December 2023)Research only
Chen (2025) [85]Ecological/Spatiotemporal study2016–2023Multicentre (multi-country or multi-site)GPDengueCase counts—suspected + labShort-term (weekly, ≤4 weeks)NATemporal (train = 2016–2022; test = 2023)Research only
Cheng (2025) [26]Forecasting study2005–2024Sub-national (province/state/municipality)GPDengueMonthly incidence—suspected + labLong-term (≥1 year)Simple imputationTrain/Test (70/30)Research only
Chowdhury (2025) [27]Model development/Modelling study2000–2022NationalGPDengueMonthly incidence—labMedium-term (months, >3–12 months)Data cleaning/Exclusion/NormalizationTrain/Test (80/20)Conceptual/simulation study
Conde-Gutiérrez (2024) [104]Forecasting study2012–2019Sub-national (province/state/municipality)GPDengueNAShort-term (weekly, ≤4 weeks)NATrain/Test (80/20)Research only
da Silva (2022) [86]Forecasting study2009–2017Community-based (field/surveillance in population)GPDengue, Chikungunya, ZikaMonthly incidence—NAMedium-term (months, >3–12 months)NAK-fold CV (K = 10)Research only
da Silva (2025) [87]Forecasting study2016–2019
Iquitos: 2001–2012 (597 weeks)
Barranquilla: 2011–2016 (307 weeks)
UrbanGPDengueWeekly incidence—labShort-term (weekly, ≤4 weeks)Simple imputationTrain/Test (70/30)Research only
Dala (2021) [29]Forecasting study2008–2018Community-based (field/surveillance in population)GPDengueCase counts—NALong-term (≥1 year)NATemporal (val = with multiple hold-outs: 50/50)Research only
Dang Anh Tuan (2025) [112]Model development/Modelling study2020–2023
Vietnam (2018–2023)
Sub-national (province/state/municipality)GPDengueNA—labNAData cleaning/Exclusion/NormalizationTemporalConceptual/simulation study
Dhaked (2025) [30]Model development/Modelling study2015–2021UrbanGPDengueMonthly incidence—labMedium-term (months, >3–12 months)Data cleaning/Exclusion/NormalizationTrain/Test (80/20)Research only
Doni (2020) [31]Model development/Modelling study2015–2019NationalGPDengueCase counts—NALong-term (≥1 year)NATemporal (train = 2015–2018; test = 2019)Research only
Edussuriya (2021) [32]Forecasting study2010–2019NationalGPDengueMonthly incidence—suspected + labMedium-term (months, >3–12 months)Data balancing/ResamplingTemporal (train = 2010–2018; test = January–March 2019)Pilot/proof-of-concept
Farooq (2022) [91]Forecasting study2010–2019Community-based (field/surveillance in population)GPWest NileCase counts—suspectedLong-term (≥1 year)Data balancing/ResamplingK-fold CV (K = 5)Research only
Ferdousi (2021) [88]Forecasting study2010–2019Community-based (field/surveillance in population)GPDengueWeekly incidence—NAShort-term (weekly, ≤4 weeks)NATrain/Val/Test (unspecified)Research only
Francisco (2024) [33]Forecasting study2009–2013Sub-national (province/state/municipality)GPDengueWeekly incidence—suspected + labNANATemporal (train = 2009–2012; test = 2013)Research only
Guo (2017) [34]Forecasting study2011–2014Sub-national (province/state/municipality)GPDengueWeekly incidence—suspected + labShort-term (weekly, ≤4 weeks)NATemporal (train = 2011–2013; test = 2014)Research only
Hamlet (2021) [89]Ecological/Spatiotemporal study2003–2018Sub-national (province/state/municipality)GP + NHPYellow feverCase counts—suspected + labShort-term (weekly, ≤4 weeks)Data cleaning/Exclusion/NormalizationTemporal (train = 60–70/Test 30_40; test = 30_40)Proof-of-concept/Early research
Handari (2021) [35]Forecasting study2009–2017Community-based (field/surveillance in population)GPDengueWeekly incidence—NAShort-term (weekly, ≤4 weeks)Simple imputationNAResearch only
Holcomb (2023) [105]Model development/Modelling study2015–2021NationalGPWest NileNA—labLong-term (≥1 year)NATemporal (train = 2015–2019; test = 2020–2021)Pilot/proof-of-concept
Husin (2016) [36]Forecasting studyNASub-national (province/state/municipality)GPDengueWeekly incidence—NANANANAResearch only
Islam (2024) [37]Forecasting study2001–2023NationalGPDengueMonthly incidence—NALong-term (≥1 year)NATrain/Test (80/20)Research only
Ismail (2022) [38]Model development/Modelling study2010–2019Sub-national (province/state/municipality)GPDengueWeekly incidence—NAShort-term (weekly, ≤4 weeks)Simple imputationK-fold CV (K = 10)Pilot/proof-of-concept
Javaid (2023) [39]Model development/Modelling study2014–2018Sub-national (province/state/municipality)GPDengueCase counts—suspected + labNASimple imputationTemporal (test = split)Public health decision support
Jayabalan (2024) [40]Model development/Modelling study2003–2021National and sub-nationalGPDengueMonthly incidence—NAMedium-term (months, >3–12 months)NATrain/Test (70/30)Research only
Kerdprasop (2020) [41]Model development/Modelling study2003–2017UrbanGPDengueMonthly incidence—NAMedium-term (months, >3–12 months)NATemporal (train = 2003–2015 (156 records); test = 2016–2017 (24 records))Research only
Kesorn (2015) [43]Model development/Modelling study2007–2013Sub-national (province/state/municipality)GPDengueNANAData cleaning/Exclusion/NormalizationK-fold CV (K = 10)Research only
Kiang (2021) [44]Forecasting study2010–2017NationalGPDengueMonthly incidence—labMedium-term (months, >3–12 months)NATemporal (test = 42 months (January-2010 → June-2013))Pilot/proof-of-concept
Koh (2018) [45]Forecasting study2016NationalGPDengueWeekly incidence—labNANANAResearch only
Koplewitz (2022) [90]Forecasting study2010–2016Sub-national (province/state/municipality)GPDengueWeekly incidence—suspected + labShort/Medium-term (weeks to ≤3 months)NATemporal (train = 2010–2015; test = 2016)Pilot/proof-of-concept
Kukkar (2024) [46]Model development/Modelling study2016–2020Hospital-based/ClinicalPtsDengueNANANANAConceptual/simulation study
Kumar Dey (2022) [47]Model development/Modelling study2011–2021NationalPtsDengueCase counts—NAMedium-term (months, >3–12 months)Simple imputationTrain/Test (80/20)Conceptual/simulation study
Kuo (2024) [48]Ecological/Spatiotemporal study2013–2015UrbanGPDengueNANAData cleaning/Exclusion/NormalizationTrain/Test (80/20)Research only
Laureano Rosario (2018) [106]Model development/Modelling study1994–2012Community-based (field/surveillance in population)GPDengueWeekly incidence—NAShort-term (weekly, ≤4 weeks)NATrain/Val/Test (unspecified)Research only
Li (2022) [115]Forecasting study2007–2019Community-based (field/surveillance in population)GPDengueWeekly incidence—suspected + labShort-term (weekly, ≤4 weeks)NATemporal (train = 2007–2015; val = 2016–2017; test = 2018–2019)Research only
Li (2022) [91]Forecasting study2013–2020Sub-national (province/state/municipality)GPDengueWeekly incidence—suspected + labShort-term (weekly, ≤4 weeks)NATemporal (train = 2013-mid 2019 (326 weeks); test = until December 2020 (92 w))Research only
Liu (2016) [49]Forecasting study2010–2014Sub-national (province/state/municipality)GPDengueWeekly incidence—NANANANAResearch only
Liu (2020) [50]Model development/Modelling study2015–2019Sub-national (province/state/municipality)GPDengueCase counts—NAShort/Medium-term (weeks to ≤3 months)NATemporal (train = 2015–2018; test = January–September 2019)Research only
Long (2025) [113]Model development/Modelling study1990–2018Multicentre (multi-country or multi-site)GPDengueNA—suspected + labLong-term (≥1 year)Simple imputationK-fold CV (K = 4)Research only
Lu (2025) [51]Forecasting study2014–2020NationalGPDengueCase counts—labShort/Medium-term (weeks to ≤3 months)NATemporal (train = 2014–2018 (Weeks 1–261); test = 2019–2020 (Weeks 262–365))Research only
Majeed (2023) [53]Forecasting study2010–2017Sub-national (province/state/municipality)GPDengueMonthly incidence—labShort-term (weekly, ≤4 weeks)NATemporal (test = 2010–2016)Research only
Majeed (2025) [54]Forecasting study2011–2016NationalGPDengueWeekly incidence—labShort/Medium-term (weeks to ≤3 months)NATemporal (temporal partitioning within the country; not cross-country)Research only
Majeed2 (2023) [52]Forecasting study2010–2016Sub-national (province/state/municipality)GPDengueMonthly incidence—suspected + labShort-term (weekly, ≤4 weeks)NATemporal (test = 2010–2016)Research only
Mayrose (2024) [55]Model development/Modelling study2010–2019NationalGPDengueNAMedium-term (months, >3–12 months)Data balancing/ResamplingTrain/Val/Test (70/20/10)Research only
Mills (2025) [92]Forecasting study2010–2021Sub-national (province/state/municipality)GPDengueMonthly incidence—suspected + labNANATemporal (train = 2010–2017; test = 2018–2021)Research only
Mobin (2025) [56]Forecasting study2010–2023NationalGPDengueMonthly incidence—suspected + labLong-term (≥1 year)Simple imputationTrain/Test (80/20)Research only
Muhamad Krishnan (2022) [57]Model development/Modelling study2015–2019Community-based (field/surveillance in population)GPDengueCase counts—NANAData cleaning/Exclusion/NormalizationTrain/Test (unspecified)Research only
Mulwa (2024) [109]Model development/Modelling study1981–2010NationalGPRift Valley feverMonthly incidence—labMedium-term (months, >3–12 months)Simple imputationTrain/Test (80/20)Research only
Mussumeci (2020) [93]Model development/Modelling study2010–2018Sub-national (province/state/municipality)GPDengueWeekly incidence—suspected + labShort-term (weekly, ≤4 weeks)NATemporal (train = January 2010–June 2017; val = July 2017–June 2018)Research only
Mustaffa (2024) [58]Forecasting study2017–2022NationalGPDengueWeekly incidence—labNANATemporal (train = 207 weeks (2017–2020); test = 99 weeks (2021–2022))Research only
Necesito (2021) [59]Forecasting study1994–2018Community-based (field/surveillance in population)GPDengueMonthly incidence—NAMedium-term (months, >3–12 months)NATemporal (train = 1994–2015)Research only
Ningrum (2024) [20]Ecological/Spatiotemporal study2014–2021Sub-national (province/state/municipality)GPDengueWeekly incidence—NAShort-term (weekly, ≤4 weeks)Data balancing/ResamplingTrain/Test (80/20)Pilot/proof-of-concept
Olmoguez (2019) [60]Model development/Modelling study2008–2017Sub-national (province/state/municipality)GPDengueMonthly incidence—NANASimple imputationNAResearch only
Ong (2018) [61]Ecological/Spatiotemporal study2006–2016UrbanGPDengueCase counts—suspected + labShort-term (weekly, ≤4 weeks)NATemporal (train = 2006–2013)Operational use in surveillance
Ong (2023) [62]Model development/Modelling study2018–2020Sub-national (province/state/municipality)GPDengueNANANATrain/Test (70/30)Research only/proof-of-concept
Panja (2023) [114]Forecasting study1991–2012Community-based (field/surveillance in population)GPDengueWeekly incidence—labShort-term (weekly, ≤4 weeks)NATemporal (train = 1170; test = 1144)Research only
Patra (2025) [63]Forecasting study2013–2023NationalGPDengueWeekly incidence—NANANATemporal (train = October 2013–July 2020; test = last 30% (July 2020–May 2023))Research only
Puengpreedaa (2020) [64]Model development/Modelling study2014–2018Sub-national (province/state/municipality)GPDengueWeekly incidence—NAShort-term (weekly, ≤4 weeks)NAK-fold CV (K = 5)Research only
Rahman (2025) [65]Model development/Modelling study2000–2023NationalGPDengueCase counts—labMedium-term (months, >3–12 months)NATemporal (train = 2000–2019; test = 2020–2023)Research only
Ren (2024) [66]Forecasting study2003–2022Sub-national (province/state/municipality)GPDengueNALong-term (≥1 year)Data balancing/ResamplingTrain/Test (70/30)Research only
Roster (2023) [94]Forecasting study2007–2019Sub-national (province/state/municipality)GPDengueMonthly incidence—suspected + labShort-term (weekly, ≤4 weeks)Data balancing/ResamplingTemporal (train = 2007–2016; test = 2016–2019)Research only
Salami (2020) [108]Model development/Modelling study2010–2015Multicentre (multi-country or multi-site)TDengueCase counts—labMedium-term (months, >3–12 months)Data balancing/ResamplingTemporal (test = split)Research only
Salim (2021) [67]Ecological/Spatiotemporal study2013–2017Sub-national (province/state/municipality)GPDengueWeekly incidence—labShort-term (weekly, ≤4 weeks)Simple imputationTrain/Test (70/30)Research only
Salsabiila (2025) [68]Model development/Modelling study2010–2020National and sub-nationalGPDengueWeekly incidence—labShort-term (weekly, ≤4 weeks)Data balancing/ResamplingTrain/Test (80/20)Research only
Sánchez López (2023) [95]Model development/Modelling study2010–2020Community-based (field/surveillance in population)GPDengueWeekly incidence—suspected + labNAData cleaning/Exclusion/NormalizationK-fold CV (K = 5)Research only
Sanchez-Gendriz (2022) [96]Forecasting study2016–2019UrbanGPDengueWeekly incidence—NAShort-term (weekly, ≤4 weeks)NANAPilot/proof-of-concept
Sebastianelli (2024) [97]Forecasting study2001–2019NationalGPDengueCase counts—NANANATemporal (train = 2001–2016; val = 2017–2019 (Brazil))Pilot/proof-of-concept
Shaikh (2023) [69]Model development/Modelling studyNAOther (benchmark dataset)GPDengueNA—suspected + labNANANAResearch only
Shi (2016) [70]Forecasting study2001–2012UrbanGPDengueWeekly incidence—NAShort-term (weekly, ≤4 weeks)NATemporal (train = 2001–2010; val = 2011–2012)Operational use in surveillance
Siddikur Rahman (2025) [71]Forecasting study2000–2021NationalGPDengueWeekly incidence—NANAAdvanced imputationTemporal (train = 2000–2018; test = 2019–2021)Research only
Soliman (2020) [98]Model development/Modelling study2017–2018Sub-national (province/state/municipality)GPZikaMonthly incidence—NALong-term (≥1 year)NATemporal (train = 2017; test = 2018)Research only
Sood (2020) [116]Model development/Modelling studyNAHospital-based/ClinicalPtsDengueNANANAK-fold CV (K = 10)Research only
Souza (2022) [99]Forecasting study2002–2012Community-based (field/surveillance in population)GPDengueNALong-term (≥1 year)NATemporal (train = first 11 years; val = noise-augmented set for tuning (Gaussian noise); test = last 3–5 years for each city)Research only
Stavelin (2022) [72]Forecasting study2006–2019Sub-national (province/state/municipality)GPDengueMonthly incidence—suspected + labLong-term (≥1 year)Data cleaning/Exclusion/NormalizationNAResearch only
Teurlai (2015) [110]Model development/Modelling study1995–2012Sub-national (province/state/municipality)GPDengueCase counts—suspected + labNANANAResearch only
Theodorakos (2017) [100]Ecological/Spatiotemporal study2002NationalGPDengueMonthly incidence—NANANATemporal (train and validation on the same epidemic season (2002))Conceptual/simulation study
Tian (2024) [73]Ecological/Spatiotemporal study2012–2022NationalGPDengueCase counts—suspected + labShort-term (weekly, ≤4 weeks)Simple imputationTrain/Test (80/20)Research only
Tuan (2024) [74]Forecasting study2010–2020Sub-national (province/state/municipality)GPDengueMonthly incidence—labMedium-term (months, >3–12 months)Simple imputationTemporal (train = January 2010–October 2018; test = November 2018–December 2020)Research only
Wu (2021) [75]Model development/Modelling study2005–2016UrbanGPDengue, Enterovirus, InfluenzaWeekly incidence—labShort-term (weekly, ≤4 weeks)Data balancing/ResamplingTrain/Test (83/17)Research only
Yamana (2016) [107]Model development/Modelling study1990–2013Sub-national (province/state/municipality)GPDengueWeekly incidence—labNANATemporal (train = seasons 1–14; test = seasons 15–23)Research only
Yavari Nejad (2021) [76]Ecological/Spatiotemporal study2010–2013NationalGPDengueWeekly incidence—labShort-term (weekly, ≤4 weeks)Data cleaning/Exclusion/NormalizationTemporal (train/test = 75/25)Pilot/proof-of-concept
Yeh (2025) [77]Forecasting study2014–2018UrbanGPDengueWeekly incidence—labNANATemporal (train = 257 weeks; test = last 4 weeks)Research only
Yi (2023) [78]Model development/Modelling studyNASub-national (province/state/municipality)GPDengueCase counts—suspected + labMedium-term (months, >3–12 months)NATemporal (train = epidemic curves from historical datasets (1960–2012); test = Malaysian outbreak of 2022 (3 timepoint predictions))Public health decision support
Zhao (2020) [101]Forecasting study2014–2018Sub-national (province/state/municipality)GPDengueCase counts—suspected + labShort-term (weekly, ≤4 weeks)Data cleaning/Exclusion/NormalizationTrain/Test (80/20)Pilot/proof-of-concept
Zhao (2023) [79]Forecasting study2012–2022UrbanGPDengueWeekly incidence—NAShort-term (weekly, ≤4 weeks)NATrain/Test (70/30)Research only
GP = General population; lab = laboratory; Pts = Patients (hospital/clinical cases); T = Travelers; val = validation.
Table 2. Results of classification metrics for the principal AI models used in the included studies.
First Author (Year) | Principal AI Model | AI Category | N Variables Included vs. Considered | Validation | AUC | Sensitivity | Specificity | PPV/Precision | NPV | Accuracy | F1-Score
Akhtar (2019) [102]NARX NNClassical ML11/16INT1 w 0.91–0.95, 2 w 0.91–0.93, 4 w 0.83–0.87, 8 w 0.75–0.80, 12 w 0.70–0.74NANANANA1 w 0.94, 2 w 0.92, 4 w 0.88, 8 w 0.82, 12 w 0.78NA
Al Mobin (2024) [18]DT + Sequential Squeeze FSClassical ML12/13INT 5-fold TSCVNANANANANA0.82NA
Anggraeni (2021) [19]BiLSTMDeep LearningNAINTNANANANANANANA
Anno (2019) [21]CNNDeep Learning4/2INT 8-fold CVNANANANANA1.0 → 0.81 (longitude-time), 0.75 (latitude-time), 0.48 → 0.26 (longitude-latitude)NA
Anno (2024) [22]CNNDeep Learning4/4INT train/val/test splitNANANANANASST + Rainfall + SWR 1.00, 0.51; SST only 1.00, 0.51; Rainfall only 1.00, 0.51; SWR only 1.00, 0.51; Rainfall + SWR 1.00, 0.51NA
Appice (2020) [103]AutoTiC-NNClassical ML1/2EXTNANANANANANANA
Baquero (2018) [80]GAM, ANN (MLP), LSTMHybrid/Superensemble4/NAINTNANANANANANANA
Benedum (2020) [111]RFTree Ensemble5/5INT TS split 4 y train; 1 y testNANANANANANANA
Bogado (2023) [81]LSTMDeep Learning4/4INTNANANANANANANA
Bomfim (2020) [82]NNClassical ML2/2INT TS split train 2011–2014; test 2015–2016NA0.91NA0.92NANA0.92
Buebos-Esteve (2024) [23]RFTree Ensemble4/4INT + EXT nested resampling (spatiotemporal LOOCV internal; 3-fold CV external)NANANANANANANA
Campbell (2015) [83] | DT | Classical ML | 2/2 | INT | NA | 0.95 | 0.95 | NA | NA | NA | NA
Carvajal (2018) [24]RFTree Ensemble5/19INTNANANANANANANA
Chen (2018) [25]LASSOClassical ML73/73INT1 w 0.88; 2 w 0.86; 4 w 0.82; 8 w 0.78; 12 w 0.76NANANANANANA
Chen (2024) [84]LSTM + SHAPDeep Learning7/17INTNANANANANANANA
Chen (2025) [85]LSTMDeep Learning4/4INTNAMean threshold, mean + 2SD threshold = Manaus 0.76, 0.60, Belém 0.08, 0.00, Fortaleza 0.75, 0.00, Salvador 0.71, 0.73, Brasília 0.92, 1.00, Goiânia 0.69, 0.57, Belo Horizonte 0.73, 0.72, Rio de Janeiro 0.88, 0.88, São Paulo 0.90, 0.88, Curitiba 0.85, 0.78Mean threshold, mean + 2SD threshold = Manaus 1.00, 1.00, Belém 0.90, 1.00, Fortaleza 0.98, 1.00, Salvador 0.33, 0.71, Brasília 1.00, 0.95, Goiânia 0.95, 0.98, Belo Horizonte 1.00, 1.00, Rio de Janeiro, 1.00, São Paulo 1.00, 1.00, Curitiba 1.00, 0.94NANAMean threshold, mean + 2SD threshold = Manaus 0.92, 0.88, Belém 0.69, 0.96, Fortaleza 0.96, 0.96, Salvador 0.69, 0.73, Brasília 0.94, 0.98, Goiânia 0.88, 0.92, Belo Horizonte 0.79, 0.79, Rio de Janeiro 0.88, 0.92, São Paulo 0.90, 0.88, Curitiba 0.90, 0.83Mean threshold, mean + 2SD threshold = Manaus 0.87, 0.75, Belém 0.11, 0.00, Fortaleza 0.75, 0.00, Salvador 0.81, 0.83, Brasília 0.96, 0.98, Goiânia 0.75, 0.67, Belo Horizonte 0.85, 0.84, Rio de Janeiro 0.94, 0.94, São Paulo 0.95, 0.94, Curitiba 0.93, 0.86
Cheng (2025) [26]Feature selection: Regression + fuzzy c-means + IHLOA; Classificators: SVM, KNN, RFHybrid/Superensemble3/13 (Zhejiang), 9/13 (Guangdong)INTNANANANANAGuangdong SVM 0.96, Guangdong KNN 0.96, Guangdong RF 0.96, Zhejiang SVM 0.96, Zhejiang KNN 0.96, Zhejiang RF 0.96Guangdong SVM 0.96, Guangdong KNN 0.96, Guangdong RF 0.96, Zhejiang SVM 0.96, Zhejiang KNN 0.96, Zhejiang RF 0.96
Chowdhury (2025) [27]ANN, XGBHybrid/Superensemble4/7INT 10-fold CVNANANANANANANA
Conde-Gutiérrez (2024) [104]ANNClassical ML5/5INTNANANANANANANA
da Silva (2022) [86]RFTree Ensemble44/44INTNANANANANANANA
da Silva (2025) [87]RFTree Ensemble2/2INT 70/30 temporal splitNANANANANANANA
Dala (2021) [29]Backpropagation NNClassical ML4/4INTNANANANANANANA
Dang Anh Tuan (2025) [112]GLM + XGB, LSTMHybrid/SuperensembleNANANANANANANA0.80–0.90NA
Dhaked (2025) [30]1D-CNNDeep Learning4/4INT 80/20 splitNANANANANANANA
Doni (2020) [31]LSTMDeep Learning6/6INTNANANANANACases overall: 0.89; deaths overall: 0.81NA
Edussuriya (2021) [32]LSTM + Grey Wolf OptimizerDeep Learning4/3INT TS split train 2010–2018; test January–March 2019NANANANANANANA
Farooq (2022) [91] | XGB + SHAP | Tree Ensemble | 57/57 | INT | 2018 0.97; 2019 0.93 | 2018 0.86; 2019 0.69 | 2018 0.95; 2019 0.93 | NA | NA | NA | NA
Ferdousi (2021) [88]GRU, LSTMDeep Learning12/12INTNANANANANANANA
Francisco (2024) [33]Hybrid ML (CIF, RF, GAM, ANN, SVM/SVR, XGB)Hybrid/Superensemble8–30/8–30INT TS split train 2009–2012; test 2013GAM 0.69, RF 0.79, CIF 0.79, SVM 0.75, ANN 0.71, XGB 0.79NANANANAGAM 0.49, RF 0.59, CIF 0.77, SVM 0.57, ANN 0.51, XGB 0.59NA
Guo (2017) [34]SVRClassical ML5/12INTNANANANANA>0.90NA
Hamlet (2021) [89]BRTTree Ensemble18/18INT SB-CV (~200 bootstraps)0.93 (95% CI: 0.90–0.96)NANANANANANA
Handari (2021) [35]LSTMDeep Learning4/4INTNANANANANANANA
Holcomb (2023) [105]RF, NNHybrid/Superensemble12/>20INT temporal split train 2015–2019; test 2020–2021; LOOCV by year/stateNANANANANANANA
Husin (2016) [36]GANNOther/HeuristicNANANANANANANANANA
Islam (2024) [37]LSTMDeep Learning1/1INT hold-out TSNANANANANA0.71NA
Ismail (2022) [38] | RF | Tree Ensemble | 13/13 | INT 10-fold CV | 0.98 | 0.97 | 0.96 | NA | NA | 0.95 | NA
Javaid (2023) [39] | RF | Tree Ensemble | 23/16 (18 after preprocessing) | INT random 80/20 split; k-fold CV | NA | 0.97 | NA | 0.96 | NA | 0.94 | 0.97
Jayabalan (2024) [40] | GB | Tree Ensemble | 3/3 | INT 70/30 train-test split | Bangkok 0.97; Bangladesh 0.98 | Bangkok 0.98; Bangladesh 0.96 | NA | Bangkok 0.96; Bangladesh 0.99 | NA | Bangkok 0.97; Bangladesh 0.98 | Bangkok 0.96; Bangladesh 0.98
Kerdprasop (2020) [41]ANFISOther/Heuristic3/8EXTNANANANANANANA
Kesorn (2015) [43] | SVM with kernel RBF (SVM-R) | Classical ML | 9/9 | INT 10-fold CV | NA | 0.94 | 0.94 | NA | NA | 0.88 | NA
Kiang (2021) [44]LASSOClassical MLNA/77INTNANANANANANANA
Koh (2018) [45]NN(AR(2)) with rainfallTime-series/Statistical2/4INTNANANANANANANA
Koplewitz (2022) [90]RFTree Ensemble10–15/NAINT TS split train 2010–2015; test 2016 rolling OOSNANANANANANANA
Kukkar (2024) [46] | WRF | Mechanistic | NA | INT 10-fold CV | NA | 0.88 | 0.95 | 0.85 | NA | 0.94 | 0.86
Kumar Dey (2022) [47]SVRClassical ML4/4INT 80/20 train-test; CV on same datasetNANANANANA0.75NA
Kuo (2024) [48] | RF | Tree Ensemble | 121/121 | INT 10-fold CV | 0.95 | 0.97 | 0.73 | NA | NA | 0.87 | NA
Laureano Rosario (2018) [106]ANNClassical ML9/12INTPuerto Rico < 24 y: 0.91; Puerto Rico < 5 & 65 y: 0.71; Mexico < 24 y: 0.88; Mexico < 5 & 65 y: 0.90NANANANAPuerto Rico < 24 y: 0.47; Puerto Rico < 5 & 65 y: 0.58; Mexico < 24 y: 0.51; Mexico < 5 & 65 y: 0.66Puerto Rico < 24 y: 0.97; Puerto Rico <5 & 65 y: 0.81; Mexico < 24 y: 0.80; Mexico < 5 & 65 y: 0.73
Li (2022) [115]LSTMDeep Learning7/7INTNANANANANANANA
Li (2022) [91]LSTM, LSTM + AttentionDeep Learning6/6INT TS splitNANANANANANANA
Liu (2016) [49]CARTClassical ML1/2INT 10-fold CVNAGuangzhou 0.87; Zhongshan 0.96Guangzhou 0.92; Zhongshan 0.94NANAGuangzhou 0.92; Zhongshan 0.95NA
Liu (2020) [50]LSTMDeep Learning144/144INTNANANANANANANA
Long (2025) [113]RF, XGB, SVR, MLPHybrid/Superensemble28/28INT 4-fold CVNANANANANANANA
Lu (2025) [51]MLR, LSTM, SI-SIRHybrid/Superensemble5/5INT temporal validation train 2014–2018; test 2019–2020NANANANANANANA
Majeed (2023) [53]LSTMDeep Learning9/9INT vs. benchmark modelNANANANANANANA
Majeed (2025) [54]LSTMDeep Learning8–10/8–10INTNANANANANANANA
Majeed2 (2023) [52]LSTMDeep Learning9/9INT vs. benchmark modelNANANANANANANA
Mayrose (2024) [55] | MobileNetV3Small | Deep Learning | 12/20 | INT train/val/test split | 0.98 ± 0.01 | 0.97 ± 0.03 | 0.99 ± 0.01 | 0.99 ± 0.01 | NA | 0.98 ± 0.01 | 0.98 ± 0.01
Mills (2025) [92] | Median Ensemble | Hybrid/Superensemble | 6/6 | INT | 0.88 | 0.82 | 0.94 | NA | NA | NA | NA
Mobin (2025) [56]DT, RF, GB, XGB, SVR, KNN (Daily dataset)Hybrid/Superensemble78–86/78–86INT 5-fold TSCV on train 80%; independent hold-out test 20%NANANANANADT: 0.93; RF: 0.96; XGB: 0.93; GB: 0.92; SVR: 0.90; KNN: 0.89NA
Muhamad Krishnan (2022) [57] | ANN | Classical ML | 4/4 | INT | NA | 0.99 | 0.01 | NA | NA | 0.69 | NA
Mulwa (2024) [109] | XGB | Tree Ensemble | 5/5 | INT 80/20 split; 5-fold CV for tuning | 0.89 | 0.99 | NA | 0.99 | NA | 1.00 | NA
Mussumeci (2020) [93]LASSO, LSTM, RFHybrid/Superensemble6/6INT TS split train January 2010–June 2017; val/test July 2017–June 2018NANANANANANANA
Mustaffa (2024) [58]NNARTime-series/Statistical1/1INT train/test splitNANANANANANANA
Necesito (2021) [59]LSTMDeep Learning1/1INTNANANANANANANA
Ningrum (2024) [20]ETC (best model), CatBoost, XGB, LightGBM, LSTM, CBR, GB, OMP, Huber RegressorHybrid/SuperensembleNAINT 80/20 train-test splitETC: 0.95ETC: 0.61NAETC: 0.89NAETC: 0.89ETC: 0.72
Olmoguez (2019) [60]RFTree Ensemble2/8INTNANANANANANANA
Ong (2018) [61]RFTree Ensemble8/8EXT temporal validation train 2006–2013; test 2014–2016NANANANANANANA
Ong (2023) [62]LR, DT, RF, SVM, NB, XGB, AdaBoost + BorutaHybrid/Superensemble7/8INT train/val splitML and Boruta feature selection: LR 0.79, DT 0.65, RF 0.75, SVM 0.82, XGB 0.72, AdaBoost 0.62NANANANANANA
Panja (2023) [114]XEWNetDeep Learning2/2INTNANANANANANANA
Patra (2025) [63]CNN + BiLSTMDeep Learning1/1INT train/test splitNANANANANANANA
Puengpreedaa (2020) [64]RF, AdaBoost, ETC, LASSOHybrid/SuperensembleNAINTNANANANANANANA
Rahman (2025) [65] | XGB, LightGBM | Tree Ensemble | 18/18 | INT 10-fold CV; independent test set | XGBoost 0.89, LightGBM 0.84 | LightGBM 0.96 | LightGBM 0.98 | LightGBM 0.97, XGBoost 0.95 | LightGBM 0.98, XGBoost 0.96 | LightGBM 0.97, XGBoost 0.95 | LightGBM 0.96, XGBoost 0.95
Ren (2024) [66] | RF | Tree Ensemble | 11/11 | INT 5-fold CV on train 2003–2018; independent temporal validation 2019–2022 | 0.92 | NA | NA | NA | NA | 0.95 | NA
Roster (2023) [94]RF, GB, SVR, MLPHybrid/Superensemble9/9INT temporal expanding-window CVNANANANANANANA
Salami (2020) [108]PLS, glmnet, RF, XGBHybrid/Superensemble17/17INT 70/30 train-test; 5 × 10-fold CV on trainPLS: 0.88 (95% CI: 0.86–0.90); glmnet: 0.89 (95% CI: 0.87–0.91); RF: 0.97 (95% CI: 0.96–0.98); XGB: 0.97 (95% CI: 0.96–0.98)PLS: 0.75 (95% CI: 0.71–0.78); glmnet: 0.79 (95% CI: 0.76–0.82); RF: 0.89 (95% CI: 0.87–0.91); XGB: 0.88 (95% CI: 0.86–0.91)PLS: 0.84 (95% CI: 0.83–0.84); glmnet: 0.93 (95% CI: 0.92–0.93); RF: 0.93 (95% CI: 0.92–0.93); XGB: 0.94 (95% CI: 0.94–0.95)PLS: 0.70; glmnet: 0.83; RF: 0.90; XGB: 0.93PLS:0.88; glmnet: 0.91; RF: 0.94; XGB: 0.96PLS: 0.84; glmnet: 0.89; RF: 0.92; XGB: 0.95PLS: 0.76; glmnet: 0.81; RF: 0.91; XGB: 0.90
Salim (2021) [67] | RF, SVM, ANN | Hybrid/Superensemble | 5/5 | INT 70/30 hold-out | Epidemiological only: RF 0.80, SVM 0.75, ANN 0.70–0.72; Epidemiological + Climatic: RF 0.88–0.90, SVM 0.82–0.85, ANN 0.75–0.80 | Epidemiological only: RF 0.75–0.80, SVM 0.70–0.75, ANN 0.75–0.85; Epidemiological + Climatic: RF 0.85–0.90, SVM 0.75–0.85, ANN 0.70–0.80 | Epidemiological only: RF 0.75–0.80, SVM 0.70–0.75, ANN 0.68–0.72; Epidemiological + Climatic: RF 0.85–0.90, SVM 0.75–0.85, ANN 0.70–0.80 | NA | NA | Epidemiological only: RF 0.85–0.88, SVM 0.78, ANN 0.72–0.75; Epidemiological + Climatic: RF 0.85–0.88, SVM 0.80–0.85, ANN 0.75–0.80 | NA
Salsabiila (2025) [68] | CNN-BiGRU + Attention | Deep Learning | 4/4 | INT 80/20 temporal split; CV for ablation | NA | 0.79 | NA | 0.88 | NA | 0.74 | 0.82
Sánchez López (2023) [95] | SVM | Classical ML | 10/10 | INT | 0.96 | 0.97 | NA | NA | NA | 0.97 | 0.97
Sanchez-Gendriz (2022) [96]LSTMDeep Learning2/2INT chronological split 2016–2018 train; 2019 test; 30 runsNANANANANANANA
Sebastianelli (2024) [97]CatBoost, SVM, LSTM, RFHybrid/Superensemble20–40/42INT temporal validation train 2001–2016; test 2017–2019NANANANANANANA
Shaikh (2023) [69]Optimized Ensemble (CNN + ANN + SVM, NC-DEFO)Hybrid/Superensemble20/NAINT validationNANANANANANANA
Shi (2016) [70]LASSOClassical ML60/226INT CVNANANANANANANA
Siddikur Rahman (2025) [71]RF, XGB, LightGBM + SHAPTree Ensemble22/22INTNANANANANANANA
Soliman (2020) [98]DFFN (deep feed-forward neural network)Deep Learning7/7EXTNANANANANANANA
Sood (2020) [116] | Naive Bayesian Network (NBN) | Classical ML | 17/17 | INT 10-fold CV | NA | 0.93 | 0.93 | 0.92 | NA | 0.93 | 0.92
Souza (2022) [99]Diffusion Maps + SVM (RBF)Classical ML2/2INTNANANANANAAracaju 1.00; Belo Horizonte 0.80; Manaus 0.80; Recife 0.20; Rio de Janeiro 1.00; Salvador 0.80; São Luís 1.00. Mean: 0.80 ± 0.20NA
Stavelin (2022) [72]LSTMDeep Learning20/20INT TS rolling forecastNANANANANANANA
Teurlai (2015) [110]SVMClassical ML5/34INTNANANANANANANA
Theodorakos (2017) [100]Differential Evolution (Numerical)Other/HeuristicNAINTNANANANANANANA
Tian (2024) [73]SVM, XGBHybrid/Superensemble19/19INT 80/20 train-test splitNANANANANANANA
Tuan (2024) [74]RF, GB, LSTMHybrid/Superensemble13/13INT cross-sectional + TSNANANANANANANA
Wu (2021) [75]SVR, RF, ANN (MLP)Hybrid/Superensemble12/12INT chronological split 83/17NANANANANANANA
Yamana (2016) [107]Superensemble: F1 (SIR-EAKF), F2 (Bayesian weighted outbreaks), F3 (historical likelihood)Hybrid/SuperensembleNAEXTNANANANANANANA
Yavari Nejad (2021) [76]Bayes Net (BN) + TRFClassical ML6/7INT 10-fold CV (WEKA 3.8)NANANANANABayes Net + TRF 0.92; without TRF 0.91NA
Yeh (2025) [77]ARDL + LSTM, SVR, MLP, GRNN, RBF, GMDH, GEPHybrid/Superensemble8/8 (+lags)INT train/test splitNANANANANANANA
Yi (2023) [78]Hybrid NN + RNN + EnKF superensemble (PICTUREE-Aedes)Hybrid/SuperensembleNANANANANANANANANA
Zhao (2020) [101]RFTree Ensemble25–30/25–30INTNANANANANANANA
Zhao (2023) [79]CNN-BiLSTMDeep Learning3/3INTNANANANANA1 w 0.88; 2 w 0.85; 3 w 0.81; 4 w 0.78NA
AUC = Area Under the ROC Curve; AdaBoost = Adaptive Boosting; ANFIS = Adaptive Neuro-Fuzzy Inference System; AR(2) = autoregressive model of order 2; ARDL = Autoregressive Distributed Lag; BN = Bayesian Network; Boruta = Boruta feature selection; BRT = Boosted Regression Trees; CART = Classification and Regression Tree; CatBoost = Categorical Gradient Boosting; CBR = CatBoost Regressor; CIF = Conditional Inference Forest; CNN = Convolutional Neural Network; CNN-BiLSTM = Convolutional Neural Network + Bidirectional LSTM; DFFN = Deep Feed-Forward Network; DT = decision tree; EAKF = Ensemble Adjustment Kalman Filter; EnKF = Ensemble Kalman Filter; ETC = Extra Trees Classifier; EXT = external validation; F1-score = harmonic mean of precision and recall; GAM = Generalized Additive Model; GB = Gradient Boosting; GANN = Genetic Algorithm Neural Network; GEP = Gene Expression Programming; glmnet = Elastic-Net Regularized GLMs; GMDH = Group Method of Data Handling; GRNN = General Regression Neural Network; GRU = Gated Recurrent Unit; Huber Regressor = robust regression with Huber loss; IHLOA = Improved Harris Hawks/Heuristic Learning Optimization Algorithm (feature selection); INT = internal validation; KNN = k-Nearest Neighbours; LASSO = Least Absolute Shrinkage and Selection Operator; LightGBM = Light Gradient Boosting Machine; LOOCV = Leave-One-Out Cross-Validation; LR = Logistic Regression; LSTM = Long Short-Term Memory; LSTM + attention = LSTM with attention mechanism; MLP = Multilayer Perceptron; MobileNetV3Small = lightweight CNN architecture; N = Number; NARX = Nonlinear Autoregressive Network with Exogenous Inputs; NB = Naïve Bayes; NBN = Naïve Bayesian Network; NN = Neural Network; NNAR = Neural Network Autoregression; NPV = Negative Predictive Value; OMP = Orthogonal Matching Pursuit; OOS = Out-of-Sample; PPV (Precision) = Positive Predictive Value; PLS = Partial Least Squares; RF = Random Forest; RBF = Radial Basis Function (kernel); SB-CV = Spatial Block Cross-Validation; SHAP = SHapley Additive exPlanations; SI-SIR = Susceptible–Infectious/Susceptible–Infectious–Recovered (compartmental model); SIR-EAKF = SIR model with Ensemble Adjustment Kalman Filter; SVM = Support Vector Machine; SVR = Support Vector Regression; SWR = Short-Wave Radiation; SST = Sea Surface Temperature; TSCV = Time-Series Cross-Validation; w = Week; WRF = Weather Research and Forecasting.
Table 3. Results of regression metrics for the principal AI models used in the included studies.
First Author (Year) | Principal AI Model | AI Category | Metric Scale (Unit–Temporal–Spatial) | MAE | RMSE | MSE | MAPE | SMAPE | R2 | r
Akhtar (2019) [102]NARX NNClassical MLNANANANANANANANA
Al Mobin (2024) [18] | DT + Sequential Squeeze FS | Classical ML | cases–monthly–national | 4759.06 | 9296.35 | NA | 0.94 | NA | NA | NA
Anggraeni (2021) [19]BiLSTMDeep Learningcases–monthly–city levelSurabaya 19.11; Malang 25.73Surabaya 30.11; Malang 28.65NANASurabaya 0.31; Malang 0.18NANA
Anno (2019) [21]CNNDeep LearningNANANANANANANANA
Anno (2024) [22]CNNDeep LearningNANANANANANANANA
Appice (2020) [103]AutoTiC-NNClassical MLcases–monthly–regionalNA32 states: 13; 17 active states: 7NANANANANA
Baquero (2018) [80]GAM, ANN (MLP), LSTMHybrid/Superensemblecases–monthly–city levelNAGAM 2152; Ensemble 3164; MLP 4422NANANANANA
Benedum (2020) [111]RFTree Ensemblecases–weekly–city level6.3NANANANANANA
Bogado (2023) [81]LSTMDeep LearningNANANANANANANANA
Bomfim (2020) [82]NNClassical MLNANANANANANANANA
Buebos-Esteve (2024) [23]RFTree Ensemblecases–10 days–regionalRegional incidence: rfsrc 32.55; ranger 74.56; rf 40.19; ensbl 43.76. Yearly incidence: rfsrc 39.56; ranger 41.24; rf 41.97; ensbl 39.82. Regional mortality: rfsrc 0.79; ranger 0.82; rf 0.73; ensbl 1.36NARegional incidence: rfsrc 2414.53; ranger 11,018.55; rf 2539.72; ensbl 2445.02. Yearly incidence: rfsrc 4314.84; ranger 5210.29; rf 3740.15; ensbl 5430.59. Regional mortality: rfsrc 2.78; ranger 2.77; rf 2.04; ensbl 69.41NANANANA
Campbell (2015) [83]DTClassical MLNANANANANANANANA
Carvajal (2018) [24] | RF | Tree Ensemble | per 1000 population–weekly–city level | 0.15 | 0.21 | NA | NA | NA | NA | NA
Chen (2018) [25]LASSOClassical MLNANANANANANANANA
Chen (2024) [84]LSTM + SHAPDeep Learningcases–monthly–nationalTop3/Worst3 out of 27 = 1 month: Top Roraima 6.36; Amapá 27.45; Sergipe 41.52; Worst Espírito Santo 6300.78; Minas Gerais 5088.71; Paraná 4450.76; 3 months: Top Roraima 5.70; Amapá 37.99; Sergipe 44.62; Worst Minas Gerais 28,714.48; Santa Catarina 23,381.95; Espírito Santo 20,177.49Top3/Worst3 out of 27 = 1 month: Top Santa Catarina 15.21; Ceará 15.51; Pernambuco 15.84; Worst Rio Grande do Sul 56.30; Roraima 44.31; Rondônia 40.93; 3 months: Top Sergipe 18.62; Roraima 20.76; Pernambuco 22.94; Worst Rio Grande do Sul 826.28; São Paulo 570.01; Santa Catarina 418.37NANANANANA
Chen (2025) [85]LSTMDeep Learningcases–weekly–city levelManaus 75.45, Belém 23.98, Fortaleza 247.27, Salvador 228.55, Brasília 1067.66, Goiânia 439.02, Belo Horizonte 1483.27, Rio de Janeiro 819.73, São Paulo 1102.75, Curitiba 65.99NANAManaus 29.95, Belém 29.28, Fortaleza 22.59, Salvador 23.95, Brasília 22.12, Goiânia 23.26, Belo Horizonte 22.47, Rio de Janeiro 21.87, São Paulo 22.18, Curitiba 25.33NANANA
Cheng (2025) [26]Feature selection: Regression + fuzzy c-means + IHLOA; Classificators: SVM, KNN, RFHybrid/SuperensembleNANANANANANANANA
Chowdhury (2025) [27]ANN, XGBHybrid/Superensemblecases–monthly–nationalANN 1260.98; XGB 479.44ANN 2229.66; XGB 918.83NAANN 1.92; XGB 2.25NANANA
Conde-Gutiérrez (2024) [104]ANNClassical MLcases–weekly–regionalNANon-severe dengue 0.26; Dengue with warning signs 0.17; Severe dengue 0.04NANANANon-severe dengue 0.97; Dengue with warning signs 0.98; Severe dengue 0.81NA
da Silva (2022) [86]RFTree Ensemblecases–bimonthly–city levelNABimonthly (b1–b6): 2014: 4.67, 5.57, 3.79, 4.51, 3.24, 2.34; 2015: 9.19, 4.44, 2.97, 5.21, 5.12, 5.99; 2016: 4.15, 3.88, 4.38, 3.15, 3.94, 3.30NANANANANA
da Silva (2025) [87]RFTree Ensemblecases–weekly–city levelNatal D 57.8–71.8; Natal CD 97.9. Iquitos CD 2.78–4.16; Iquitos D 4.02. Barranquilla HD 6.09–6.67; Barranquilla CD 7.81NANANANANANatal D 0.92–0.95; Natal CD 0.90. Iquitos CD 0.85–0.89; Iquitos D 0.81. Barranquilla HD 0.94–0.95; Barranquilla CD 0.92
Dala (2021) [29]Backpropagation NNClassical MLNANANANANANANANA
Dang Anh Tuan (2025) [112]GLM + XGB, LSTMHybrid/SuperensembleNANANANANANANANA
Dhaked (2025) [30] | 1D-CNN | Deep Learning | cases–monthly–city level | 31.49 | 56.45 | 3187.43 | NA | NA | NA | NA
Doni (2020) [31]LSTMDeep LearningNANANANANANANANA
Edussuriya (2021) [32]LSTM + Grey Wolf OptimizerDeep Learningcases–monthly–district levelNAWithout GWO 25.45; GWO: 20.45; final model: 10.84NANANANANA
Farooq (2022) [91]XGB + SHAPTree EnsembleNANANANANANANANA
Ferdousi (2021) [88]GRU, LSTMDeep Learningper 100,000 population–weekly–district levelGRU 0.34 ± 0.02; LSTM 0.36 ± 0.01NANANANANANA
Francisco (2024) [33]Hybrid ML (CIF, RF, GAM, ANN, SVM/SVR, XGB)Hybrid/Superensemblecases–weekly–city levelNANANANANANANA
Guo (2017) [34]SVRClassical MLcases–weekly–provincialNAGuangzhou 16.26, Foshan 1.05, Zhongshan 0.35, Zhuhai 0.57, Shenzhen 0.80, Other cities 0.27NANANAGuangzhou 0.99, Foshan 0.99, Zhongshan 0.99, Zhuhai 0.99, Shenzhen 0.99, Other cities 0.99NA
Hamlet (2021) [89]BRTTree EnsembleNANANANANANANANA
Handari (2021) [35]LSTMDeep Learningcases–weekly–district levelNAWest 10.13, South 5.63, East 9.58, North 5.34, Central 4.79NANANANANA
Holcomb (2023) [105]RF, NNHybrid/Superensemblecases–annual–nationalRF 21.30; NN 22.70RF 30.10; NN 31.60NANANANANA
Husin (2016) [36]GANNOther/Heuristiccases–weekly–district levelNANASepang 0.07; Hulu Selangor 0.06; Hulu Langat 0.07; Klang 0.06; Kuala Selangor 0.06NANANANA
Islam (2024) [37] | LSTM | Deep Learning | cases–monthly–national | 301.64 | 414.23 | NA | 28.78 | NA | NA | NA
Ismail (2022) [38]RFTree EnsembleNANANANA5.46; after removal of entomological data 8.32NANANA
Javaid (2023) [39]RFTree EnsembleNANANANANANANANA
Jayabalan (2024) [40]GBTree EnsembleNANANANANANANANA
Kerdprasop (2020) [41] | ANFIS | Other/Heuristic | cases–monthly–city level | 151.51 | 216.54 | NA | NA | NA | NA | 0.83
Kesorn (2015) [43]SVM with kernel RBF (SVM-R)Classical MLNANANANANANANANA
Kiang (2021) [44]LASSOClassical MLcases–monthly–provincialLASSO: Bangkok, 1-month ahead 423.7NANANANANANA
Koh (2018) [45]NN(AR(2)) with rainfallTime-series/Statisticalcases–weekly–city levelNANANANANANANA
Koplewitz (2022) [90] | RF | Tree Ensemble | cases–weekly–city level | NA | 1 w 11.03; 3 w 17.62; 6 w 22.06; 8 w 23.36 | NA | NA | NA | 1 w 0.85; 3 w 0.62; 6 w 0.40; 8 w 0.34 | 1 w 0.93; 3 w 0.80; 6 w 0.67; 8 w 0.60
Kukkar (2024) [46]WRFMechanisticNANANANANANANANA
Kumar Dey (2022) [47] | SVR | Classical ML | cases–monthly–city level | 4.95 | NA | NA | NA | NA | NA | NA
Kuo (2024) [48]RFTree EnsembleNANANANANANANANA
Laureano Rosario (2018) [106]ANNClassical MLNANANANANANANANA
Li (2022) [115]LSTMDeep Learninglog(cases)–weekly–regionalTest 2018–2019: 1 w 0.27; 2 w 0.27; 3 w 0.27; 4 w 0.26; 5 w 0.27; 6 w 0.31; 7 w 0.30; 8 w 0.29; 9 w 0.29; 10 w 0.31; 11 w 0.27; 12 w 0.33. Test 2019 peak (January–August): 1 w 0.20; 2 w 0.19; 3 w 0.20; 4 w 0.21; 5 w 0.19; 6 w 0.21; 7 w 0.22; 8 w 0.23; 9 w 0.27; 10 w 0.22; 11 w 0.23; 12 w 0.28Test 2018–2019: 1 w 0.35; 2 w 0.34; 3 w 0.34; 4 w 0.35; 5 w 0.34; 6 w 0.40; 7 w 0.37; 8 w 0.38; 9 w 0.38; 10 w 0.39; 11 w 0.34; 12 w 0.40. Test 2019 peak (January–August): 1 w 0.23; 2 w 0.22; 3 w 0.25; 4 w 0.25; 5 w 0.22; 6 w 0.26; 7 w 0.28; 8 w 0.29; 9 w 0.32; 10 w 0.28; 11 w 0.28; 12 w 0.33NANANANANA
Li (2022) [91]LSTM, LSTM + AttentionDeep Learninglog(cases)–weekly–regionalFederal District LSTM w/o cases: 1 w 0.53; 2 w 0.56; 3 w 0.50; 4 w 0.50; LSTM with cases: 1 w 0.42; 2 w 0.41; 3 w 0.40; 4 w 0.46; LSTM-ATT w/o cases: 1 w 0.53; 2 w 0.49; 3 w 0.46; 4 w 0.47; LSTM-ATT with cases: 1 w 0.42; 2 w 0.38; 3 w 0.40; 4 w 0.43; Fortaleza LSTM w/o cases: 1 w 0.44; 2 w 0.47; 3 w 0.45; 4 w 0.43; LSTM with cases: 1 w 0.35; 2 w 0.35; 3 w 0.40; 4 w 0.44; LSTM-ATT w/o cases: 1 w 0.41; 2 w 0.44; 3 w 0.45; 4 w 0.43; LSTM-ATT with cases: 1 w 0.26; 2 w 0.34; 3 w 0.33; 4 w 0.39Federal District LSTM w/o cases: 1 w 0.70; 2 w 0.73; 3 w 0.66; 4 w 0.66; LSTM with cases: 1 w 0.53; 2 w 0.52; 3 w 0.50; 4 w 0.56; LSTM-ATT w/o cases: 1 w 0.66; 2 w 0.68; 3 w 0.61; 4 w 0.61; LSTM-ATT with cases: 1 w 0.53; 2 w 0.46; 3 w 0.49; 4 w 0.51; Fortaleza LSTM w/o cases: 1 w 0.55; 2 w 0.57; 3 w 0.59; 4 w 0.55; LSTM with cases: 1 w 0.42; 2 w 0.44; 3 w 0.50; 4 w 0.56; LSTM-ATT w/o cases: 1 w 0.51; 2 w 0.53; 3 w 0.57; 4 w 0.55; LSTM-ATT with cases: 1 w 0.33; 2 w 0.46; 3 w 0.43; 4 w 0.51NANANANANA
Liu (2016) [49]CARTClassical MLcases–weekly–city levelNA1 w Guangzhou 3.22, Zhongshan 0.37; 1–3 w Guangzhou 3.72, Zhongshan 0.38NANANANANA
Liu (2020) [50]LSTMDeep Learningcases–weekly–district levelNANANANANANANA
Long (2025) [113] | RF, XGB, SVR, MLP | Hybrid/Superensemble | cases–annual–national | NA | RF 0.42; XGB 0.46; MLP 0.53; SVR 0.61 | RF 0.18; XGB 0.21; MLP 0.28; SVR 0.37 | NA | NA | RF 0.84; XGB 0.82; MLP 0.75; SVR 0.68 | NA
Lu (2025) [51]MLR, LSTM, SI-SIRHybrid/Superensemblecases–weekly–nationalPre-lockdown 204.36; During lockdown 434.02NANAPre-lockdown 13.97; During lockdown 87.03; Extended validation 13.12–17.09NANANA
Majeed (2023) [53]LSTMDeep Learningcases–weekly–nationalNABest/Worst by look-back = 1 m Best SA-LSTM (Climate/time/geography) 3.27; Worst SA-LSTM (Climate) 6.77; 2 m Best A-LSTM (Climate/time/geography) 3.10; Worst LSTM (Climate/time) 5.01; 3 m Best SA-LSTM (Climate/time/geography) 4.56; Worst S-LSTM (Climate/time) 6.32; 4 m Best SA-LSTM (Climate) 3.01; Worst S-LSTM (Climate/time) 4.88; 5 m Best SA-LSTM (Climate/time/geography) 3.37; Worst LSTM (Climate) 6.69; 6 m Best SA-LSTM (Climate/time/geography) 4.32; Worst S-LSTM (Climate) 7.44NANANANANA
Majeed (2025) [54]LSTMDeep Learningcases–weekly–nationalNAST-SLSTM 2.66 ± 0.57; ST-LSTM 3.61 ± 0.57; SSA-LSTM 3.17 ± 0.41; STA-LSTM 3.67 ± 0.60; TA-LSTM 3.66 ± 0.63; SA-LSTM 3.87 ± 0.58; S-LSTM 4.13 ± 0.59; Plain LSTM 4.15 ± 0.61NANANANANA
Majeed2 (2023) [52]LSTMDeep Learningcases–monthly–nationalNALSTM: 4.15 ± 0.61; S-LSTM: 4.13 ± 0.59; TA-LSTM: 4.13 ± 0.59; STA-LSTM: 3.67 ± 0.60; SA-LSTM: 3.87 ± 0.58; SSA-LSTM (stacked + spatial attention): 3.17 ± 0.41NANANANANA
Mayrose (2024) [55]MobileNetV3SmallDeep LearningNANANANANANANANA
Mills (2025) [92]Median EnsembleHybrid/Superensembleper 100,000 population–monthly–provincialNA0.81NANANA0.74NA
Mobin (2025) [56]DT, RF, GB, XGB, SVR, KNN (Daily dataset)Hybrid/Superensemblecases–monthly–nationalRF 90; DT 114; XGB 121; GB 132; SVR 147; KNN 161RF 176; DT 225; XGB 240; GB 260; SVR 290; KNN 320NARF 3.6; DT 4.5; XGB 5.0; GB 5.4; SVR 5.9; KNN 6.3NANANA
Muhamad Krishnan (2022) [57]ANNClassical MLNANANANANANANANA
Mulwa (2024) [109]XGBTree EnsembleNANANANANANANANA
Mussumeci (2020) [93]LASSO, LSTM, RFHybrid/SuperensembleNANANANANANANANA
Mustaffa (2024) [58]NNARTime-series/Statisticalcases–weekly–nationalNA597.74NA94.84NANANA
Necesito (2021) [59]LSTMDeep Learningcases–monthly–city levelNA2016: 32.14; 2017: 38.41; 2018: 28.06NANANANA2016: 0.58; 2017: 0.82; 2018: 0.92
Ningrum (2024) [20] | ETC (best model), CatBoost, XGB, LightGBM, LSTM, CBR, GB, OMP, Huber Regressor | Hybrid/Superensemble | cases–weekly–district level | 0.63 | 1.09 | 1.20 | NA | NA | 0.56 | NA
Olmoguez (2019) [60]RFTree EnsembleNANANANANANA0.73NA
Ong (2018) [61]RFTree EnsembleNANANANANANANANA
Ong (2023) [62]LR, DT, RF, SVM, NB, XGB, AdaBoost + BorutaHybrid/SuperensembleNANANANANANANANA
Panja (2023) [114]XEWNetDeep Learningper 10,000 population–weekly–regionalPuerto Rico 26 w 5.66, 52 w 42.14; Peru 26 w 1.57, 52 w 2.50; India 26 w 2.36, 52 w 6.55Puerto Rico 26 w 7.69, 52 w 68.49; Peru 26 w 1.98, 52 w 4.73; India 26 w 2.04, 52 w 9.98NANANANANA
Patra (2025) [63] | CNN + BiLSTM | Deep Learning | cases–weekly–national | 54.53 | 106.96 | NA | NA | NA | 0.94 | NA
Puengpreedaa (2020) [64]RF, AdaBoost, ETC, LASSOHybrid/Superensemblecases–weekly–provincialChiang Rai h1 10.98 RF, Chiang Rai h2 16.44 RF, Chiang Rai h3 21.27 RF, Chiang Rai h4 25.65 RF, Mukdahan h1 1.61 AdaBoost, Mukdahan h2 1.80 ETC, Mukdahan h3 2.05 LASSO, Mukdahan h4 2.02 RF, Pattani h1 2.83 ETC, Pattani h3 3.20 AdaBoost, Pattani h4 3.37 LASSO, Ayutthaya h3 9.34 RF, Ratchaburi h4 8.56 AdaBoostNAChiang Rai h1 237.98 RF, Chiang Rai h2 543.87 RF, Chiang Rai h3 847.23 RF, Chiang Rai h4 1193.56 RF, Mukdahan h1 5.52 AdaBoost, Mukdahan h2 6.91 ETC, Mukdahan h3 9.18 LASSO, Mukdahan h4 10.12 RF, Pattani h1 17.13 ETC, Pattani h3 18.20 AdaBoost, Pattani h4 22.16 LASSO, Ayutthaya h3 155.30 RF, Ratchaburi h4 116.05 AdaBoostNA Chiang Rai h1 0.92 RF, Chiang Rai h2 0.82 RF, Chiang Rai h3 0.72 RF, Chiang Rai h4 0.61 RF, Mukdahan h1 0.81 AdaBoost, Mukdahan h2 0.76 ETC, Mukdahan h3 0.68 LASSO, Mukdahan h4 0.65 RF, Pattani h1 0.78 ETC, Pattani h3 0.78 AdaBoost, Pattani h4 0.73 LASSO, Ayutthaya h3 0.56 RF, Ratchaburi h4 0.47 AdaBoostNA
Rahman (2025) [65]XGB, LightGBMTree EnsembleNANANANANANALightGBM 0.09, XGBoost 0.84NA
Ren (2024) [66]RFTree EnsembleNANANANANANANANA
Roster (2023) [94]RF, GB, SVR, MLPHybrid/Superensemblecases–monthly–district levelMean, median: train: GB Corr 39.0, 8.8; GB PCMCI 38.7, 8.9; GB OnlyD 38.3, 8.9; GB Clim 41.4, 9.7; MLP Corr 44.0, 12.1; MLP PCMCI 41.7, 12.4; MLP OnlyD 48.2, 13.0; MLP Clim 53.8, 22.7; RF Corr 37.4, 8.9; RF PCMCI 38.8, 8.5; RF OnlyD 37.2, 8.6; RF Clim 42.6, 9.6; SVR Corr 54.6, 15.1; SVR PCMCI 54.6, 15.3; SVR OnlyD 55.3, 14.9; SVR Clim 53.0, 14.8; test: RF OnlyD 53.7, 12.2; City specific best 52.5, 11.9Mean, median: train: GB Corr 68.1, 13.6; GB PCMCI 67.6, 13.9; GB OnlyD 67.3, 13.5; GB Clim 72.5, 15.6; MLP Corr 74.6, 17.7; MLP PCMCI 71.2, 18.7; MLP OnlyD 78.6, 18.3; MLP Clim 89.9, 32.0; RF Corr 67.4, 15.4; RF PCMCI 67.9, 14.2; RF OnlyD 67.3, 15.5; RF Clim 73.5, 15.6; SVR Corr 90.5, 18.0; SVR PCMCI 90.4, 18.8; SVR OnlyD 90.5, 17.7; SVR Clim 91.1, 19.9; val: RF OnlyD 130.4, 25.4; City specific 119.2, 26.5NANANANANA
Salami (2020) [108]PLS, glmnet, RF, XGBHybrid/SuperensembleNANANANANANANANA
Salim (2021) [67]RF, SVM, ANNHybrid/SuperensembleNANANANANANANANA
Salsabiila (2025) [68]CNN-BiGRU + AttentionDeep Learningcases–weekly–city levelTiDE-PSO 45.10TiDE-PSO 75.76NANANANANA
Sánchez López (2023) [95]SVMClassical MLNANANANANANANANA
Sanchez-Gendriz (2022) [96]LSTMDeep LearningNANANANANANANA0.92
Sebastianelli (2024) [97]CatBoost, SVM, LSTM, RFHybrid/Superensembleper 100,000 population–monthly–nationalNARondônia 0.24; Acre 0.28; Amazonas 0.23; Roraima 0.12; Piauí 0.08NANANANANA
Shaikh (2023) [69] | Optimized Ensemble (CNN + ANN + SVM, NC-DEFO) | Hybrid/Superensemble | cases–weekly–city level | 1.05 | 5.73 | NA | NA | 0.04 | NA | NA
Shi (2016) [70]LASSOClassical MLNANANANA1 w: 17 (95% CI: 16–19); 12 w: 24 (95% CI: 22–26)NANANA
Siddikur Rahman (2025) [71]RF, XGB, LightGBM + SHAPTree Ensemblecases–monthly–nationalRF Climate test 0.65, RF Climate training 0.42, RF SocDem test 0.85, RF SocDem training 0.48, RF Landscape test 0.87, RF Landscape training 0.47, XGBoost Climate test 0.51, XGBoost Climate training 0.42, XGBoost SocDem test 0.53, XGBoost SocDem training 0.52, XGBoost Landscape test 0.54, XGBoost Landscape training 0.41, LightGBM Climate test 0.28, LightGBM Climate training 0.24, LightGBM SocDem test 0.46, LightGBM SocDem training 0.41, LightGBM Landscape test 0.47, LightGBM Landscape training 0.34RF Climate test 0.71, RF Climate training 0.67, RF SocDem test 0.79, RF SocDem training 0.56, RF Landscape test 0.78, RF Landscape training 0.65, XGBoost Climate test 0.62, XGBoost Climate training 0.58, XGBoost SocDem test 0.68, XGBoost SocDem training 0.53, XGBoost Landscape test 0.67, XGBoost Landscape training 0.54, LightGBM Climate test 0.36, LightGBM Climate training 0.32, LightGBM SocDem test 0.53, LightGBM SocDem training 0.42, LightGBM Landscape test 0.57, LightGBM Landscape training 0.42NARF Climate test 0.16, RF Climate training 0.15, RF SocDem test 0.17, RF SocDem training 0.14, RF Landscape test 0.15, RF Landscape training 0.12, XGB Climate test 0.13, XGB Climate training 0.12, XGB SocDem test 0.17, XGB SocDem training 0.132, XGB Landscape test 0.16, XGB Landscape training 0.13, LightGBM Climate test 0.09, LightGBM Climate training 0.05, LightGBM SocDem test 0.11, LightGBM SocDem training 0.08, LightGBM Landscape test 0.11, LightGBM Landscape training 0.09NANANA
Soliman (2020) [98] | DFFN (deep feed-forward neural network) | Deep Learning | per 100,000 population–monthly–national | 6.36 | 8.93 | NA | NA | NA | NA | 0.42
Sood (2020) [116]Naive Bayesian Network (NBN)Classical MLNANANANANANANANA
Souza (2022) [99]Diffusion Maps + SVM (RBF)Classical MLNANANANANANANANA
Stavelin (2022) [72]LSTMDeep Learninglog(cases)–monthly–district levelNAUnivariate Slice1 2010–2015 1.20; Slice2 2011–2019 1.30; mean 1.25; SD 0.05; Multivariate 1.13NANANAUnivariate 1.00NA
Teurlai (2015) [110]SVMClassical MLNANANANANANANANA
Theodorakos (2017) [100] | Differential Evolution (Numerical) | Other/Heuristic | cases–monthly–national | 40.18 | 106.30 | 11,869.5 | NA | NA | NA | NA
Tian (2024) [73]SVM, XGBHybrid/Superensemblecases–weekly–nationalXGB (lag + temporal) 89.12; SVM 160.73; XGB (temporal only) 160.65; XGB (no lag/no temporal) 175.49XGB (lag + temporal) 156.07; SVM 268.83; XGB (temporal only) 232.58; XGB (no lag/no temporal) 247.86NANANAXGB (lag + temporal) 0.83; SVM 0.50; XGB (temporal only) 0.49; XGB (no lag/no temporal) 0.42NA
Tuan (2024) [74]RF, GB, LSTMHybrid/Superensemblecases–monthly–provincialRF 232.22; GB 206.60; LSTM 89.15RF 381.52; GB 336.40; LSTM 106.23NANANANANA
Wu (2021) [75]SVR, RF, ANN (MLP)Hybrid/SuperensembleNANANANANANANANA
Yamana (2016) [107]Superensemble: F1 (SIR-EAKF), F2 (Bayesian weighted outbreaks), F3 (historical likelihood)Hybrid/Superensemblecases–weekly–city levelTiming, Peak, Total: SE(F1,F2) 3.3, 21, 473, SE(F1,F2,F3) 3.7, 20, 486NANANANANANA
Yavari Nejad (2021) [76]Bayes Net (BN) + TRFClassical MLNANANANANANANANA
Yeh (2025) [77]ARDL + LSTM, SVR, MLP, GRNN, RBF, GMDH, GEPHybrid/Superensemblecases–monthly–city levelNAKaohsiung (high incidence area): ARDL + SVR 4.25; ARDL + LSTM 4.41; ARDL + GRNN 4.68; ARDL + RBF 4.79; ARDL + MLP logistic 4.83; ARDL + GEP 5.20; ARDL + GMDH 5.62. Tainan (high incidence area): ARDL + GEP 0.76; ARDL + SVR 0.83; ARDL + GRNN 0.84; ARDL + RBF 0.91; ARDL + MLP logistic 0.93; ARDL + LSTM 0.96; ARDL + GMDH 1.08. Taipei (low incidence area): ARDL + MLP logistic 1.46; ARDL + GRNN 1.42; ARDL + SVR 1.69NAKaohsiung (high incidence area): ARDL + SVR 34.3; ARDL + LSTM 35.8; ARDL + GRNN 39.2; ARDL + RBF 39.4; ARDL + MLP logistic 36.6; ARDL + GEP 39.1; ARDL + GMDH 37.9. Tainan (high incidence area): ARDL + GEP 30.1; ARDL + SVR 33.4; ARDL + GRNN 34.7; ARDL + RBF 33.9; ARDL + MLP logistic 32.1; ARDL + LSTM 32.8; ARDL + GMDH 32.5. Taipei (low incidence area): ARDL + MLP logistic 30.8; ARDL + GRNN 37.2; ARDL + SVR 34.4NANANA
Yi (2023) [78]Hybrid NN + RNN + EnKF superensemble (PICTUREE-Aedes)Hybrid/SuperensembleNANANANANANANANA
Zhao (2020) [101]RFTree Ensemblecases–weekly–district level1 w 0.93, 2 w 0.95, 3 w 0.94, 4 w 0.95, 5 w 0.95, 6 w 0.94, 7 w 0.93, 8 w 0.92, 9 w 0.90, 10 w 0.89, 11 w 0.87, 12 w 0.86NANANANANANA
Zhao (2023) [79] | CNN-BiLSTM | Deep Learning | NA | 1 w 41.40; 2 w 53.01; 3 w 65.99; 4 w 79.44 | 1 w 73.30–85.00; 2 w 90.33; 3 w 112.65; 4 w 136.36 | NA | NA | NA | NA | NA
ANFIS = Adaptive Neuro-Fuzzy Inference System; ANN = Artificial Neural Network; AR(2) = autoregressive model of order 2; ARDL = Autoregressive Distributed Lag; BN = Bayesian Network; BRT = Boosted Regression Trees; CBR = Case-Based Reasoning; CD = Dengue with Climate; CNN = Convolutional Neural Network; D = Dengue only; DEFO = Differential Evolution Optimization; DFFN = Deep Feed-Forward Network; DT = decision tree; ENKF = Ensemble Kalman Filter; ETC = Extra Trees Classifier; F1 = SIR-EAKF Superensemble model variant; F2 = Bayesian weighted outbreaks superensemble model variant; F3 = historical likelihood superensemble model variant; FS = feature selection; GAM = Generalized Additive Model; GANN = Genetic Algorithm Neural Network; GB = Gradient Boosting; GEP = Gene Expression Programming; GLM = Generalized Linear Model; GMDH = Group Method of Data Handling; GRNN = General Regression Neural Network; GRU = Gated Recurrent Unit; GWO = Grey Wolf Optimizer; HD = Humidity (Dengue with Humidity); IHLOA = Improved Harris Hawks Optimization Algorithm; KNN = K-Nearest Neighbours; LASSO = Least Absolute Shrinkage and Selection Operator; LightGBM = Light Gradient Boosting Machine; LR = Logistic Regression; LSTM = Long Short-Term Memory; MAE = Mean Absolute Error; MAPE = Mean Absolute Percentage Error; MLP = Multi-Layer Perceptron; MLR = Multiple Linear Regression; MSE = Mean Squared Error; NARX NN = Nonlinear AutoRegressive Network with eXogenous inputs Neural Network; NB = Naive Bayes; NC-DEFO = Novel Combined Differential Evolution Optimization; NN = Neural Network; NNAR = Neural Network Autoregression; OMP = Orthogonal Matching Pursuit; PICTUREE-Aedes = Hybrid NN + RNN + EnKF Superensemble model; PLS = Partial Least Squares; R2 = Coefficient of Determination; RBF = Radial Basis Function; RF = Random Forest; RMSE = Root Mean Squared Error; r = Pearson Correlation Coefficient; SA-LSTM = Spatial Attention Long Short-Term Memory; SE = Superensemble; SHAP = SHapley Additive exPlanations; SI-SIR = Spatially Informed Susceptible–Infectious–Recovered model; S-LSTM = Standard Long Short-Term Memory; SMAPE = Symmetric Mean Absolute Percentage Error; SIR-EAKF = Susceptible–Infectious–Recovered Ensemble Adjustment Kalman Filter; SSA-LSTM = Stacked Spatial Attention Long Short-Term Memory; STA-LSTM = Spatiotemporal Attention Long Short-Term Memory; ST-LSTM = Spatiotemporal Long Short-Term Memory; ST-SLSTM = Spatiotemporal Stacked Long Short-Term Memory; SVR = Support Vector Regression; SVM = Support Vector Machine; TA-LSTM = Temporal Attention Long Short-Term Memory; TRF = Trust Region Framework; w = Week; WRF = Weather Research and Forecasting model; XGB = Extreme Gradient Boosting.