Meta-Analysis on Criteria and Forecasting Models for Potato Late Blight Pathosystem: Assessing Robustness and Temporal Consistency
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
I found this review to be interesting and helpful. I have been interested and involved in forecasting for late blight for some years, and this review was very helpful in collating lots of good information. The comparisons were intriguing. I am not sufficiently versed in the statistical methods for meta-analysis to be able to comment on these. However, on the assumption that they are employed appropriately, I think this manuscript should be published. I found it helpful – perhaps the only review on forecasting that I’ve ever found to be interesting and helpful, and from which I learned things.
Author Response
Thank you very much for the reviewers' valuable comments and suggestions on the manuscript, which have greatly improved its quality. Modifications are marked in red in the text.
Response to questions and comments raised.
Reviewer 1:
Comment 1: I found this review to be interesting and helpful. I have been interested and involved in forecasting for late blight for some years, and this review was very helpful in collating lots of good information. The comparisons were intriguing. I am not sufficiently versed in the statistical methods for meta-analysis to be able to comment on these. However, on the assumption that they are employed appropriately, I think this manuscript should be published. I found it helpful – perhaps the only review on forecasting that I’ve ever found to be interesting and helpful, and from which I learned things.
Response 1: Thank you very much for your review and comments. We're very glad you found the work useful and that you were able to access the information you needed on predictive models of late blight in a practical and concise way.
Reviewer 2 Report
Comments and Suggestions for Authors
Dear authors,
I commend you on a well-written and well-designed research manuscript. The work is a systematic literature review of decision-support systems targeting Phytophthora blight in potatoes. As one of the oldest formally studied plant diseases, it has generated a large amount of data in the literature that needs to be summarized. The multiple non-mechanistic, semi-mechanistic and mechanistic models, divided by the authors into 3 generations, were assessed for accuracy and potential to reduce fungicide applications. The discussion is sound.
Comments for author File:
Comments.pdf
Comments on the Quality of English Language
There are a few minor typographical errors that are highlighted in the pdf file.
Author Response
Reviewer 2:
Comment 1: I commend you on a well-written and well-designed research manuscript. The work is a systematic literature review of decision-support systems targeting Phytophthora blight in potatoes. As one of the oldest formally studied plant diseases, it has generated a large amount of data in the literature that needs to be summarized. The multiple non-mechanistic, semi-mechanistic and mechanistic models, divided by the authors into 3 generations, were assessed for accuracy and potential to reduce fungicide applications. The discussion is sound.
Response 1: Thank you very much for your attention in the review. We especially appreciate the assessment of the classifications that were carried out in the research (generation and mechanism) as a way to structure the work solidly.
Regarding the comments to improve the English text, we especially appreciate your effort in posting each localized comment. As you can see from the new manuscript, the suggestions have been highlighted in red. Thank you very much!
Reviewer 3 Report
Comments and Suggestions for Authors
The study presents a systematic review with meta-analysis of 59 forecasting criteria/models for potato late blight (Phytophthora infestans), evaluated across 271 trials in 25 countries, and classifies them by “generation” (G1–G3) and mechanism (non-mechanistic, semi-mechanistic, mechanistic). Key findings are that (i) mechanistic models report higher mean “accuracy” than non-mechanistic models, and (ii) third-generation (ML/ANN/algorithmic) models tend to show higher average performance. The historical span (1926–2025) and the dual taxonomy (generation × mechanism) make this a useful synthesis for DSS work in agriculture.
I² is not “publication bias”
In Table 4, I² is labeled and discussed as if it indicated publication bias. It does not. I² quantifies heterogeneity between study effects. Extremely large I² (≈99%) signals very high heterogeneity, not absence of bias. To assess publication bias, add funnel plots and a formal test (e.g., Egger’s regression), and consider a trim-and-fill sensitivity check. Please relabel Table 4 (“Heterogeneity (I²)”) and remove any inference about publication bias from I² alone.
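To make the distinction concrete: I² is computed from Cochran's Q and quantifies only between-study heterogeneity. The following is a minimal hand-rolled Python sketch with toy effect sizes (not data from the manuscript):

```python
# Minimal sketch: Cochran's Q and I-squared for k study effects.
# Effect sizes (y) and within-study variances (v) are invented for illustration.

def i_squared(y, v):
    """Return (Q, I2 in %) for effects y with variances v (inverse-variance weights)."""
    w = [1.0 / vi for vi in v]
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

q, i2 = i_squared([0.2, 0.5, 0.9, 1.4], [0.01, 0.02, 0.015, 0.01])
print(f"Q = {q:.1f}, I^2 = {i2:.1f}%")  # high I^2 means heterogeneity, nothing about bias
```

An I² near 100% here simply reflects that the studies disagree far more than sampling error allows; a funnel plot or Egger's test is still needed before saying anything about publication bias.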
Define outcomes precisely: “accuracy/skill” vs. effectiveness
The manuscript blends forecast skill (e.g., hits, sensitivity/specificity) with effectiveness metrics (e.g., RAUDPC reduction, fungicide savings), sometimes calling 100–RAUDPC “accuracy”. Please separate:
- Forecast skill/accuracy (classification, Brier score, hit/miss rates against observed disease events).
- DSS effectiveness (e.g., reduction in AUDPC; reduction in fungicide applications/AI).
Analyze and compare models within outcome families to avoid apples-to-oranges comparisons.
Meta-analytic model and dependence of effects
It’s unclear whether a random-effects model or a meta-regression was used, and how multiple effects per study/model/site were handled. With many non-independent effects, use random-effects (or multilevel) meta-regression (e.g., with metafor) and, when effects are dependent, robust variance estimation (RVE). Report pooled estimates with 95% CIs, heterogeneity components, and leave-one-out sensitivity.
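For reference, the random-effects pooling and leave-one-out sensitivity check requested here can be sketched with the DerSimonian-Laird estimator. The Python below is a self-contained illustration with hand-rolled formulas and invented data, not the metafor implementation the review recommends:

```python
import math

def dl_random_effects(y, v):
    """DerSimonian-Laird random-effects pool: (estimate, 95% CI, tau2)."""
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0   # between-study variance
    wstar = [1.0 / (vi + tau2) for vi in v]           # re-weight with tau2
    est = sum(wi * yi for wi, yi in zip(wstar, y)) / sum(wstar)
    se = math.sqrt(1.0 / sum(wstar))
    return est, (est - 1.96 * se, est + 1.96 * se), tau2

def leave_one_out(y, v):
    """Pooled estimate omitting each study in turn (sensitivity check)."""
    return [dl_random_effects(y[:i] + y[i + 1:], v[:i] + v[i + 1:])[0]
            for i in range(len(y))]
```

If the leave-one-out estimates stay close to the full pooled estimate, no single study is driving the result; dependent effects from the same study/site would additionally need multilevel structure or robust variance estimation, which this sketch does not cover.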
Numerical and temporal inconsistencies to fix
- The period 1966–2007 is called “81 years”; that is 41 years.
- The abstract says “last 105 years” while the data span is 1926–2025 (~99 years). Please correct both figures for internal consistency.
In §3.2.3 you report a G1 mean for fungicide reduction, but Figure 8 says G1 has no data for that outcome. Please reconcile (either add the underlying G1 data or remove the claim).
Line-by-line
Lines 40–59: Remove the repeated header label (“Type of the Paper”).
Lines 83–91 (Abstract): Replace “last 105 years” with “since 1926 (≈99 years to 2025)”
Lines 101–109 (Abstract): Standardize hyphenation (non-mechanistic, etc.).
Lines 121–133 (Intro): Separate (i) the time frame for “acute hunger” from (ii) causal factors; align statistics and periods clearly.
Lines 172–198 (Intro): Break the long sentence on winds/CO₂/irrigation into shorter assertions, each with a source.
Lines 200–255: First time each historical model is named (Beaumont, Smith, Wallin, Hyre, Blitecast), add the original year.
Lines 351–367 (Methods): Add last search dates and languages.
Lines 381–540 (Table 1): Include the exact database queries (quoted search strings), fields (title/abstract/keywords), and filters (year, language).
Lines 545–571 (Screening): Specify any study-level risk-of-bias checklist or justify not performing one.
Lines 605–644 (Outcomes): Rename RAUDPC metric to “DSS effectiveness (RAUDPC)”; place “correctly forecast days” under forecast skill.
Lines 647–664: State how you handled negative or >100% values and any winsorization of outliers.
Lines 880–961 (Stats): Declare the base model (ANOVA/GLM/GLMM vs. meta-regression); justify transforms for percentages; consider RVE for clustered effects (Hedges et al., 2010; Pustejovsky & Tipton, 2022).
Lines 1016–1021 (Results): Correct “81 years” → 41 years (1966–2007).
Lines 1184–1197 (Results): Add a derivation diagram (supplement) showing model lineages and variants.
Lines 1241–1257 (Table 4 + text): Change “I²: publication bias” → “I²: heterogeneity”; add funnel plot/Egger (Egger et al., 1997).
Lines 1311–1349 (Fig. 5): Add n per group in the legend and exact p-values for pairwise comparisons (besides letter groupings).
Lines 1384–1391 vs. 1539–1540: Resolve the G1 fungicide-reduction contradiction.
Lines 957–961 (Reproducibility): Link a public repository with data and analysis code.
1) Borenstein, M. (2023). The difference between I-squared and prediction intervals. Research Synthesis Methods, 14(6), 896–914. https://doi.org/10.1002/jrsm.1699
2) Meno, L., Escuredo, O., & Seijo, M. C. (2024). Opportunity of the NEGFRY decision support system for the sustainable control of potato late blight in A Limia (NW of Spain). Agriculture, 14(5), 652. https://doi.org/10.3390/agriculture14050652
Author Response
Reviewer 3:
Comment 1: The study presents a systematic review with meta-analysis of 59 forecasting criteria/models for potato late blight (Phytophthora infestans), evaluated across 271 trials in 25 countries, and classifies them by “generation” (G1–G3) and mechanism (non-mechanistic, semi-mechanistic, mechanistic). Key findings are that (i) mechanistic models report higher mean “accuracy” than non-mechanistic models, and (ii) third-generation (ML/ANN/algorithmic) models tend to show higher average performance. The historical span (1926–2025) and the dual taxonomy (generation × mechanism) make this a useful synthesis for DSS work in agriculture.
Response 1: Thank you very much for your review and comments. Below are the comments and suggestions addressed.
Comment 2: I² is not “publication bias”. In Table 4 I² is labeled and discussed as if it indicated publication bias. It does not. I² quantifies heterogeneity between study effects. Extremely large I² (≈99%) signals very high heterogeneity, not absence of bias.
Response 2: We resolved the I² naming errors and supplemented the heterogeneity metric with Egger's regression and a funnel plot for publication bias. We eliminated the initial Q and I² calculations because they were incorrect. The new results are in 3.2 Bibliometric results and Table S2. Thank you very much.
Comment 3: To assess publication bias, add funnel plots and a formal test (e.g., Egger’s regression), and consider a trim-and-fill sensitivity check. Please relabel Table 4 (“Heterogeneity (I²)”) and remove any inference about publication bias from I² alone.
Response 3: This comment has been resolved (with Table S2). Thank you very much.
Comment 4: Define outcomes precisely: “accuracy/skill” vs. effectiveness.
The manuscript blends forecast skill (e.g., hits, sensitivity/specificity) with effectiveness metrics (e.g., RAUDPC reduction, fungicide savings), sometimes calling 100–RAUDPC “accuracy”. Please separate: forecast skill/accuracy (classification, Brier score, hit/miss rates against observed disease events) and DSS effectiveness (e.g., reduction in AUDPC; reduction in fungicide applications/AI). Analyze and compare models within outcome families to avoid apples-to-oranges comparisons.
Response 4: We appreciate the reviewer's distinction between prognostic skill and DSS effectiveness. Following this recommendation, we have revised the terminology in lines 189 and 195 to improve conceptual clarity between the two outcome families.
Regarding the use of 100 – RAUDPC, we wish to clarify that this metric was not intended to represent prognostic skill, but rather the effectiveness of the DSS. RAUDPC (Relative Area Under the Disease Progress Curve) expresses the relative progression of disease under a given DSS model compared to an untreated control. Thus, calculating 100 – RAUDPC provides a standardized measure of the relative effectiveness of the DSS, where 100 represents the theoretical maximum effectiveness (i.e., complete suppression of epidemic development and therefore absolute accuracy). In this sense, 100 – RAUDPC can be interpreted as effectiveness accuracy: a bounded index (0–100) quantifying the percentage reduction in epidemic development attributable to model-based decision-making. Conceptually, this aligns more closely with accuracy metrics.
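For transparency, the computation behind this effectiveness index can be sketched as follows (hypothetical assessment dates and severities, trapezoidal AUDPC):

```python
def audpc(days, severity):
    """Trapezoidal area under the disease progress curve (severity in %)."""
    return sum((days[i + 1] - days[i]) * (severity[i] + severity[i + 1]) / 2
               for i in range(len(days) - 1))

def dss_effectiveness(days, sev_dss, sev_control):
    """100 - RAUDPC: % reduction in epidemic development vs. untreated control."""
    raudpc = 100.0 * audpc(days, sev_dss) / audpc(days, sev_control)
    return 100.0 - raudpc

days = [0, 7, 14, 21]        # assessment days (hypothetical)
control = [1, 10, 40, 80]    # untreated severity %
dss = [1, 3, 8, 15]          # DSS-guided severity %
print(round(dss_effectiveness(days, dss, control), 1))  # -> 79.0
```

A DSS that fully suppressed the epidemic would score 100; one no better than the untreated control would score 0, which is what makes the index bounded and comparable across trials.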
Comment 5: Meta-analytic model and dependence of effects. It’s unclear whether a random-effects model or a meta-regression was used, and how multiple effects per study/model/site were handled. With many non-independent effects, use random-effects (or multilevel) meta-regression (e.g., with metafor) and, when effects are dependent, robust variance estimation (RVE). Report pooled estimates with 95% CIs, heterogeneity components, and leave-one-out sensitivity.
Response 5: Thank you very much for this comment. We have included the suggested statistical methodology, improving and clarifying the presentation of the results. We performed a multilevel meta-regression with mixed effects and robust variance estimation, as well as other sensitivity analyses (lines 267-282).
Comment 6: Numerical and temporal inconsistencies to fix. The period 1966–2007 is called “81 years”; that is 41 years.
Response 6: Corrected, thank you very much.
Comment 7: The abstract says “last 105 years” while the data span is 1926–2025 (~99 years). Please correct both figures for internal consistency.
Response 7: We refer to 105 years (or more than 100 years) because Van Everdingen's first research was published in a journal in 1926, but the dates included in that research begin in 1919. This meta-analysis was first written in 2024; given the progress toward 2025, we have corrected the figure to 106 years. Thank you very much.
Comment 8: In §3.2.3 you report a G1 mean for fungicide reduction, but Figure 8 says G1 has no data for that outcome. Please reconcile (either add the underlying G1 data or remove the claim).
Response 8: This part has been reconciled on lines 565 and 569. Thank you very much.
Comments Line-by-line and responses Line-by-line:
Line 1: Remove the repeated header label (“Type of the Paper”).
Response: Removed, thank you very much.
Line 79 (Abstract): Replace “last 105 years” with “since 1926 (≈99 years to 2025)”.
Response: This clarification has already been made in Response 7.
Line 29 (Abstract): Standardize hyphenation (non-mechanistic, etc.).
Response: Standardized, thank you very much.
Lines 38-44 (Intro): Separate (i) the time frame for “acute hunger” from (ii) causal factors; align statistics and periods clearly.
Response: If you're referring to the beginning of the introduction, we didn't talk about "hunger" but rather about food insecurity, and we believe the way the causal factors were worded facilitates logical continuity in the reading to the next paragraph on climate change. We've corrected the figures and statistics, thank you very much.
Lines 52-54 (Intro): Break the long sentence on winds/CO₂/irrigation into shorter assertions, each with a source.
Response: We have improved the wording, taking into consideration your comment. Thank you very much.
Lines 200–255: First time each historical model is named (Beaumont, Smith, Wallin, Hyre, Blitecast), add the original year.
Response: The year has already been added to each historical model, thank you very much.
Lines 147 (Methods): Add last search dates and languages; (Table 1): Include the exact database queries (quoted search strings), fields (title/abstract/keywords), and filters (year, language).
Response: In Table 1 (line 147) we have added quotation marks to the keyword string and the last search date. We did not filter by language or year, as we sought to conduct a comprehensive, historically complete meta-analysis. Other review criteria can be seen in table footnotes a, b, and c.
Section 2.1.2 (Screening): Specify any study-level risk-of-bias checklist or justify not performing one.
Response: We appreciate the reviewer’s suggestion. This study did not apply a formal study-level risk-of-bias checklist such as ROBIS or QUADAS-2 because these tools are designed primarily for intervention or diagnostic accuracy studies, whereas our meta-analysis focuses on the predictive performance of modeling approaches for Phytophthora infestans.
Instead, we assessed potential sources of bias through methodological quality criteria relevant to modeling studies, including (i) model calibration and validation approach, (ii) number of years and locations included, and (iii) publication type (peer-reviewed vs. non-peer-reviewed).
Additionally, we evaluated residual heterogeneity (I² statistics), performed sensitivity analyses by excluding influential studies, and conducted funnel plots and Egger’s tests to detect potential publication bias (lines 242-259 and Figure 4).
Lines 189-190 (Outcomes): Rename RAUDPC metric to “DSS effectiveness (RAUDPC)”; place “correctly forecast days” under forecast skill.
Response: This comment has already been resolved in Response 4.
Lines 203-204: State how you handled negative or >100% values and any winsorization of outliers.
Response: For "fungicide reduction" we state in lines 202-203: in trials where treatment with forecasting models was superior to routine treatment, a value of 0% was awarded for “fungicide reduction”. For "accuracy", given the way it is calculated in the literature, we did not find values that required winsorization. We state this in lines 203-204. Thank you very much.
Lines 880–961 (Stats): Declare the base model (ANOVA/GLM/GLMM vs. meta-regression); justify transforms for percentages; consider RVE for clustered effects (Hedges et al., 2010; Pustejovsky & Tipton, 2022).
Response: This clarification has already been made in Response 5.
Line 297 (Results): Correct “81 years” → 41 years (1966–2007).
Response: Corrected, thank you very much.
(Results): Add a derivation diagram (supplement) showing model lineages and variants.
Response: We appreciate the reviewer’s suggestion to include a derivation diagram showing model lineages and variants. However, we believe that such a schematic, while interesting, falls beyond the main scope of the present work. Our study focuses on evaluating the predictive robustness and temporal performance of existing models under changing climatic and geographical contexts, rather than on describing their structural evolution or specific input variables.
We fully agree that a detailed reconstruction of model lineages and input architectures would be of great value for the research community, and we consider this a promising direction for future work, where each model’s historical derivation and variable dependencies can be systematically compared.
(Table 4 + text): Change “I²: publication bias” → “I²: heterogeneity”; add funnel plot/Egger (Egger et al., 1997).
Response: This comment has already been resolved in response 2.
Lines 1311–1349 (Fig. 5): Add n per group in the legend and exact p-values for pairwise comparisons (besides letter groupings).
Response: The value “n” is already present in the figure. We have added the p-values for the pairwise comparisons in Figures 5 and 6 to the supplementary material (Tables S3 and S4).
Figure 9: Resolve the G1 fungicide-reduction contradiction.
Response: This comment has already been resolved in Response 8.
Lines 957–961 (Reproducibility): Link a public repository with data and analysis code.
Response: The repository has been included on line 724.
1) Borenstein, M. (2023). The difference between I-squared and prediction intervals. Research Synthesis Methods, 14(6), 896–914. https://doi.org/10.1002/jrsm.1699
2) Meno, L., Escuredo, O., & Seijo, M. C. (2024). Opportunity of the NEGFRY decision support system for the sustainable control of potato late blight in A Limia (NW of Spain). Agriculture, 14(5), 652. https://doi.org/10.3390/agriculture14050652
Response: Thank you very much for the bibliographic proposal. We greatly appreciate continuing to learn and improve the methodological and statistical aspects of our research. Regarding the article by Laura Meno et al. (2024), it is already part of our meta-analysis, as you can see in Reference No. 12.
Final response: We sincerely thank the reviewer for the positive and constructive evaluation. We have carefully addressed all the points mentioned — including the clarification of bias assessment procedures, the harmonization of outcome metrics across studies, and the addition of a meta-regression analysis to strengthen the interpretation of variability among models.
These revisions have improved the clarity, consistency, and robustness of the manuscript, and we truly appreciate the reviewer’s guidance in helping us enhance its overall quality.
Reviewer 4 Report
Comments and Suggestions for Authors
The paper reviews 100+ years of work on late blight forecasting. It groups models by type (rule-based, semi-mechanistic, mechanistic) and compares their performance across many countries and trials. Main takeaways: mechanistic models tend to perform better, and decision tools can reduce fungicide use.
Positive features
- A clear overview of many models and trials. Helpful for researchers and extension;
- The “generations/mechanism” framework makes the history and differences easy to follow;
- The paper reports overall accuracy and potential spray reduction in a way practitioners can understand;
- Many countries and decades are included, adding weight to the conclusions.
Fragilities
- Very high variability. I² is ~100%, meaning results differ a lot across studies. This does not prove there is no publication bias. Could you add funnel plots and simple bias checks (e.g., Egger, trim-and-fill) and adjust the wording?
- Mixed “accuracy” definitions. The authors combine different measures (e.g., accuracy %, severity/RAUDPC, “correct forecast days”). Could you standardize definitions in a table and run sensitivity analyses using one consistent metric?
- Uncertain study weights. Many studies lack SE/CI. You impute variance from SD and n. Please justify this choice and add robustness checks (e.g., leave-one-out, robust variance). Report τ² together with Q and I².
- Confounding factors. The pairwise comparisons do not adjust for region, host resistance, fungicide program, weather data resolution (hourly vs daily), or calibration. A simple meta-regression (mixed effects) with these moderators would strengthen the conclusions.
- “Consistency over time”: The current method is descriptive. Consider a model including “year” (and region) as moderators, or at least show trend lines per model with basic adjustment.
- Model lineage and overlap. Some derivative models and overlapping trials are hard to track. Add a flow/table mapping parent → derivative models and clarify how you avoided double counting.
Minor comments
- Standardize terms and capitalization (e.g., “non-mechanistic”, “semi-mechanistic”, “mechanistic”);
- When saying mechanistic models “perform better,” note that the difference between the newer groups may not be significant everywhere;
- Could you add a one-page “model at a glance” table: inputs needed, calibration needs, hourly/daily weather, host resistance input, etc.
The review is valuable and timely. Addressing the points above (bias checks, metric harmonization, and a simple meta-regression) will make the paper clearer and stronger.
Author Response
Reviewer 4:
The paper reviews 100+ years of work on late blight forecasting. It groups models by type (rule-based, semi-mechanistic, mechanistic) and compares their performance across many countries and trials. Main takeaways: mechanistic models tend to perform better, and decision tools can reduce fungicide use.
Comment: Positive features
A clear overview of many models and trials. Helpful for researchers and extension; The “generations/mechanism” framework makes the history and differences easy to follow; The paper reports overall accuracy and potential spray reduction in a way practitioners can understand; Many countries and decades are included, adding weight to the conclusions.
Response: We sincerely appreciate the reviewer’s positive evaluation. We are glad that the overview of predictive models and trials, as well as the generation/mechanism framework, were found to be clear and useful for both researchers and practitioners.
Our aim was precisely to make the historical evolution of potato late blight forecasting systems accessible and comparable, while summarizing accuracy and fungicide reduction in a way that bridges scientific and applied perspectives.
We are grateful for the reviewer’s encouraging feedback regarding the breadth of countries, decades, and data included, which we believe strengthens the robustness and general relevance of our conclusions.
Comment: Fragilities. Very high variability. I² is ~100%, meaning results differ a lot across studies. This does not prove there is no publication bias. Could you add funnel plots and simple bias checks (e.g., Egger, trim-and-fill) and adjust the wording?
Response: We have recalculated the I² values, since the previous calculation was incorrect. We have also added Egger's regression and its components (I² included) to Table S2, and have added Figure 4 with the funnel plots as well as a trim-and-fill calculation whose components can be seen in Figure 4. Thank you very much for the proposal, which served to implement this missing methodology in the meta-analysis.
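As a point of reference, Egger's test reduces to regressing the standardized effect (y/SE) on precision (1/SE) and checking whether the intercept departs from zero. The following minimal Python sketch uses illustrative values only, not the study's data:

```python
import math

def egger_intercept(y, se):
    """Egger's regression: standardized effect ~ precision; returns (intercept, t-stat)."""
    z = [yi / si for yi, si in zip(y, se)]   # standardized effects
    x = [1.0 / si for si in se]              # precisions
    n = len(y)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / sxx
    a = zbar - b * xbar                      # intercept: != 0 suggests asymmetry
    resid = [zi - (a + b * xi) for xi, zi in zip(x, z)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)
    se_a = math.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
    return a, a / se_a
```

An intercept far from zero (large t-statistic) indicates funnel-plot asymmetry, i.e., small, imprecise studies reporting systematically larger effects than precise ones.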
Comment: Mixed “accuracy” definitions. The authors combine different measures (e.g., accuracy %, severity/RAUDPC, “correct forecast days”). Could you standardize definitions in a table and run sensitivity analyses using one consistent metric?
Response: We appreciate the reviewer's insightful distinction between prognostic skill and DSS effectiveness. Following this recommendation, we have revised the terminology in lines 189 and 195 to improve conceptual clarity between the two outcome families.
Regarding the use of 100 – RAUDPC, we wish to clarify that this metric was not intended to represent prognostic skill, but rather the effectiveness of the DSS. RAUDPC (Relative Area Under the Disease Progress Curve) expresses the relative progression of disease under a given DSS model compared to an untreated control. Thus, calculating 100 – RAUDPC provides a standardized measure of the relative effectiveness of the DSS, where 100 represents the theoretical maximum effectiveness (i.e., complete suppression of epidemic development and therefore absolute accuracy). In this sense, 100 – RAUDPC can be interpreted as effectiveness accuracy: a bounded index (0–100) quantifying the percentage reduction in epidemic development attributable to model-based decision-making. Conceptually, this aligns more closely with accuracy metrics.
Comment: Uncertain study weights. Many studies lack SE/CI. You impute variance from SD and n. Please justify this choice and add robustness checks (e.g., leave-one-out, robust variance). Report τ² together with Q and I².
Response: We accompanied the meta-regression with robust variance calculations, whose components can be seen in Table S5. Using Egger's regression, we recalculated the heterogeneity and publication-bias values, and we imputed the sample sizes taking into account the duration of each study, assuming that longer studies provide more reliable effect estimates. This is explained in the methodology section at line 249.
Comment: Confounding factors. The pairwise comparisons do not adjust for region, host resistance, fungicide program, weather data resolution (hourly vs daily), or calibration. A simple meta-regression (mixed effects) with these moderators would strengthen the conclusions.
Response: Thank you very much for the comment. We ran a hybrid mixed-effects multilevel regression with all the moderators for which we had metrics. The results are available in 3.2.4 Meta-analytic results.
Comment: “Consistency over time”: The current method is descriptive. Consider a model including “year” (and region) as moderators, or at least show trend lines per model with basic adjustment. Model lineage and overlap. Some derivative models and overlapping trials are hard to track. Add a flow/table mapping parent → derivative models and clarify how you avoided double counting.
Response: We appreciate the reviewer’s thoughtful comment. Regarding “Consistency over time,” we acknowledge that a more complex modeling approach incorporating year and region as moderators could further explore temporal dynamics. However, the aim of the present study was to provide a descriptive synthesis of validated models rather than to fit meta-regressions over time.
The figure referred to in this section already conveys temporal consistency through the generation variable, which integrates the chronological dimension of model development. To ensure comparability, models supported by three or more independent trials were included — those listed along the Y-axis.
Concerning model lineage and overlap, we recognize that certain derivative models share conceptual origins. To minimize double counting, each model was treated based on its publication-defined framework, and overlapping trials were carefully cross-checked. A detailed mapping of model derivations was not included, as it falls beyond the main scope of the current analysis, but it represents a valuable direction for future work.
Comment: Could you add a one-page “model at a glance” table: inputs needed, calibration needs, hourly/daily weather, host resistance input, etc.
Response: We appreciate the reviewer’s suggestion to include a one-page summary table detailing model inputs, calibration needs, temporal resolution (hourly/daily), and host-resistance parameters. However, we believe that such an in-depth technical synthesis would extend beyond the primary aim of the present study.
Our focus was to assess the overall predictive robustness and effectiveness of forecasting models across regions and decades, rather than to describe the internal structure or parameterization of each model.
We agree that compiling and harmonizing this information would be highly valuable for the research community, and we consider it a promising avenue for future work aimed at a more comprehensive methodological comparison among forecasting systems.
Comment: The review is valuable and timely. Addressing the points above (bias checks, metric harmonization, and a simple meta-regression) will make the paper clearer and stronger.
Final response: We sincerely thank the reviewer for the positive and constructive evaluation. We have carefully addressed all the points mentioned — including the clarification of bias assessment procedures, the harmonization of outcome metrics across studies, and the addition of a meta-regression analysis to strengthen the interpretation of variability among models.
These revisions have improved the clarity, consistency, and robustness of the manuscript, and we truly appreciate the reviewer’s guidance in helping us enhance its overall quality.

