Review Reports - Leveraging Ensemble Machine Learning Models for the Detection of Primary Myelofibrosis in Electronic Health Records

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript aims to assess the feasibility of a machine‑learning platform for identifying primary myelofibrosis using complex, real‑world clinical data derived from electronic health records. Its primary contribution is methodological: the authors compare two positive–unlabeled (PU) learning strategies across two classification paradigms—traditional supervised binary classification and semi‑supervised PU learning.

While the topic of AI‑assisted disease identification is of growing interest, the manuscript’s emphasis is predominantly methodological. For a hematology readership, the central concerns are the performance, clinical interpretability, and biological plausibility of the classification models. Much of the manuscript, however, is devoted to technical evaluation of machine‑learning frameworks, which may limit accessibility for clinicians and pathophysiologists.

The manuscript contains extensive methodological detail that may be difficult for a clinical hematologist to evaluate. Although the authors make commendable efforts to maintain clarity, several sections remain highly technical and would benefit from relocation to supplementary materials.

The pragmatic classification of patients into definite, uncertain, and excluded myelofibrosis, along with the selection of controls, is appropriate for a real‑world EHR‑based study. This aspect is clearly described and aligns with the study’s real‑world intent.

The authors list seven libraries used for data extraction and preprocessing, but it is unclear how outputs from these libraries were integrated or selected. For readers without a machine‑learning background, a brief explanation of the decision rules or harmonization strategy would improve transparency.

Model training and evaluation (Section 2.4). This section describes two training paradigms and four algorithms in approximately 40 lines. Given the journal’s clinical focus, this level of technical detail could be moved to the supplementary materials, with only the essential conceptual framework retained in the main text.

Results of PU‑learning vs. binary classification (Section 3.2). The comparison of binary and PU‑learning methods is methodologically interesting. However, relocating this 37‑line section to the supplementary materials would streamline the manuscript. The main text should emphasize the key finding: the LightGBM configuration outperformed all PU‑learning variants, justifying its use for downstream interpretability analyses.

Interpretability and feature importance (Section 3.3). The authors report that model predictions were driven primarily by CBC parameters. This raises an important clinical question: Were histopathological features from bone marrow biopsy included in the model? If not, the authors should explicitly justify their exclusion, given that bone marrow morphology is the cornerstone of myelofibrosis diagnosis. Clarifying this point is essential for clinical credibility.

The discussion is well structured and appropriately integrates methodological considerations with diagnostic implications.

Author Response

Comments 1:

Response 1: We thank you for this observation regarding the technical depth of the manuscript. We would like to clarify that our work is primarily addressed to the Data Science community and researchers specializing in computational tools within a medical context. The study aims to contribute to the niche but critical intersection of Artificial Intelligence and rare disease diagnostics.

To maintain clarity for a broader readership, we have refined the Introduction to explicitly state that the manuscript pursues two complementary goals:

We investigate the potential of creating screening algorithms based on routine CBC tests and ICD-10 codes to identify patients suspected of having PMF, which is of direct interest to hematologists and data scientists.
We evaluate the applicability and efficiency of Positive-Unlabeled learning in a real-world clinical setting, which is a significant contribution to the field of medical AI.

We believe that this dual focus is essential, as the effectiveness of rare disease screening is inextricably linked to the choice of appropriate modeling methodology. By clarifying these objectives in the introduction, we aim to better manage the expectations of both clinical and technical readers.

Comments 2:

Response 2: We agree that providing more context for the selected libraries would benefit readers from different backgrounds. Although these libraries represent the standard toolkit in data science, we have expanded Section 2.2 to better explain their integration. The revised section now details the purpose of each library and the harmonization strategy used to combine their outputs into a unified dataset.

Comments 3:

Response 3: Thank you for this constructive suggestion. We have revised Section 2.4 (Model Training and Evaluation) by reducing the technical detail to the necessary minimum, focusing instead on the high-level logic of our approach.

Recognizing the importance of transparency, we have transferred the comprehensive technical specifications to the Supplement - Details of PU methods. We believe this restructuring improves the manuscript's flow for the clinical reader without compromising the reproducibility of our research for the data science community.

Comments 4:

Response 4: We appreciate the suggestion to streamline the manuscript; however, we would like to emphasize that the comparison between binary and PU-learning models (Section 3.2) is fundamental to the objectives of this study. As stated in our revised Introduction, one of our primary research goals is to evaluate the applicability of the PU paradigm in the context of rare disease screening.

This section is essential because it provides the empirical evidence needed to show how PU methods behave in a real-world clinical dataset compared with traditional binary approaches. Relocating this analysis to the supplement would disconnect the clinical results from their methodological justification. We believe that demonstrating the effectiveness (or limitations) of PU learning is a valuable scientific contribution of this work, particularly for the Data Science audience interested in how these frameworks handle highly skewed, unlabeled medical data. Importantly, to the best of our knowledge, our study is the first to explore PU learning in real-life EHR data.

Comments 5:

Response 5: Thank you for this important comment.

Histopathological features from bone marrow biopsy were not included in our model. This was an intentional design choice aligned with the primary objective of the study. The aim of the algorithm is to support early identification of patients at risk of myelofibrosis prior to referral to specialized hematology care, where bone marrow biopsy is typically performed.

In clinical practice, patients who have already undergone a bone marrow biopsy are, by definition, under specialist hematologic evaluation. Therefore, incorporating histopathological findings would limit the model’s applicability to a later stage of the diagnostic pathway and reduce its value as an early screening or triage tool.

Our model is specifically intended to assist in identifying patients who may benefit from initial hematology consultation, thereby facilitating earlier referral and diagnosis. Excluding bone marrow biopsy features ensures that the algorithm remains applicable in non-specialist settings, such as primary care or general internal medicine, where such data are not routinely available.

We agree that bone marrow morphology remains the cornerstone of definitive diagnosis of myelofibrosis. However, the purpose of our model is not to replace diagnostic standards, but to enable earlier detection and appropriate referral for confirmatory evaluation.

We have addressed these concerns by updating the "Data Extraction and Preprocessing" section to clarify that histopathological features were intentionally excluded. Furthermore, we added a new section titled "Clinical Application and Screening Framework" to justify this approach, explaining that the model aims to identify high-risk patients who require urgent referral for bone marrow biopsy rather than replacing the histological gold standard.

Attachment

We have addressed all the comments provided in the review process. For your convenience, we have included manuscript-diff.pdf, a document highlighting the changes made to the original submission.

The difference file was generated using the latexdiff utility. Please note the following formatting conventions:

Added text is displayed in blue with a wavy underline.
Deleted text is marked in red and struck through.

Important Technical Clarifications:

Figures: Please be advised that latexdiff does not visually track changes within figures. However, all figures have been updated in accordance with your suggestions. The final manuscript includes updated visualizations.
Reference Numbering: Due to the technical limitations of the latexdiff tool, the citation numbering in the difference file may not perfectly align with the final version. We apologize for any inconvenience. The final PDF displays the bibliography correctly.

We hope these revisions meet your expectations and look forward to your further assessment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The study presents a compelling application of ensemble machine learning to address a clinically important problem: the early detection of a rare hematologic malignancy using real-world, heterogeneous electronic health record data. The authors’ use of multiple ensemble methods, careful handling of class imbalance, and exploration of positive-unlabeled learning are particularly noteworthy strengths. Despite these strengths, I recommend Major Revision to address several concerns.

Major Comments

The manuscript oscillates between presenting the model as a diagnostic tool (identifying patients who already have PMF) and a screening tool (prioritizing at-risk patients for further evaluation). Please clearly define the intended clinical use case and frame the study objectives and conclusions accordingly.
With only 67 PMF cases, repeated cross-validation on the same small set risks overfitting. Please discuss how this limitation was mitigated (e.g., through aggressive regularization, simple model architectures, or external validation plans) and acknowledge the implications for generalizability.
The PU learning methods did not improve precision and increased the false-positive burden, yet the conclusion recommends PU as a standard approach. Please revise the conclusion to reflect the actual findings and instead frame PU learning as a conceptual insight rather than a ready-for-implementation method.
The discussion notes that myeloid parameters were less important than nonspecific CBC indices (RDW, PLT), yet this finding is not fully explored. Please provide a more critical interpretation: does this reflect genuine biology, data limitations, or model behavior?

Minor Comments

Figures 4 and 7 are information-dense and difficult to interpret quickly. Consider simplifying visual elements or providing more detailed captions to guide readers.
The terms "PMFc," "Cr," "FP," "TN" are used consistently but may benefit from a brief reminder of definitions when first reintroduced in later sections.
Please briefly describe how the physician panel adjudicated the 139 false-positive cases, including inter-rater reliability or consensus approach, to strengthen confidence in the manual validation.
A careful proofreading pass is recommended to correct small grammatical inconsistencies (e.g., subject-verb agreement, article usage) and ensure consistent formatting of references.

Author Response

Comments 1:

The manuscript oscillates between presenting the model as a diagnostic tool (identifying patients who already have PMF) and a screening tool (prioritizing at-risk patients for further evaluation). Please clearly define the intended clinical use case and frame the study objectives and conclusions accordingly.

Response 1: We completely agree with the Reviewer’s observation. Defining a clear clinical application is crucial for the interpretability of our results. We have now explicitly framed the model as a preliminary screening and prioritization tool rather than a standalone diagnostic instrument.

To address this, we have added a new section in the Discussion titled "Clinical Application and Screening Framework". In this section, we clarify that the model’s primary value lies in its ability to navigate the "needle in a haystack" problem - identifying high-risk individuals in large-scale EHR populations where PMF prevalence is extremely low. We have also updated the study objectives in the Introduction and the Conclusion to reflect this refined focus.

Comments 2:

With only 67 PMF cases, repeated cross-validation on the same small set risks overfitting. Please discuss how this limitation was mitigated (e.g., through aggressive regularization, simple model architectures, or external validation plans) and acknowledge the implications for generalizability.

Response 2: To address the risk of overfitting associated with the small positive cohort, we have significantly expanded the Model Training and Evaluation section, emphasizing the intentionality behind our methodological choices. We clarify that the use of highly constrained architectures, such as shallow decision trees and rigorous regularization (Supplementary Table 1. Hyperparameter tuning settings), was a deliberate strategy to force the models to rely on key biological features rather than memorizing noise.

To provide quantitative evidence of these efforts, we have added a supplementary file containing the training and validation loss curves (Supplement - Train and Valid Loss). The convergence of these curves demonstrates the absence of overfitting, a point we now discuss in detail within the Model Performance and Evaluation section. Furthermore, we have updated the Limitations and Future Directions in the Discussion to explicitly acknowledge the inherent risks of a small sample size while highlighting our use of 10-times repeated stratified cross-validation and uncertainty quantification (95% CI) as robust tools for ensuring stable and reliable performance estimates in this ultra-rare disease context.

Comments 3:

The PU learning methods did not improve precision and increased the false-positive burden, yet the conclusion recommends PU as a standard approach. Please revise the conclusion to reflect the actual findings and instead frame PU learning as a conceptual insight rather than a ready-for-implementation method.

Response 3: We have revised the Conclusion section to align with your recommendation, framing PU learning as a subject for further investigation rather than a ready-for-implementation method.

In the updated text, we explicitly acknowledge that while PU methods can enhance sensitivity, the associated loss of precision results in a prediction volume that exceeds the operational capacity of real-world clinical screening workflows. We now present the PU paradigm as a theoretical validation of label incompleteness and a promising direction for future research in domains where sensitivity is prioritized, rather than as a current replacement for established supervised methods.

Comments 4:

The discussion notes that myeloid parameters were less important than nonspecific CBC indices (RDW, PLT), yet this finding is not fully explored. Please provide a more critical interpretation: does this reflect genuine biology, data limitations, or model behavior?

Response 4: Thank you for this insightful observation. We have updated the section "Interpretability of Model Findings and Their Clinical Relevance" to provide a more critical analysis of why nonspecific indices like RDW and PLT appeared more influential than classical myeloid parameters. The revised text explains that this finding likely results from a combination of factors: the systemic nature of early PMF, where hematopoietic dysregulation manifests as broad variability, the inherent limitations of real-world EHR data regarding measurement frequency, and the tendency of tree-based models to prioritize stable predictive patterns over pathophysiological specificity. By addressing these biological, data-driven, and algorithmic perspectives, we clarify that the model’s behavior reflects a complex synergy of real-world constraints rather than a contradiction of established clinical knowledge.

Comments 5:

Figures 4 and 7 are information-dense and difficult to interpret quickly. Consider simplifying visual elements or providing more detailed captions to guide readers.

Response 5: We agree that the original labeling of Figure 7 was unfortunate and potentially confusing. To address this, we have moved the standard, tabular confusion matrices for all model variants to a dedicated supplementary file (Supplement - Confusion Matrices), ensuring that the classification results are available in a conventional format. We have also updated the caption and labels of Figure 7 to clarify that its purpose is to illustrate model consensus - specifically, the overlap in predictions among the three algorithms - rather than to serve as a substitute for a standard confusion matrix.
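The consensus view described for Figure 7 can be sketched with plain set operations. This is a minimal illustration, assuming each model's flagged patients are available as a set of identifiers; the patient IDs below are hypothetical and are not data from the study.

```python
def consensus_regions(flagged):
    """Count patients in each exclusive region of the Venn diagram.

    `flagged` maps a model name to the set of patient IDs it flagged.
    The result is keyed by the frozenset of models that flagged a patient.
    """
    regions = {}
    all_ids = set().union(*flagged.values())
    for pid in all_ids:
        key = frozenset(name for name, ids in flagged.items() if pid in ids)
        regions[key] = regions.get(key, 0) + 1
    return regions


# Hypothetical predictions from three models (IDs are made up)
flagged = {
    "LightGBM": {1, 2, 3, 4},
    "XGBoost": {2, 3, 5},
    "CatBoost": {3, 4, 6},
}
regions = consensus_regions(flagged)
flagged_by_all_three = regions[frozenset(flagged)]
```

Each count belongs to exactly one region, so the counts sum to the number of distinct flagged patients, which is what the numbers inside a Venn diagram are expected to do.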

We acknowledge that Figure 4 is information-dense; however, this complexity is intentional and fundamental to the manuscript's objectives. This figure serves to highlight the specific challenges of plotting Precision-Recall characteristics in environments characterized by extreme class imbalance and a limited number of positive labels. To ensure these critical nuances are accessible to the reader, we have significantly expanded and refined the figure caption. These updates are designed to guide the reader through the visualization, explicitly clarifying how the plotted curves reveal the behavior of the models under these constrained, real-world conditions. We believe this restructuring fulfills the need for clarity while retaining the insights regarding model agreement.
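The difficulty of plotting precision-recall characteristics with very few positive labels can be illustrated with a short sketch. It assumes a ranked list of scores with no ties; the scores and labels are hypothetical, and the function is a plain illustration rather than the evaluation code used in the manuscript.

```python
def precision_recall_points(scores, labels):
    """Return one (recall, precision) pair per ranked prediction."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank patients from highest to lowest score
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points, true_pos = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        points.append((true_pos / n_pos, true_pos / rank))
    return points


# Hypothetical example: 2 positives among 8 patients
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 0, 0, 1, 0, 0, 0, 0]
points = precision_recall_points(scores, labels)
```

With only a handful of positives, recall can take only a handful of distinct values, so the curve moves in large steps and a single reordered patient can shift precision substantially.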

Comments 6:

The terms "PMFc," "Cr," "FP," "TN" are used consistently but may benefit from a brief reminder of definitions when first reintroduced in later sections.

Response 6: We have addressed this comment by reintroducing the full definitions of PMFc, Cr, FP, and TN at their first mention in key sections (Results and Discussion) to ensure clarity. Additionally, we have identified and expanded all other abbreviations throughout the manuscript that were previously undefined.

Comments 7:

Please briefly describe how the physician panel adjudicated the 139 false-positive cases, including inter-rater reliability or consensus approach, to strengthen confidence in the manual validation.

Response 7: Thank you for this important comment. The section “False Positive Patients Analyses” has been updated to detail the physician panel’s workflow.

The adjudication of the 139 false-positive cases followed a structured, multi-step process involving two independent physician panels. Each case was reviewed and labeled separately by both teams using available clinical data (laboratory results, ICD codes, and clinical documentation). In cases of disagreement, discrepancies were discussed jointly, and final decisions were reached through a consensus process, with arbitration by a senior hematologist with over 20 years of clinical experience when needed.

We prioritized a structural resolution framework over inter-rater metrics to guarantee robustness. By utilizing a dual-team review followed by a mandatory consensus phase, we actively mitigated bias at the source. This process goes beyond simple agreement tracking by ensuring that every data point undergoes rigorous cross-verification and final reconciliation.
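For reference, the inter-rater reliability the reviewer mentions is commonly summarized with Cohen's kappa, computed on the two panels' independent labels before the consensus phase. The sketch below is illustrative only: the labels are hypothetical, and no such statistic is reported in this response.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single category
    return (observed - expected) / (1.0 - expected)


# Hypothetical pre-consensus labels from the two panels
panel_1 = ["confirmed", "confirmed", "excluded", "excluded"]
panel_2 = ["confirmed", "excluded", "excluded", "excluded"]
kappa = cohens_kappa(panel_1, panel_2)
```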

Labeling was performed according to a standardized framework to ensure consistency and support both clinical validation and methodological evaluation (including algorithm and NLP performance). We distinguished between disease risk (high, medium, low, no risk, not enough data) and diagnostic status (confirmed, excluded, suspected, unknown, or assigned by mistake), with predefined, disease-specific criteria.

Comments 8:

A careful proofreading pass is recommended to correct small grammatical inconsistencies (e.g., subject-verb agreement, article usage) and ensure consistent formatting of references.

Response 8: We have conducted a thorough linguistic and technical review of the entire manuscript. All identified grammatical inconsistencies, including subject-verb agreement and article usage, have been corrected to ensure professional academic standards. Additionally, we have performed a comprehensive audit of the bibliography to ensure consistent and accurate formatting of all references throughout the text.

Attachment

We have addressed all the comments provided in the review process. For your convenience, we have included a ZIP archive containing the following files:

manuscript-diff.pdf: A document highlighting the changes made to the original submission.
manuscript_2026-04-24.pdf: The final, clean version of the revised manuscript.
LaTeX Source Files: The complete source code used to generate the documents.

Notes on the Revision Tracking (manuscript-diff.pdf)

The difference file was generated using the latexdiff utility. Please note the following formatting conventions:

Added text is displayed in blue with a wavy underline.
Deleted text is marked in red and struck through.

Important Technical Clarifications:

Figures: Please be advised that latexdiff does not visually track changes within figures. However, all figures have been updated in accordance with your suggestions. We recommend referring to the final manuscript (manuscript_2026-04-24.pdf) to view the updated visuals.
Reference Numbering: Due to the technical limitations of the latexdiff tool, the citation numbering in the difference file may not perfectly align with the final version. We apologize for any inconvenience this may cause and kindly ask you to use the final PDF as the primary reference for the bibliography.

We hope these revisions meet your expectations and look forward to your further assessment.

Reviewer 3 Report

Comments and Suggestions for Authors

While the clinical question is relevant and the multi-center EHR setting is a strength, the manuscript has fundamental methodological weaknesses, an underpowered positive cohort, inconsistencies in result reporting, incomplete figures, and presentation issues that collectively prevent acceptance. The concerns below are organized by section.

Only 67 confirmed PMF patients from 3 of 10 hospitals are used to train models with 212 features (~0.32 patients per feature, or ~3.2 features per patient). This ratio is far below acceptable norms, and the overfitting risk is acknowledged only qualitatively in Limitations; it must be addressed quantitatively.
Age- and gender-matched Cr patients deviate substantially from the general population. A precision of 14.72% under this artificial class distribution cannot be extrapolated to clinical screening without population-level prevalence correction, which is not provided.
ICD-10 code Z03 ("observation for suspected disease") is the highest-ranked non-CBC feature in Figure 5. A patient flagged for diagnostic observation is already clinically suspected — this likely represents soft label leakage that undermines the model's validity.
No imputation was performed, but how within-record missingness for specific CBC parameters was handled during training (e.g., treated as zero, binned separately) is never stated.
The binary setting treats Cr as true negatives; the PU setting treats Cr as an unlabeled mixture containing hidden positives. These assumptions are mutually contradictory and used simultaneously. If Cr contains hidden positives, all binary metrics (sensitivity, precision, specificity) are invalid.
The non-standard precision-recall evaluation adopted in lines 278–280 needs citation or derivation. Choosing a metric after observing results risks circularity.
AUPRC of 20.31% and sensitivity of 45.52% mean the model misses over half of true PMF cases and generates roughly 6–7 false positives per true positive. The clinical workflow implications of this operating profile are not discussed.
Table 3 bolds LightGBM as best by AUPRC point estimate, yet XGBoost achieves a higher point AUPRC (21.71%) with a confidence interval so wide (2.57–43.36%) it is essentially uninformative. The stated rationale for choosing LightGBM shifts between stability and AUPRC without a consistent, pre-specified rule.
Of the 139 consensus FPs reviewed, 41 had PMF excluded by biopsy, 72 had no diagnostic data, and only 4 were confirmed PMF. Characterizing this group as predominantly "hidden positives" is not supported by the data.
Figure 3 is difficult to read. Overlapping ICD-10 labels in the volcano plot reduce legibility. A repel-based annotation strategy or a supplementary table with full feature statistics is needed.
Figure 7 is non-standard and confusing. Using Venn diagrams to display confusion matrix outcomes across three model variants is unconventional. The numbers in overlapping regions are never explained. A standard tabular confusion matrix would be clearer.

Comments for author File: Comments.pdf

Author Response

Comments 1:

Only 67 confirmed PMF patients from 3 of 10 hospitals are used to train models with 212 features (~0.32 patients per feature, or ~3.2 features per patient). This ratio is far below acceptable norms, and the overfitting risk is acknowledged only qualitatively in Limitations; it must be addressed quantitatively.

Response 1: PMF is an ultra-rare malignancy with an expected incidence below 1 per 100,000. To identify our cohort, we conducted a rigorous manual review of the complete medical records of 448 potential cases - a process involving multiple clinicians over several months (Section 2.1). While 233 cases had a confirmed diagnosis, only 67 patients met the strict criteria of having CBC data available prior to diagnosis within the observable timeframe (Section 3.1). In several participating hospitals, no PMF cases were found, or those identified lacked the necessary CBC data. Expanding this "gold standard" cohort is a long-term process, and we opted to proceed with this diversified multi-center dataset to ensure the model captures real-world clinical variability.

To mitigate the risks associated with a small positive class, we employed a robust statistical framework (Section 2.4):

10-times repeated stratified cross-validation (10 folds): We used repeated, stratified, and randomized splits to ensure stable performance estimates.
Uncertainty Quantification: All metrics are reported with 95% Confidence Intervals (CI) to transparently present the dispersion of results (Section 2.4).
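The fold structure behind this framework can be sketched in a few lines. This is a simplified illustration of repeated stratified splitting, not the authors' pipeline: the number of controls (670) and the seed are hypothetical, and only the count of 67 positives comes from the response.

```python
import random


def repeated_stratified_folds(labels, n_splits=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, test_indices) with class ratios preserved per fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so fold sizes differ by at most one
            for position, idx in enumerate(shuffled):
                folds[position % n_splits].append(idx)
        for fold, test_indices in enumerate(folds):
            yield repeat, fold, sorted(test_indices)


# 67 positives as stated in the response; the control count is hypothetical
labels = [1] * 67 + [0] * 670
positives_per_fold = [
    sum(labels[i] for i in test)
    for _, _, test in repeated_stratified_folds(labels)
]
```

With 67 positives and 10 folds, every test fold contains only 6 or 7 positive cases, which is why per-fold metrics vary widely and why reporting the spread across all 100 folds matters.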

Overfitting was addressed quantitatively through constrained learning and strict regularization (Appendix 1). We utilized:

Shallow Decision Trees: Limiting tree depth to prevent the memorization of noise.
Heavy Regularization: Constraints on the number of trees and leaf size.
Feature Selection: Consequently, the models rely primarily on a few key CBC features (e.g., RDW, PLT), effectively reducing the functional feature space.
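As a rough illustration of what such constraints look like in practice, the fragment below lists gradient-boosting settings of the kind described. The parameter names follow the scikit-learn style interface of LightGBM (`LGBMClassifier`); every value is a hypothetical placeholder chosen for illustration and is not taken from the authors' hyperparameter table.

```python
# Hypothetical values for illustration only; not the authors' settings.
constrained_params = {
    "max_depth": 3,           # shallow trees limit memorization of noise
    "num_leaves": 7,          # kept below 2 ** max_depth
    "n_estimators": 100,      # cap on the number of trees
    "min_child_samples": 20,  # minimum number of patients per leaf
    "reg_lambda": 1.0,        # L2 penalty on leaf weights
    "colsample_bytree": 0.5,  # each tree sees a random half of the features
}

# Assumed usage, if LightGBM is installed:
#   model = lightgbm.LGBMClassifier(**constrained_params)
```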

While originally omitted to maintain clinical focus, we have now included the training vs. test loss functions (averaged across all folds) in Supplement - Train and Valid Loss. The convergence of these curves demonstrates that overfitting is not present in our final models.

We agree that the transferability of these solutions requires further study (Section 4.5), but we believe that given the extreme rarity of PMF, our current results represent a valuable contribution to the field.

Comments 2:

Age- and gender-matched Cr patients deviate substantially from the general population. A precision of 14.72% under this artificial class distribution cannot be extrapolated to clinical screening without population-level prevalence correction, which is not provided.

Response 2: Thank you for this important observation.

We agree that the age- and gender-matched control (Cr) cohort does not reflect the true population distribution, and therefore performance metrics such as precision (PPV) obtained under this artificial class balance cannot be directly extrapolated to population-level screening settings.

However, our study is not designed to operate at the general population level. Instead, the model is intended for use within a hospital-based EHR environment, where the underlying patient distribution, disease prevalence, and referral patterns differ substantially from the general population. The relationship between population-level prevalence and hospital-based cohorts is complex and not directly transferable, particularly for rare diseases such as PMF.

Importantly, the matching strategy was deliberately applied to make the task more challenging and to reduce the risk that the model would rely on simple demographic signals (such as age) to distinguish cases from controls. This forces the model to learn more subtle, disease-related patterns rather than exploiting easily separable features.

Our primary objective was to evaluate the model’s discriminative ability under controlled conditions, rather than to estimate real-world PPV. We acknowledge that in practical deployment, PPV will depend on the target population and should be recalibrated according to the local prevalence and clinical setting.
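The dependence of PPV on prevalence follows directly from Bayes' rule, and a small sketch makes the size of the effect concrete. The sensitivity of 45.52% is the figure cited in the review; the specificity and the prevalence values are hypothetical placeholders, since neither is given here.

```python
def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value implied by Bayes' rule at a given prevalence."""
    for name, value in (
        ("sensitivity", sensitivity),
        ("specificity", specificity),
        ("prevalence", prevalence),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0.0:
        return float("nan")  # no positive predictions at all
    return true_pos / (true_pos + false_pos)


# Specificity (0.99) and the prevalence grid are assumptions for illustration
for prevalence in (0.01, 0.001, 0.0001):
    print(prevalence, ppv_at_prevalence(0.4552, 0.99, prevalence))
```

Holding sensitivity and specificity fixed, PPV falls as prevalence falls, which is why a precision measured in a matched cohort should not be carried over to a different target population without recalculation.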

We have clarified these points in the revised section “Limitations and Future Directions” to ensure appropriate interpretation of the results.

Comments 3:

ICD-10 code Z03 ("observation for suspected disease") is the highest-ranked non-CBC feature in Figure 5. A patient flagged for diagnostic observation is already clinically suspected — this likely represents soft label leakage that undermines the model's validity.

Response 3: We agree with the general observation that administrative codes can introduce informational noise or label leakage. However, in our case, the code Z03 is highly non-specific, is used across almost all medical specialties, and does not serve as a proxy for a PMF diagnosis. It reflects a standard patient pathway rather than a suspicion of a specific hematological malignancy. It appears in only 19 of the 67 positive cases, compared with over 3,000 patients in the control group (Supplementary Table 2).

Furthermore, according to the feature importance (Gain) metric, the model utilizes this feature very selectively. All other parameters contributing significantly to the final decision are derived from the CBC. The overwhelming majority of the model's predictive gain is driven by biological markers, specifically RDW and PLT.

We acknowledge that administrative markers such as Z03 require careful interpretation. However, given its low frequency and the model's heavy reliance on CBC parameters, Z03 does not replace the robust biological evidence those parameters provide. To eliminate any ambiguity regarding its role, we have updated the section "LightGBM - Model Interpretability and Clinical Insights" to address this explicitly. We emphasize that while Z03 is present as a feature, the model’s performance relies predominantly on robust biological evidence from CBC parameters. Z03 serves as a contextual element rather than a replacement for the primary hematological indicators that drive the classification.

Comments 4:

No imputation was performed, but how within-record missingness for specific CBC parameters was handled during training (e.g., treated as zero, binned separately) is never stated.

Response 4: We appreciate this request for clarification. The decision to omit manual imputation was intentional, as we relied on the native handling of missing values built into the selected machine learning algorithms.

Specifically, for the gradient-boosted decision tree models (LightGBM, XGBoost, and CatBoost), the algorithms determine the optimal branching direction for missing data at each split to minimize the loss function. This means the "strategy" for missingness is learned dynamically from the data distribution itself. In the case of the Weight of Evidence (WoE) + Random Forest approach, missing values are handled through the WoE transformation. In our implementation, missing values are treated as neutral weights and are effectively integrated into the binned features without distorting the numerical scale.
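The idea of a learned branching direction for missing values can be illustrated with a minimal, library-free sketch. It is not the internal code of LightGBM, XGBoost, or CatBoost (those use gradient-based gain, not Gini impurity), and the function names, the RDW-like values, and the threshold are hypothetical. For one candidate split, the sketch routes the missing records left, then right, and keeps whichever choice gives the lower impurity.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_impurity(left, right):
    """Size-weighted Gini impurity of a candidate split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_missing_direction(values, labels, threshold):
    """Try sending missing values (None) left, then right; keep the better side."""
    left = [y for x, y in zip(values, labels) if x is not None and x <= threshold]
    right = [y for x, y in zip(values, labels) if x is not None and x > threshold]
    missing = [y for x, y in zip(values, labels) if x is None]
    scores = {
        "left": split_impurity(left + missing, right),
        "right": split_impurity(left, right + missing),
    }
    return min(scores, key=scores.get), scores


# Hypothetical toy data: an RDW-like feature with three missing entries.
values = [12.5, 13.0, 13.4, 17.8, 18.5, None, None, None]
labels = [0, 0, 0, 1, 1, 1, 1, 0]
direction, scores = best_missing_direction(values, labels, threshold=15.0)
print(direction, scores)
```

In this toy case most records with a missing value are positive, so sending them to the high-value branch gives the purer split. This is the sense in which the handling of missingness is learned from the data rather than fixed in advance.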

To maintain the clinical focus of the manuscript, we originally omitted these algorithmic details; however, we have now added the references to the primary documentation for these algorithms, which provides an exhaustive technical description of their missing-data handling mechanisms.

Comments 5:

The binary setting treats Cr as true negatives; the PU setting treats Cr as an unlabeled mixture containing hidden positives. These assumptions are mutually contradictory and used simultaneously. If Cr contains hidden positives, all binary metrics (sensitivity, precision, specificity) are invalid.

Response 5: While these frameworks rely on distinct assumptions, establishing a "common ground" is a standard approach and a practical necessity for comparative benchmarking. Binary models simplify the data by treating the unlabeled set as negative, whereas PU methods are specifically designed to model the data as it truly is - a mixture of classes.

We note that, in a real-world hospital population, identifying rare diseases is inherently a PU problem; it is practically impossible to definitively label every negative sample. Recognizing this limitation, we investigated the most suspicious cases, specifically the False Positives, to verify whether they represented true negatives or previously undiagnosed patients. Although "ideal" experimental conditions are unattainable, this does not imply that the metrics are invalid; rather, they should be viewed as informed approximations, not absolute truths.

Progress toward more accurate estimation in "noisy" environments remains a central goal of PU research. Following emerging literature, such as Vogel & Cordier (2025) “Positive and unlabeled learning from hospital administrative data: a novel approach to identify sepsis cases”, which suggests focusing on positive-case metrics like Recall@k and Precision@k in unlabeled datasets, we prioritized the Average Precision (AP). This alignment ensures that our evaluation remains robust and corresponds with modern recommendations for handling skewed, unlabeled data. We have updated the Methodology: “Model Training and Evaluation” and Discussion: “Methodological Innovations” sections to explicitly address the challenges in interpreting binary metrics within a PU context.
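For readers less familiar with Precision@k and Recall@k, the following minimal sketch shows how they are computed from a ranked list. The function name, scores, and labels are hypothetical and are not taken from the manuscript; in a PU setting the "negatives" are unlabeled records, so the precision figure should be read as a lower bound on the true value.

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall among the k highest-scoring records."""
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of records")
    # Rank records from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    total_positives = sum(labels)
    precision = hits / k
    recall = hits / total_positives if total_positives else 0.0
    return precision, recall


# Hypothetical toy ranking: 10 records, 3 labeled positives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
p_at_3, r_at_3 = precision_recall_at_k(scores, labels, k=3)
print(p_at_3, r_at_3)
```

Both quantities depend only on where the labeled positives fall in the ranking, which is why they remain interpretable when true negatives cannot be verified.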

Comments 6:

The non-standard precision-recall evaluation adopted in lines 278–280 needs citation or derivation. Choosing a metric after observing results risks circularity.

Response 6: We sincerely thank the Reviewer for this critical observation. We acknowledge that introducing non-standard evaluation methods without sufficient theoretical backing can raise concerns regarding methodological circularity.

To address this, we have revised our approach and updated the manuscript as follows:

We have replaced the trapezoid-based AUPRC with Average Precision (AP) as our primary metric. As noted in the scikit-learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score), the linear interpolation used in standard AUPRC can lead to overoptimistic estimations of model performance, especially in datasets with high class imbalance. AP is calculated as the weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight. This approach provides a more conservative and mathematically rigorous measure that avoids the pitfalls of interpolation. We have verified that this change does not significantly alter our core findings, but it ensures higher methodological correctness.
We have clarified the intent behind the binning approach. It was not intended to replace standard metrics but to provide a more transparent view of the data density. In the context of imbalanced classification with sparse labeling, standard curves often mask the true data distribution. Our proposed visualization is an alternative designed to highlight the true distribution of data, undistorted by interpolation. We grouped results into recall bins to smooth the graph and facilitate general conclusions without masking the underlying data density. Furthermore, to emphasize the uneven distribution of points, we supplemented the plots with histograms for both dimensions (recall and precision). The goal was not to make the results more "attractive," but to solve the issue of misleadingly optimistic estimations.
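The difference between the two summaries can be illustrated with a small, library-free sketch. The operating points below are hypothetical, and the curve is assumed to start at the conventional anchor (recall 0, precision 1). The size and direction of the gap depend on the shape of the curve; the example uses a sparse curve whose precision falls as recall rises, which is the situation in which linear interpolation is optimistic.

```python
def average_precision(points):
    """AP: sum over thresholds of (recall increase) * (precision at that threshold)."""
    return sum(
        (r - r_prev) * p
        for (r_prev, _), (r, p) in zip(points, points[1:])
    )


def trapezoid_auprc(points):
    """Area under the PR curve with linear interpolation between points."""
    return sum(
        (r - r_prev) * (p + p_prev) / 2.0
        for (r_prev, p_prev), (r, p) in zip(points, points[1:])
    )


# Hypothetical (recall, precision) points, sorted by increasing recall.
points = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.1)]
ap = average_precision(points)
area = trapezoid_auprc(points)
print(f"AP = {ap:.3f}, trapezoid = {area:.3f}")
```

Here AP credits each recall step only with the precision actually achieved at that threshold, whereas the trapezoid rule also credits the straight line drawn from the previous, higher-precision point.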

We believe this visualization offers valuable insight into model characteristics that standard curves often mask. However, since it is not our primary metric, we are willing to remove it entirely if the Reviewer finds it unnecessary.

Comments 7:

AUPRC of 20.31% and sensitivity of 45.52% mean the model misses over half of true PMF cases and generates roughly 6–7 false positives per true positive. The clinical workflow implications of this operating profile are not discussed.

Response 7: We appreciate the focus on the clinical workflow implications. In the context of very low prevalence (67/110 067 ~= 0.061%), the model provides over 340-fold enrichment (20.81/0.061) of PMF cases within the flagged cohort compared to random screening. It is designed not as a standalone diagnostic tool, but as a prioritization system to identify high-risk individuals for selective hematological review from a vast EHR population.
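The enrichment figure quoted above can be reproduced with a few lines of arithmetic. The sketch uses only the numbers stated in this response (67 positives, 110,067 patients, and the 20.81% value); the function name is hypothetical.

```python
def enrichment(positives, cohort_size, flagged_precision):
    """Ratio of precision in the flagged cohort to baseline prevalence."""
    prevalence = positives / cohort_size
    return prevalence, flagged_precision / prevalence


prevalence, fold = enrichment(positives=67, cohort_size=110_067, flagged_precision=0.2081)
print(f"prevalence = {prevalence:.4%}, enrichment = {fold:.0f}-fold")
```

Using the unrounded prevalence gives a value slightly above the quoted ratio of 20.81/0.061, consistent with the "over 340-fold" statement.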

The clinical burden of false positives is further mitigated by the findings of our manual review. We discovered that 37% of these "false positive" patients had already undergone specialized hematological diagnostic procedures, and 22% were independently classified by clinicians as having medium-to-high clinical risk of PMF or other myeloproliferative disorders. This suggests that many flagged patients represent individuals requiring investigation due to clinical manifestations (including hidden positives), rather than simple model failures.

In practice, we propose a two-stage workflow where the model acts as an initial filter to identify the top 1% of the population. This manageable subset is then reviewed by hematologists to determine which patients warrant invasive procedures like bone marrow biopsy versus continued observation. This approach converts the modest precision into actionable screening capacity that is sustainable within real-world diagnostic resource constraints. We have updated the Discussion section to explicitly address these workflow implications.

Comments 8:

Table 3 bolds LightGBM as best by AUPRC point estimate, yet XGBoost achieves a higher point AUPRC (21.71%) with a confidence interval so wide (2.57 - 43.36%) it is essentially uninformative. The stated rationale for choosing LightGBM shifts between stability and AUPRC without a consistent, pre-specified rule.

Response 8: We appreciate the detailed look at Table 3; however, there may have been a slight misunderstanding. In the manuscript, XGBoost is the model indicated in bold and it has the highest point estimate AUPRC (now AP).

That being said, we fully agree with the observation regarding the uninformative nature of its wide confidence interval (2.57-43.36%). In Section 3.2, we explicitly highlight this high dispersion of results. Our rationale for preferring LightGBM (despite its marginally lower point estimate) is its significantly higher stability (narrower confidence interval). In clinical screening, where consistency is crucial, a model with slightly lower but reliable performance is often preferred over one with high variance.

This decision was further reinforced after our reassessment of the Precision-Recall characteristics. As discussed in previous points, upon identifying that conventional (trapezoid-based) AUPRC was prone to overestimation, we shifted to AP and an alternative PR visualization strategy. This analysis confirmed that LightGBM outperforms its competitors. We have updated Section 3.2 to ensure our decision-making criteria (prioritizing stability over high-variance point estimates) are consistently presented.

Comments 9:

139 consensus FPs reviewed, 41 had PMF excluded by biopsy, 72 had no diagnostic data, and only 4 were confirmed PMF. Characterizing this group as predominantly "hidden positives" is not supported by the data.

Response 9: We acknowledge that our characterization could be seen as an overstatement. We have therefore refined our language to better reflect the specific findings within the False Positive cohort.

While we identified only 4 previously undiagnosed PMF cases among the 139 reviewed FPs, we emphasize that surfacing these few cases from a vast population of 110,000 (prevalence ~0.061%) is a significant finding that confirms the inherent Positive-Unlabeled (PU) nature of EHR data.

Furthermore, our analysis now highlights that 37% (51/139) of the flagged "false positive" patients had already undergone specialized hematological procedures (including biopsies) in clinical practice. This demonstrates that the model accurately targets individuals whose clinical profiles are suspicious enough to warrant invasive diagnostic workups by human clinicians, even if a final PMF diagnosis is eventually excluded.

The discussion has been updated in Section 4.4 "Changes in Interpretation of False Positive" to accurately reflect these findings and clarify the value of the PU paradigm in rare disease screening.

Comments 10:

Figure 3 is difficult to read. Overlapping ICD-10 labels in the volcano plot reduce legibility. A repel-based annotation strategy or a supplementary table with full feature statistics is needed.

Response 10: We have revised Figure 3 to ensure that the ICD-10 labels no longer overlap, significantly improving the legibility of the volcano plot.

Comments 11:

Figure 7 is non-standard and confusing. Using Venn diagrams to display confusion matrix outcomes across three model variants is unconventional. The numbers in overlapping regions are never explained. A standard tabular confusion matrix would be clearer.

Response 11: We agree that the original labeling of Figure 7 was unfortunate and potentially confusing. To address this, we have moved the standard, tabular confusion matrices for all model variants to a dedicated supplementary file (Supplement: Confusion Matrices), ensuring that the classification results are available in a conventional format. We have also updated the caption and labels of Figure 7 to clarify that its purpose is to illustrate model consensus - specifically, the overlap in predictions among the three algorithms - rather than to serve as a substitute for a standard confusion matrix. We believe this restructuring fulfills the need for clarity while retaining the insights regarding model agreement.

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have engaged in research entitled “Leveraging Ensemble Machine Learning Models for the Detection of Primary Myelofibrosis in Electronic Health Records.” After carefully reviewing the article, I suggest accepting the paper only after incorporating the following comments.

Further grammar corrections are required.
The figures' content (text and numbers) is not visible. Readability must be improved at the resubmission.
In the abstract, the problem statement must be clearly mentioned. A motivated approach would improve the readability.
Please go through the introduction section by improving the research motivation of the study. The research questions you are addressing here are confusing. Please rewrite,
The methodology section must be improved with the models used and the justification for the model selection of the study. The manuscript should clarify whether preprocessing, feature selection, discretization, and tuning were all done strictly within training folds. With such a small positive class, repeated cross-validation may overestimate performance.
The last positive cohort used in modeling comprises just 67 confirmed cases of PMF, and it is explicitly stated in the paper that these cases were recruited from only three out of the ten involved hospitals. This creates serious doubts regarding the robustness of the results, increases the likelihood of learning on site level, and undermines multi-centered generalization. It would be much more convincing if the authors employed an external validation approach, or at least a split at the hospital level.
The paper also claims that trapezoid-based AUPRC overestimates the classifier's performance and proposes a new metric based on binned precision by recall. The authors should explain in detail why their new metric makes sense and how it differs from average precision and from a trapezoid approximation.

Author Response

Comments 1:

Further grammar corrections are required

Response 1: We have conducted a thorough linguistic review of the entire manuscript. All identified grammatical errors and stylistic inconsistencies have been corrected to ensure clarity and professional academic standards.

Comments 2:

The figures' content (text and numbers) is not visible. Readability must be improved at the resubmission.

Response 2: We have thoroughly revised all figures to ensure full legibility. The content has been enlarged, and font sizes have been adjusted for better visibility. Additionally, any overlapping labels have been separated to eliminate clutter and ensure that all data points are clearly readable in the resubmitted version.

Comments 3:

In the abstract, the problem statement must be clearly mentioned. A motivated approach would improve the readability.

Response 3: Thank you for this valuable suggestion. We have revised the abstract to more explicitly define the problem statement and clarify the motivation behind our methodology. We have clarified the main goals and primary challenges. We restructured the narrative to show a clear progression: from the clinical need, through the evaluation of specialized methods, to the practical findings regarding their feasibility in real-world healthcare constraints.

Comments 4:

Please go through the introduction section by improving the research motivation of the study. The research questions you are addressing here are confusing. Please rewrite,

Response 4: We agree that the introduction needed a more robust motivation and a clearer articulation of the research questions. We have substantially revised the Introduction section to address these concerns. We now explicitly highlight the "diagnostic gap" in rare diseases like Primary Myelofibrosis (PMF), where non-specific symptoms and data sparsity in EHRs lead to significant delays in diagnosis. We also expanded on the limitations of traditional supervised learning in real-world clinical settings, such as the absence of confirmed negative labels. We have reformulated the study's objectives into two distinct pillars:

To evaluate if routinely collected EHR data (CBC and ICD-10 codes) can be used to develop a screening tool for identifying at-risk PMF patients before formal diagnosis.
To assess the effectiveness of advanced machine learning strategies, particularly Positive-Unlabeled (PU) learning (Elkan & Noto, Spy methods), in handling the inherent label bias and class imbalance of rare disease detection in EHRs.

We believe these changes provide a much stronger foundation for the study and clearly define the problems we aim to solve.

Comments 5:

The methodology section must be improved with the models used and the justification for the model selection of the study. The manuscript should clarify whether preprocessing, feature selection, discretization, and tuning were all done strictly within training folds. With such a small positive class, repeated cross-validation may overestimate performance.

Response 5: We have addressed these concerns by expanding the "Model training and evaluation" subsection within the Methods and acknowledging the study's constraints in the “Limitations and Future Directions” section of the Discussion.

We clarified that our methodology involved a comparison of four state-of-the-art ensemble algorithms: XGBoost, LightGBM, CatBoost, and Random Forest. These models were selected for their proven performance on tabular data and their inherent ability to handle non-linear relationships and sparsity without requiring extensive normalization.

Regarding the training pipeline, we now explicitly describe the global data transformations applied during preprocessing and discuss their potential implications. We also specify that all models were trained strictly on training folds to prevent data leakage. Given the limited size of the positive class (n=67), we implemented several safeguards against overfitting: we eschewed synthetic data generation in favor of aggressive regularization, strictly limited tree depth and quantity, and monitored loss curves (we have added a supplementary file containing the training and validation loss curves: Supplement - Train and Valid Loss). To ensure stability and mitigate potential optimistic bias, we evaluated the finalized configurations using repeated randomized stratified 10-fold cross-validation (100 independent validation runs). Performance metrics are reported as median values with 95% confidence intervals, with Average Precision (AP) prioritized as a conservative and robust measure for this high-imbalance, sparse-labeling scenario.
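The evaluation scheme described above (10-fold stratified cross-validation repeated 10 times, giving 100 validation runs summarized by a median and an interval) can be sketched without external libraries as follows. This is an illustrative sketch only: the function names, the toy cohort size, and the nearest-rank percentile interval are assumptions, and the manuscript's own interval computation may differ.

```python
import random
import statistics


def stratified_folds(labels, n_splits, rng):
    """Shuffle each class separately and deal its indices round-robin into folds."""
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % n_splits].append(i)
    return folds


def repeated_stratified_splits(labels, n_splits=10, n_repeats=10, seed=0):
    """Yield (train_idx, valid_idx) pairs; n_splits * n_repeats pairs in total."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = stratified_folds(labels, n_splits, rng)
        for k in range(n_splits):
            valid = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            # Any fitted preprocessing or tuning step would be fit on `train` only.
            yield train, valid


def median_with_interval(values, lower=2.5, upper=97.5):
    """Median plus a nearest-rank percentile interval (an assumed choice)."""
    ordered = sorted(values)
    last = len(ordered) - 1

    def percentile(q):
        return ordered[max(0, min(last, round(q / 100.0 * last)))]

    return statistics.median(ordered), percentile(lower), percentile(upper)


# Hypothetical toy cohort: 67 positives among 1,000 records.
labels = [1] * 67 + [0] * 933
splits = list(repeated_stratified_splits(labels))
print(len(splits), "validation runs")
```

With only 67 positives, each validation fold holds 6 or 7 positive cases, which helps explain why per-run metrics vary widely and why a median with an interval is more informative than a single point estimate.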

Comments 6:

The last positive cohort used in modeling comprises just 67 confirmed cases of PMF, and it is explicitly stated in the paper that these cases were recruited from only three out of the ten involved hospitals. This creates serious doubts regarding the robustness of the results, increases the likelihood of learning on site level, and undermines multi-centered generalization. It would be much more convincing if the authors employed an external validation approach, or at least a split at the hospital level.

Response 6: We acknowledge the Reviewer’s concern regarding the potential for site-level learning and the challenges of multi-center generalization. However, the unique epidemiological nature of PMF and the rigorous data acquisition process significantly limited the feasibility of a hospital-level split. PMF is an ultra-rare malignancy with an expected incidence below 1 per 100,000, and our cohort was identified through a rigorous manual review of 448 potential cases by multiple clinicians (Section 2.1). While 233 cases were confirmed, only 67 patients met the strict criteria for CBC data availability prior to diagnosis (Section 3.1). In seven participating hospitals, no PMF cases were found or they lacked necessary longitudinal data, necessitating a pooled multi-center approach to capture essential biological variability.

To mitigate risks associated with a small positive class and ensure robustness, we employed a highly constrained statistical framework. This includes 10-times repeated 10-fold stratified cross-validation and the reporting of 95% Confidence Intervals to transparently present result dispersion (Section 2.4). Overfitting and site-specific noise were addressed through constrained learning and strict regularization (Appendix 1), using shallow decision trees that force the model to rely on key biological features (specifically RDW and PLT) rather than hospital-specific patterns.

Quantitative evidence of this robustness is now provided in Supplementary: Train and Valid Loss, where the convergence of training and validation loss curves demonstrates that the model generalizes well without overfitting. While we agree that true external validation remains the ideal standard for future research (Section 4.5), we believe that given the extreme rarity of PMF, these results represent a valuable contribution to the field.

Comments 7:

The paper also claims that trapezoid-based AUPRC overestimates the classifier's performance and proposes a new metric based on binned precision by recall. The authors should explain in detail why their new metric makes sense and how it differs from average precision and from a trapezoid approximation.

Response 7: We have clarified the description of our binning approach in the revised manuscript, emphasizing that it serves strictly as an insight into the model's properties rather than a metric used for model selection. This approach was designed to provide transparency regarding data density and to highlight how the model performs across different recall ranges.

To further ensure that our conclusions are not biased, we have also replaced the trapezoid-based AUPRC with Average Precision (AP) as our primary evaluation metric. Unlike trapezoidal approximation, which relies on linear interpolation and can lead to overoptimistic results in imbalanced datasets, AP provides a more robust and conservative estimate by calculating the weighted mean of precisions at each threshold. This change ensures that our performance evaluation is resistant to interpolation errors and remains focused on the actual distribution of the data.

Attachment

We have addressed all the comments provided in the review process. For your convenience, we have included a ZIP archive containing the following files:

manuscript-diff.pdf: A document highlighting the changes made to the original submission.
manuscript_2026-04-24.pdf: The final, clean version of the revised manuscript.
LaTeX Source Files: The complete source code used to generate the documents.

Notes on the Revision Tracking (manuscript-diff.pdf)

The difference file was generated using the latexdiff utility. Please note the following formatting conventions:

Added text is displayed in blue with a wavy underline.
Deleted text is marked in red and struck through.

Important Technical Clarifications:

Figures: Please be advised that latexdiff does not visually track changes within figures. However, all figures have been updated in accordance with your suggestions. We recommend referring to the final manuscript (manuscript_2026-04-24.pdf) to view the updated visuals.
Reference Numbering: Due to the technical limitations of the latexdiff tool, the citation numbering in the difference file may not perfectly align with the final version. We apologize for any inconvenience this may cause and kindly ask you to use the final PDF as the primary reference for the bibliography.

We hope these revisions meet your expectations and look forward to your further assessment.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have modified the manuscript according to the comments and suggestions of the referee.

Author Response

Thank you for your valuable feedback. Your remarks helped us ensure the technical correctness of the work and greatly improved its scientific value.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors carefully addressed my comments.

Author Response

We are grateful for the reviewer’s constructive comments, which have made the paper more accurate and scientifically robust.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have carefully revised the manuscript and provided detailed responses to the reviewers’ comments. The clarity of the methodology, presentation of results, and discussion of limitations have been significantly improved. The study addresses an important and challenging problem—rare disease detection in real-world EHR data—and provides a valuable contribution to the field.

The use of ensemble learning methods combined with a discussion of positive-unlabeled (PU) learning is particularly relevant in this context, where fully labeled datasets are difficult to obtain. The inclusion of confidence intervals, repeated cross-validation, and additional clarification on evaluation metrics improves the overall rigor of the work. While the manuscript is now suitable for publication, I recommend a few minor revisions to further strengthen it:

The proposed “top 1% screening workflow” is promising. It would be helpful to briefly quantify the expected number of patients flagged in a typical hospital setting and the corresponding clinical workload.
Given the rare disease setting and partially unlabeled control cohort, a short statement emphasizing that precision, sensitivity, and specificity should be interpreted with caution in real-world deployment would improve clarity.
Although the role of ICD-10 code Z03 has been clarified, a brief sensitivity statement (e.g., acknowledging potential residual bias or future validation without administrative codes) would further strengthen the robustness of the conclusions.

Author Response

Comments 1: The proposed “top 1% screening workflow” is promising. It would be helpful to briefly quantify the expected number of patients flagged in a typical hospital setting and the corresponding clinical workload.

Response 1:

Thank you for this suggestion. While we acknowledge that quantifying the expected number of patients flagged in a clinical setting would be valuable, we believe that providing such an estimate at this stage would constitute an overstatement. A reliable estimation would require a guarantee that the cohort used in our study is representative of a broad, typical hospital population. Since our current study design and data source may not account for the high variability in patient demographics and prevalence across different healthcare settings, we have opted to focus on the model’s performance characteristics rather than specific volume projections to maintain methodological rigor.

Regarding concerns about clinical workload, we propose adding the following sentence to the manuscript in the section “Clinical Application and Screening Framework”:

“To estimate the workload for the health system, we assumed ~15 minutes of specialist time per patient, similar to a standard outpatient visit, reflecting the time needed to review EHR data. The median number of flagged patients per hospital was 15, corresponding to ~3.75 hours of additional specialist time.”

Comments 2: Given the rare disease setting and partially unlabeled control cohort, a short statement emphasizing that precision, sensitivity, and specificity should be interpreted with caution in real-world deployment would improve clarity.

Response 2: We completely agree with the reviewer’s assessment regarding the interpretation of performance metrics in the context of rare diseases and Positive-Unlabeled (PU) learning. We have already addressed this point in the current version of our manuscript in the Discussion section under the "Methodological Innovations" heading. In this section, we emphasize that the evaluation of models in such scenarios presents unique challenges, specifically noting that the use of classical binary classification metrics for PU learning carries significant interpretability risks due to the absence of fully verified true negative labels.

Following the reviewer's suggestion, we have added a statement clarifying that the reported metrics should be viewed as approximations rather than absolute ground truth. We have also explicitly stated that the classical binary classification metrics in question are precision, sensitivity, and specificity.

Comments 3: Although the role of ICD-10 code Z03 has been clarified, a brief sensitivity statement (e.g., acknowledging potential residual bias or future validation without administrative codes) would further strengthen the robustness of the conclusions.

Response 3: We appreciate your suggestion. In response, we have explicitly stated within the “Limitations and Future Directions” section that the inclusion of specific administrative ICD-10 codes, such as Z03, may introduce bias reflecting institutional or individual clinician practices, and therefore, the impact of such codes on predictions must be verified in subsequent studies.

Attachment

We have addressed all the comments provided in the review process. For your convenience, we have included a ZIP archive containing the following files:

manuscript-diff.pdf: A document highlighting the changes made to the original submission.
manuscript_2026-04-30.pdf: The final, clean version of the revised manuscript.
LaTeX Source Files: The complete source code used to generate the documents.

Notes on the Revision Tracking (manuscript-diff.pdf)

The difference file was generated using the latexdiff utility. Please note the following formatting conventions:

Added text is displayed in blue with a wavy underline.
Deleted text is marked in red and struck through.

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have improved the manuscript according to my comments.

Figure readability, presentation, and overall paper flow have improved.

Furthermore, please improve 2.3. Statistical analysis. The information in the subsection is not sufficient. If possible, move the text to section 2.2 and rename 2.2 as Data Extraction, Pre-processing, and Statistical analysis

Author Response

Comments 1: Furthermore, please improve 2.3. Statistical analysis. The information in the subsection is not sufficient. If possible, move the text to section 2.2 and rename 2.2 as Data Extraction, Pre-processing, and Statistical analysis

Response 1: In accordance with your recommendation, the previous section 2.3 has been fully integrated into section 2.2, which has been renamed "Data Extraction, Pre-processing, and Statistical Analysis." Beyond the structural reorganization, we have expanded the content of this subsection. Specifically, we have added a detailed justification for the selection of each statistical method, ensuring that the choice of tests is clearly linked to the data distribution. Additionally, we have included a rationale for the specific data visualizations employed in the study, explaining how these graphical representations accurately reflect the underlying data characteristics and facilitate the interpretation of our findings. These enhancements ensure a more transparent and comprehensive description of our analytical framework.
