Peer-Review Record

Modeling Mental Health Case-Mix for Quality Improvement—A Comparison of Statistical and AI Models

Healthcare 2025, 13(23), 3012; https://doi.org/10.3390/healthcare13233012
by Jian Gao 1,*, Tamara L. Box 2, Ting Liu 3 and Stacey L. Farmer 4
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 3 October 2025 / Revised: 14 November 2025 / Accepted: 19 November 2025 / Published: 21 November 2025
(This article belongs to the Special Issue Applications of Data Mining in Patient Care)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This study addresses an important topic in mental health analytics, comparing statistical and AI models for developing a case-mix adjustment system using large-scale Veterans Health Administration (VHA) data. While the topic is relevant and the dataset impressive, the manuscript would benefit from substantial revisions to improve methodological transparency, conceptual framing, and interpretation of results.

  1. The introduction outlines the need for better mental health case-mix systems. Still, it does not clearly define the gap in existing models or explain how the proposed framework differs substantially from previous studies, such as Sloan et al. (2006) and Tran et al. (2019). The novelty claim should be more explicitly articulated with respect to prior VHA-related modeling work.
  2. The authors should explain why these models are most appropriate for cost data and what hypotheses guided their inclusion. Models are presented descriptively.
  3. The study relies solely on age, sex, and diagnosis-based features. While the authors justify this for comparability, the exclusion of socioeconomic and clinical covariates undermines the model’s real-world applicability. The implications of this simplification should be more critically discussed.
  4. The manuscript reports R² and PVE values but fails to contextualize what constitutes a “good” predictive performance in this domain. For instance, an R² of 0.458 is described as “four times higher than reported in the literature”, yet this comparison is unsubstantiated by references or comparable baselines.
  5. Although the text mentions a 50/50 train-validation split, the procedures for hyperparameter tuning, cross-validation, and model robustness assessment are insufficiently detailed.
  6. There is no clarity on the parameterization of the Box-Cox lambda value or how retransformation bias was handled. These details are essential for transparency.
  7. The discussion section essentially restates results without sufficient critical analysis of why AI models only marginally outperformed statistical models.
  8. The limitations section acknowledges the male-dominated VHA dataset but does not provide sufficient discussion of how this demographic skew affects model applicability to broader mental health populations.
  9. The manuscript contains numerous redundant phrases and overly long paragraphs, which reduce readability. Some tables (for example, Table 2) occupy excessive space without deeper analytical integration in the text.

Author Response

Dear Reviewer 1,

Comments 1: The introduction outlines the need for better mental health case-mix systems. Still, it does not clearly define the gap in existing models or explain how the proposed framework differs substantially from previous studies, such as Sloan et al. (2006) and Tran et al. (2019). The novelty claim should be more explicitly articulated with respect to prior VHA-related modeling work.

Response 1: First, we sincerely thank the reviewers for taking the time to review our work, and we greatly appreciate their thoughtful comments. As advised, we have revised the text in the Introduction accordingly, with the changes highlighted in yellow.

Comments 2: The authors should explain why these models are most appropriate for cost data and what hypotheses guided their inclusion. Models are presented descriptively.

Response 2: Although some studies have used alternative outcome measures, such as the number of outpatient visits, in their case-mix modeling (Tran et al., 2019), the best proxy for capturing total disease severity or burden is total patient care cost (Iezzoni, Risk Adjustment for Measuring Health Care Outcomes, 2012). When modeling cost, the performance (i.e., predictive power) of statistical and AI models depends on the structure of the data, including its distributional characteristics such as spread and skewness. No single model consistently outperforms others across different datasets. For this reason, we tested four statistical and four AI models, which are commonly used in published studies, to compare their performance in predicting mental health care costs. As advised, we have revised Section 2.3 to more clearly explain our rationale for selecting these models.

Comments 3: The study relies solely on age, sex, and diagnosis-based features. While the authors justify this for comparability, the exclusion of socioeconomic and clinical covariates undermines the model’s real-world applicability. The implications of this simplification should be more critically discussed.

Response 3: We respectfully disagree with the comment that “the exclusion of socioeconomic and clinical covariates undermines the model’s real-world applicability,” as we have already cautioned readers in the manuscript: “Nonetheless, socioeconomic factors can significantly influence health status and should be considered alongside case-mix measures when analyzing staffing levels or comparing patient outcomes.”

More importantly, the objective of this study is not to provide an off-the-shelf case-mix software or algorithm. Rather, our aim is to demonstrate how predictive power can be improved by grouping patients into more homogeneous categories, and to evaluate the performance of various statistical and AI models. Readers developing case-mix models can compare their models’ baseline performance (using age, sex, and diagnoses) with ours, and may incorporate additional variables as appropriate. As more variables are added, predictive performance is expected to improve.

We would be happy to further discuss this issue if the Reviewer prefers.

Comments 4: The manuscript reports R² and PVE values but fails to contextualize what constitutes a “good” predictive performance in this domain. For instance, an R² of 0.458 is described as “four times higher than reported in the literature”, yet this comparison is unsubstantiated by references or comparable baselines.

Response 4: Although it is often suggested that “an R² of >15% is generally a meaningful value in clinical research” (Gupta et al., Academic Medicine & Surgery, 2024), we are not aware of any established standard that defines what constitutes “good” predictive performance for case-mix models.

Developing case-mix models is an iterative learning process, as no single study can produce a perfect model. Predictive performance depends on several factors, including the outcome being modeled (e.g., total cost, ED visits, or mental health care visits), the degree of variation in utilization or cost across patients, and the accuracy of the underlying data. In this study, we present our findings and compare them to the predictive performance reported in published literature.

In the second paragraph of the Discussion section, we noted that, among the studies reviewed by Tran et al., the highest reported R² was 0.112 for models analyzing concurrent total mental health care costs — the same outcome modeled in our study.

For added clarity, we have also included this benchmark R² value (0.112) in the Introduction section.
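As a quick check on the "four times higher" wording, the ratio can be computed from the two values quoted in this exchange (0.458 from the Reviewer's comment and 0.112 from the benchmark cited above). This is plain arithmetic on those two numbers, not an additional result from the manuscript:

```latex
\[
\frac{R^2_{\text{this study}}}{R^2_{\text{benchmark}}}
  = \frac{0.458}{0.112} \approx 4.09
\]
```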

Comments 5: Although the text mentions a 50/50 train-validation split, the procedures for hyperparameter tuning, cross-validation, and model robustness assessment are insufficiently detailed.

Response 5: As reported in the manuscript, in addition to the 50/50 split, we also conducted sensitivity analyses using 60/40 and 80/20 splits. These produced virtually identical results, likely due to our large sample size. Based on our decades of experience, we believe that testing additional split ratios would offer limited added value. However, if the Reviewer would like to see results from other splits, we would be happy to conduct those analyses and share the findings.

As advised, we have revised the Results section to include additional details on the data splits and hyperparameter tuning.
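The split-ratio sensitivity check described in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, ordinary least squares stands in for the eight models in the manuscript, and the helper names `r2_score` and `holdout_r2` are hypothetical.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination on a held-out set."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def holdout_r2(X, y, train_frac, seed=0):
    """Fit OLS on a random train_frac of the rows, score on the rest."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    train, valid = idx[:cut], idx[cut:]
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return r2_score(y[valid], X[valid] @ beta)


# Synthetic stand-in data (assumption): intercept plus five predictors.
rng = np.random.default_rng(42)
n = 200_000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
coef = np.array([5.0, 2.0, -1.0, 0.5, 0.0, 1.5])
y = X @ coef + rng.normal(scale=3.0, size=n)

# Compare validation R^2 across the three split ratios named in the response.
results = {frac: holdout_r2(X, y, frac) for frac in (0.5, 0.6, 0.8)}
for frac, r2 in results.items():
    print(f"train fraction {frac:.0%}: validation R^2 = {r2:.4f}")
```

With a sample this large, the validation R² should vary little across split ratios, which is consistent with the authors' remark that the three splits gave virtually identical results.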

Comments 6: There is no clarity on the parameterization of the Box-Cox lambda value or how retransformation bias was handled. These details are essential for transparency.

Response 6: As recommended by the Reviewer, we have added the value of the transformation parameter (λ = 0.548) to the Results section. To address retransformation bias, we applied Duan’s nonparametric smearing estimator (Duan, 1983, Journal of the American Statistical Association), a method commonly used in studies involving non-normally distributed data, as noted in the manuscript.
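For readers unfamiliar with the retransformation step, the sketch below shows one way a Box-Cox model can be combined with Duan's smearing estimator. Only λ = 0.548 comes from the response; the synthetic data, the OLS fit, and the function names (`boxcox`, `inv_boxcox`, `smearing_predict`) are assumptions made for illustration, and the authors' actual implementation may differ.

```python
import numpy as np

LAMBDA = 0.548  # value reported in the response; everything else is illustrative


def boxcox(y, lam=LAMBDA):
    """Box-Cox transform for strictly positive y and lam != 0."""
    return (np.power(y, lam) - 1.0) / lam


def inv_boxcox(z, lam=LAMBDA):
    """Inverse Box-Cox; the clip keeps the base of the power non-negative."""
    return np.power(np.maximum(lam * z + 1.0, 0.0), 1.0 / lam)


def smearing_predict(X_new, beta, residuals, lam=LAMBDA):
    """Duan-style smearing: back-transform (fitted value + each training
    residual) and average over the residuals for every new row."""
    linear = X_new @ beta
    shifted = linear[:, None] + residuals[None, :]
    return inv_boxcox(shifted, lam).mean(axis=1)


# Synthetic, hypothetical "cost" data generated on the transformed scale.
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
z = X @ np.array([30.0, 3.0]) + rng.normal(scale=3.0, size=n)
y = inv_boxcox(z)

# Fit OLS on the transformed outcome and keep the residuals.
beta_hat, *_ = np.linalg.lstsq(X, boxcox(y), rcond=None)
residuals = boxcox(y) - X @ beta_hat

naive = inv_boxcox(X @ beta_hat)                    # ignores retransformation bias
smeared = smearing_predict(X, beta_hat, residuals)  # smearing-corrected

print(f"mean observed cost : {y.mean():.2f}")
print(f"mean naive estimate: {naive.mean():.2f}")
print(f"mean smeared       : {smeared.mean():.2f}")
```

Because the inverse transform is convex for this λ, the naive back-transform systematically underestimates the mean on the raw scale; averaging over the residuals is what corrects for that.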

Comments 7: The discussion section essentially restates results without sufficient critical analysis of why AI models only marginally outperformed statistical models.

Response 7: Despite the impressive advances in generative AI, it has been observed that predictive AI models often do not outperform traditional statistical models, as discussed in AI Snake Oil (Narayanan & Kapoor, 2024). As noted in the manuscript, published studies in healthcare research have also confirmed that AI does not significantly outperform statistical approaches.

While the literature offers limited discussion on the reasons behind this, online discourse frequently suggests that:

  1. AI models require large volumes of high-quality data to effectively learn complex patterns, and
  2. They are particularly well-suited for capturing nonlinear and intricate relationships.

However, our dataset is both large and of high quality. A logical inference, therefore, is that in healthcare settings, and particularly in our context, the relationships between outcomes and predictors may be relatively straightforward, limiting the potential advantages of AI models.

As advised, we have revised the manuscript to include a more detailed discussion of this issue.

Comments 8: The limitations section acknowledges the male-dominated VHA dataset but does not provide sufficient discussion of how this demographic skew affects model applicability to broader mental health populations.

Response 8: Again, the objective of this study is to explore ways to improve the predictive performance of case-mix models, rather than to provide an off-the-shelf case-mix software or algorithm. As noted in the Discussion section, excluding sex from the models reduced the R² values by less than 0.01. Therefore, the male-dominated VHA dataset is unlikely to limit the applicability of the models to broader mental health populations.

Comments 9: The manuscript contains numerous redundant phrases and overly long paragraphs, which reduce readability. Some tables (for example, Table 2) occupy excessive space without deeper analytical integration in the text.

Response 9: As the Reviewer advised, we have revised the text to eliminate redundant phrases and shorten lengthy paragraphs. We have also added more detail to the description of Table 2, which presents the clinical categories that were expanded into 162 groups. If the Editor considers the table unnecessary, we are happy to remove it from the manuscript.

Thank you again for your thoughtful and constructive comments. Please feel free to contact us if any further revisions are needed.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This study aims to compare statistical and artificial intelligence models for case-mix adjustment in mental health services. Using a large patient population and advanced modeling techniques, a model with significantly higher predictive power compared to the existing literature is presented. However, some sections of the study suffer from issues such as a lack of clarity and inadequate methodological detail.

  1. The introduction section well highlights the lack of case-mix models in the literature in mental health, but the study's unique contribution could have been more clearly stated.
  2. Lines 75-80: In the Method section, the section on expanding the CCSR groups should be supported with more detailed examples.
  3. The use of total cost as the dependent variable in this study is appropriate, but it does not specify how the cost data was standardized.
  4. Lines 95-105: No details are provided regarding the hyperparameter optimization and training processes of the AI models. The hyperparameter properties of the models used in the studies provide information about how they affect the study data and how they were analyzed. Therefore, the authors should focus on this issue.
  5. Lines 123-130: The models used in the study are defined as AI models. However, in the literature, Random Forest, LightGBM, XGBoost, and CatBoost models are generally considered subgroups of machine learning. The ranking is generally defined as a subgroup of AI, ML, and subgroups of ML as RF, SVM, GB, etc. If possible, authors should express this differently based on the literature.
  6. The tables in the Results section are understandable, but a visual representation of the comparative performance of the Box-Cox and CatBoost models would be more understandable at the benchmark.
  7. Lines 125-130: The results of the sensitivity analyses are insufficiently detailed; it is unclear which changes were tested.
  8. The study's innovative nature and contribution to science are not clearly stated; this deficiency makes it difficult for the reader to understand its value.
  9. The study should be revised for language. Too much (e.g., ....) is used.

Overall, the study presents a robust methodology and compelling dataset, making it a valuable contribution to resource allocation and quality improvement efforts in mental health services. This study would have achieved the desired quality if the authors had addressed the shortcomings mentioned above effectively.

Author Response

Dear Reviewer 2,

Comments 1: The introduction section well highlights the lack of case-mix models in the literature in mental health, but the study's unique contribution could have been more clearly stated.

Response 1: First, we greatly appreciate the Reviewer’s insightful and constructive comments and advice. Following this suggestion, we have revised the introduction section to more clearly articulate the study’s unique contribution, which is highlighted in yellow.

Comments 2: Lines 75-80: In the Method section, the section on expanding the CCSR groups should be supported with more detailed examples.

Response 2:

Since the mechanism of expanding the CCSR categories was already described in the manuscript, and the specific breakdown of CCSR categories depends on the characteristics of the study population, we are not sure of the value of presenting detailed breakdown groups. For example, MBD002 (depressive disorders) included 1,544,484 patients. Applying the expansion logic described in the manuscript, we broke MBD002 into 19 categories, as shown in the Table below.

We have added a short description of this expansion example as the Reviewer advised. If the Editor prefers, we would be pleased to include this table (showing the expanded category breakdown by ICD-10) in the manuscript as well.

Comments 3: The use of total cost as the dependent variable in this study is appropriate, but it does not specify how the cost data was standardized.

Response 3:

Thank you for confirming that total cost is an appropriate outcome to model. In our analysis, the cost data were not standardized but were transformed in the log-linear and Box-Cox models. However, after fitting the models, the transformed cost values were retransformed back to the raw scale using Duan’s nonparametric smearing estimator. If the Reviewer’s question pertains to the construction of the risk score from the predicted cost, we note that several approaches can be used depending on the application. For instance, dividing the predicted cost by the mean produces a risk score centered at 1.
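The risk-score construction mentioned at the end of this response (predicted cost divided by the mean) can be illustrated with a short sketch. The predicted costs below are made-up numbers and `risk_scores` is a hypothetical helper name; the response describes this as only one of several possible approaches.

```python
import numpy as np


def risk_scores(predicted_cost):
    """Scale predicted costs by their mean so the scores average to 1."""
    predicted_cost = np.asarray(predicted_cost, dtype=float)
    return predicted_cost / predicted_cost.mean()


# Hypothetical predicted annual costs for five patients (illustrative only).
predicted = np.array([1200.0, 450.0, 9800.0, 2300.0, 760.0])
scores = risk_scores(predicted)

for cost, score in zip(predicted, scores):
    print(f"predicted cost {cost:8.0f} -> risk score {score:.2f}")
```

A score above 1 marks a patient whose predicted cost exceeds the population average, and a score below 1 marks one below it.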

Comments 4: Lines 95-105: No details are provided regarding the hyperparameter optimization and training processes of the AI models. The hyperparameter properties of the models used in the studies provide information about how they affect the study data and how they were analyzed. Therefore, the authors should focus on this issue.

Response 4: As the Reviewer advised, we have added more detailed information about the hyperparameter settings and optimization procedures in the Methods and Results sections.

Comments 5: Lines 123-130: The models used in the study are defined as AI models. However, in the literature, Random Forest, LightGBM, XGBoost, and CatBoost models are generally considered subgroups of machine learning. The ranking is generally defined as a subgroup of AI, ML, and subgroups of ML as RF, SVM, GB, etc. If possible, authors should express this differently based on the literature.

Response 5: We agree with the Reviewer’s comment; indeed, these models are typically classified as machine learning rather than artificial intelligence. However, increasingly, they are also described in the literature as “predictive AI models.” To reflect both perspectives, and following the Reviewer’s advice, we have revised the text in Section 2.3 to clarify this distinction.

Comments 6: The tables in the Results section are understandable, but a visual representation of the comparative performance of the Box-Cox and CatBoost models would be more understandable at the benchmark.

Response 6: We greatly appreciate the Reviewer’s suggestion to enhance the clarity of the results through visual presentation. While we initially felt that tables would be the most informative format for reporting predictive performance, we are open to alternative formats and would be happy to revise accordingly should the Reviewer have a preferred approach. 

Comments 7: Lines 125-130: The results of the sensitivity analyses are insufficiently detailed; it is unclear which changes were tested.

Response 7: As the Reviewer recommended, we have expanded the description of the sensitivity analyses in the Results section to clarify which parameters and changes were tested.

Comments 8: The study's innovative nature and contribution to science are not clearly stated; this deficiency makes it difficult for the reader to understand its value.

Response 8: We deeply appreciate the Reviewer’s constructive advice. We have revised the Introduction and Discussion sections to more clearly highlight the study’s novelty and scientific contribution.

Comments 9: The study should be revised for language. Too much (e.g., ....) is used.

Response 9: We have carefully revised the manuscript as the Reviewer suggested.

Finally, we would like to once again express our deep gratitude to the Reviewer for the insightful and constructive comments and advice.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors
  • The research is highly relevant, as mental illnesses are on the rise. Therefore, it is important to have models for assessing the disease burden and the severity of mental disorders to estimate the personnel needs in the healthcare system and ensure the quality of care.
  • It is positive to note that two main types of methods were used — four statistical models and four artificial intelligence models. This is scientifically valuable, as it allows for a comparison between traditional statistical approaches and more modern methods.
  • A so-called hybrid model, which combines traditional methods with modern artificial intelligence approaches, could be considered for future research to further improve predictability.
  • The large sample size (over 2 million patients) increases the reliability and significance of the study, which is a positive aspect to note.
  • The key points were presented very concisely on just a few pages, getting straight to the essence.

Author Response

Dear Reviewer 3,

Comments:

The research is highly relevant, as mental illnesses are on the rise. Therefore, it is important to have models for assessing the disease burden and the severity of mental disorders to estimate the personnel needs in the healthcare system and ensure the quality of care.

It is positive to note that two main types of methods were used — four statistical models and four artificial intelligence models. This is scientifically valuable, as it allows for a comparison between traditional statistical approaches and more modern methods.

A so-called hybrid model, which combines traditional methods with modern artificial intelligence approaches, could be considered for future research to further improve predictability.

The large sample size (over 2 million patients) increases the reliability and significance of the study, which is a positive aspect to note.

The key points were presented very concisely on just a few pages, getting straight to the essence.

Response:

We are deeply grateful to the reviewer for the thoughtful and encouraging comments. We appreciate the recognition of the study’s relevance, the large sample size, and the comparative approach between statistical and artificial intelligence models. We also value the insightful suggestion to explore hybrid modeling approaches that integrate traditional statistical methods with modern AI techniques to enhance predictive performance. This is an excellent direction for future research, which we plan to consider in our subsequent work, and we have incorporated this point into the Discussion section.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Acceptable in the present form

Author Response

Dear Reviewer 1,

Thank you very much for taking the time to review our work. We greatly appreciate your insightful and constructive comments and suggestions, which have made the manuscript more informative and easier to read.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have partially fulfilled my recommendations for this article. However, this article still has some shortcomings that would prevent its publication. First, the introduction, method, results, and conclusion sections of this article are insufficient. These sections need to be detailed. The hyperparameter properties of the algorithms should be provided in the Methods section of the study. Furthermore, a workflow diagram illustrating the methods used in the research or the stages of the study should be included in the Methods section. The article should be structured in a manner consistent with a scientific article.
I do not find it appropriate to accept the work in its current state.
However, I believe that the quality of the study will improve if the authors make changes to the sections and overall structure of the study.

Author Response

Dear Reviewer 2,

Comments 1: First, the introduction, method, results, and conclusion sections of this article are insufficient.

Response 1: We thank the Reviewer for taking the time to further review our response and the revised manuscript. We have made a sincere effort to clearly and concisely convey the study’s objectives, methods, and findings to Healthcare’s readership, as also acknowledged by Reviewer 3. That said, we agree that there is always room for improvement. If the Reviewer could kindly provide specific feedback or identify particular areas of concern, we would be more than happy to revise the manuscript further.

Comments 2: The hyperparameter properties of the algorithms should be provided in the Methods section of the study.

Response 2: It is important to clarify that the primary objective of this study is to demonstrate how the predictive power of mental health (MH) case-mix models can be enhanced by grouping patients into more homogeneous categories and by comparing the performance of various statistical and AI models. To this end, we employed established statistical and AI models as analytical tools, rather than studying the tools themselves. As noted in the manuscript, we adjusted hyperparameters to optimize predictive performance. Given the study’s focus and scope, we believe that a detailed discussion of hyperparameter properties is not typical for this type of article and falls outside its intended purpose. Nevertheless, if the Editor considers it necessary, we would be happy to provide additional details regarding the functions or properties of the hyperparameters of the models.

Comments 3: A workflow diagram illustrating the methods used in the research or the stages of the study should be included in the Methods section. The article should be structured in a manner consistent with a scientific article.

Response 3: We sincerely appreciate the Reviewer’s suggestion but respectfully disagree. This is a straightforward analytical project, and we do not believe that a workflow diagram would add meaningful value to the manuscript. However, if the Reviewer could provide an example of a workflow diagram that would be informative to readers, we would be happy to consider including it. Additionally, we would appreciate clarification regarding the comment that the manuscript does not align with the structure of a scientific article. If the Reviewer could provide a specific example or reference, we would gladly revise the manuscript accordingly.

Comments 4: I do not believe it is appropriate to accept this article in its current state. Firstly, this article utilizes AI tools extensively. Articles using AI tools often include words such as "this", "e.g.", "for example", and modal verbs (e.g., "should", "can", ...).

Response 4: The insinuation that our manuscript was generated by AI is profoundly unprofessional, irresponsible, and indicative of a complete lack of sound judgment. While the four models we employed (Random Forest, LightGBM, XGBoost, and CatBoost) are indeed machine learning or AI algorithms, we categorically affirm that no AI tools were used in the generation of text, data, graphics, study design, data collection, analysis, or interpretation.

Comments 5: The article's structure does not conform to the typical structure of a scientific article. Furthermore, it contains no scientific innovations.

Response 5: Regarding the article structure, please see our Response 3. As for scientific innovations, we have factually reported what we did and what we found. We trust the Reviewers and the Editor to evaluate the merit of the manuscript. While some of the Reviewer’s comments have perplexed us, we sincerely appreciate the time and effort dedicated to reviewing our work.
