Peer-Review Record

AI-Enhanced Qualitative Analysis in Healthcare: Unlocking Insight from Interviews of Leadership at Top-Performing Academic Medical Centers

Healthcare 2026, 14(2), 248; https://doi.org/10.3390/healthcare14020248
by Triss Ashton 1,* and Seth Chatfield 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 12 December 2025 / Revised: 9 January 2026 / Accepted: 12 January 2026 / Published: 19 January 2026
(This article belongs to the Special Issue AI-Driven Healthcare Insights)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

I have thoroughly reviewed your article titled “AI Enhanced Qualitative Analysis in Healthcare: Unlocking Insight from Interviews of Leadership at Top Performing Academic Medical Centers”. In their study, the authors investigate the use and effectiveness of LLMs in analyzing large volumes of qualitative data in health management. The authors compared traditional manual analysis methods with artificial intelligence-assisted analysis processes. I would like to thank the authors for their efforts in conducting this study. With the correction of certain shortcomings, the article could be of greater benefit to the literature. Reviewing the following suggestions would be helpful.

  1. The introduction discusses the technical advantages of text mining and LLMs, but abruptly switches to a discussion of hospitals' financial margins on line 70. This shift should be linked to an additional explanation if necessary, or removed if not.
  2. “Bias-free” is a very bold term. More appropriate terms used in the literature for LLMs could be used instead. A term like “free from human bias” could be employed.
  3. “Many of the LLMs available today can perform a variety of analyses on text data, using many or most of the known text mining methods.” The specific methods should be indicated with a source and examples should be provided.
  4. Although the Donabedian model is not mentioned at all in the introduction, the most important part of the findings is based on this model.
  5. At the end of Section 2.1, it is necessary to clearly state why the Donabedian model is included in this study.
  6. Since LSA is already based on a dimensionality reduction technique, why was it necessary to apply dimensionality reduction again using PCA?
  7. Figure 2 in the Results section should be prepared in a clearer and more understandable way. The text is illegible, and there is no detailed explanation of the figure.
  8. The claim that the Donabedian model arose spontaneously should be supported, and it would be helpful to confirm that it is not a remnant of previous prompts or guidance.
  9. The article follows an approach of testing the classification done by artificial intelligence using artificial intelligence itself. The labeling process needs to be validated by humans using randomly selected samples.
  10. The article states that it is a continuation of another study. If the exact same dataset was used, the Donabedian model is the only contribution and does not constitute sufficient novelty for the article.
  11. The limitations of the article need to be identified and included. The risk of uploading data to the cloud should be mentioned as a limitation here.
  12. Furthermore, the ethical process regarding uploading data to commercial AI systems should be detailed.
  13. The abbreviations section table contains the same explanation for COTH and IRB. Bag0of0Words is misspelled. It needs to be reviewed and corrected.
  14. Tables 2, 3, and 4 should be prepared in a more readable format.

If the article is updated in line with the suggestions, it will make a meaningful contribution to the literature.

Author Response

Comment 1: The introduction discusses the technical advantages of text mining and LLMs, but abruptly switches to a discussion of hospitals' financial margins on line 70. This shift should be linked to an additional explanation if necessary, or removed if not.

Response 1: Thank you for your observations. This clearly needs a transition. The two paragraphs beginning on line 88 (below) provide the necessary context for the discussion of the financial analysis. Furthermore, the financial analysis is now focused explicitly on the sample and population of the hospitals utilized in this paper.

To study the applications and benefits of LLMs in healthcare, we examined data collected in a study of the teaching hospitals of some of the nation’s most successful academic medical centers. These hospitals were sampled from a population of Council of Teaching Hospitals (COTH) member hospitals (n=277) based on data envelopment analysis (DEA) scores identifying hospitals that most efficiently converted inputs into outputs. Outputs measured included case-mix-adjusted discharges, outpatient visits, teaching intensity, and value-based purchasing scores (process of care, HCAHPS, and mortality). Inputs measured included labor, capital, hospital beds, and service complexity. That study generated a corpus composed of interview responses from chief nursing officers [44, 25]. The objective of the interviews was to determine, based on self-reporting, what operational aspects made these medical centers so successful.

To assess whether the sample hospitals (DEA-identified as efficient) differed financially from other COTH member hospitals, we compared five-year mean operating margins. The sample hospitals averaged 2.56% compared to -10.42% for the remaining population of COTH hospitals. Equality of variance between the two groups was assessed with the F-test for equality of variances (a parametric test) and the Moses test (a non-parametric test) [45]. Both tests indicated a statistically significant difference in variance, with p < 0.0001 for the F-test and p = 0.0033 for the Moses test, so homogeneity of variances could not be assumed. Accordingly, an independent-samples t-test with unequal variances was used to compare the mean operating margin of the sampled and non-sampled hospitals; it found a significant difference with p = 0.0102. This result suggests that the sample hospitals not only outperformed on all DEA output measures (case-mix-adjusted discharges, outpatient visits, teaching intensity, and VBP process of care, HCAHPS, and mortality), but also on operating margin.
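As an illustration of the unequal-variance comparison described above, the following minimal sketch computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the two lists of margins are hypothetical and invented for this example; they are not the study's data, and the sketch does not reproduce the F-test or the Moses test.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean; the variances are NOT pooled,
    # which is what distinguishes this test from the equal-variance t-test.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Made-up five-year mean operating margins (percent), for illustration only.
sampled = [2.0, 3.5, 1.0, 4.0, 2.5]
others = [-15.0, 5.0, -30.0, 2.0, -8.0, -20.0]

t_stat, df = welch_t(sampled, others)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The p-value would then be read from a t distribution with `df` degrees of freedom, for example with a statistics package.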

Comment 2: “Bias-free” is a very bold term. More appropriate terms used in the literature for LLMs could be used instead. A term like “free from human bias” could be employed.

Response 2: The manuscript has been revised, including the proposed language change.

Comment 3: “Many of the LLMs available today can perform a variety of analyses on text data, using many or most of the known text mining methods.” The specific methods should be indicated with a source and examples should be provided.

Response 3: The methods are reviewed in section 2.3 of the literature review; however, the paragraph in which that sentence appears in the introduction required expansion. The paragraph has been revised to include research from Microsoft that outlines a new LLM framework for analyzing text. The new TnT-LLM framework is also related to the classical text-mining methods reviewed in the preceding paragraph. As for the source, the sentence has been revised to indicate that the AIs disclose this capability during chat sessions. It now reads,

Large language models (LLMs) in artificial intelligence (AI) introduce a phenomenal alternative.  For instance, in 2024, Microsoft researchers introduced a new two-phase text analysis framework called TnT-LLM (Taxonomy Generation and Text Classification with an LLM in both phases) [30]. In TnT-LLM, phase 1 is a taxonomy generator that categorizes or clusters the data, similar to the text mining effort described above. Then, in phase 2, the LLM produces a text classification or describes the clusters, similar to the previously mentioned cluster-defining process [30].  Many of the LLMs available today self-report that they can perform a variety of analyses on text data, using many or most of the known text mining methods.  During chat sessions, those AI systems reviewed in the literature review describe or propose analysis strategies that generally follow the TnT-LLM framework.

Comment 4: Although the Donabedian model is not mentioned at all in the introduction, the most important part of the findings is based on this model.

Response 4: The second contribution statement in the introduction has been revised, disclosing the emergence of the Donabedian model from the data. It reads:

Second, from the interview data, an unexpected variant of the Donabedian model emerged, comprising 10 factors and 24 subtopics. This model essentially defines how successful medical centers operate on a daily basis. However, it is not a linear process that proceeds from structure → process → outcome, as the Donabedian model is generally described. Instead, it is a cyclical systems model, composed of several interrelated pieces.

Comment 5: At the end of Section 2.1, it is necessary to clearly state why the Donabedian model is included in this study.

Response 5: In retrospect, we agree that the model’s inclusion in the literature review was awkward as written. With the revised contribution section in the introduction as a result of comment 4, the Donabedian model should not be so jarring in the literature review. The model’s review should flow naturally and be integral to the research.

Comment 6: Since LSA is already based on a dimensionality reduction technique, why was it necessary to apply dimensionality reduction again using PCA?

Response 6: First, some background:

  • It is a bit beyond the scope of this research, but essentially, what appears to be emerging is that in the text-mining phase, the solutions, referred to as factors, can be interpreted in some instances as analogous to the literal scales used in survey instruments – they represent some expression that is in the form of a sentence, and in some instances, multiple sentences.
  • LSA uses a singular value decomposition (SVD) algorithm from linear algebra. The SVD model is $A = U\Sigma V^{T}$, whereas PCA factor analysis uses $X = TP^{T}$. In both instances, given their historical development in algebra, the solutions are referred to as “factors,” and they are performing a dimensionality reduction; however, the dimensions they expose differ.
  • In the classic dimensionality reduction of PCA, the variables (scale responses) are grouped together because, in theory, multiple scales measuring the same idea have similar scoring patterns. The data’s dimensionality is reduced, revealing the latent nature the scales were attempting to measure. The important thing to observe here is that the respondent scored a “scale,” which is a collection of words organized into a sentence that expresses an idea.
  • In contrast, in LSA text mining, large collections of organized words, the spoken or written language, are broken down into individual words, with their occurrence counted in an x-matrix. The SVD algorithm then performs dimensional reduction of the matrix, revealing the keywords that form the foundation for reconstructing a sentence or “scale” like item.
  • So, both PCA and LSA are performing dimensionality reduction. LSA reduces singular words into collections that infer something analogous to scale items; PCA reduces scale items into constructs.

To convey this idea, the following is added to section 3.3:

Mathematically, LSA uses a singular value decomposition (SVD) algorithm from linear algebra. The SVD model is $A = U\Sigma V^{T}$. Principal component analysis (PCA) factor analysis uses $X = TP^{T}$. In both cases, they perform dimensionality reduction, but the dimensions they expose are different.

In the classic dimensionality reduction of PCA, variables (scores against scales) are grouped together based on correlation because, in theory, multiple scales measuring the same idea tend to have similar scoring patterns. Dimensionality reduction reveals the latent construct structure that the scales were attempting to measure. The important thing to observe here is that the respondent scored a “scale,” where the scale is a collection of words organized into a sentence expressing an idea.

In contrast, LSA examines collections of word tokens (root terms) and groups them based on the correlations observed in their co-occurrences. LSA's dimensional reduction reveals the co-occurring term roots that form the foundation words for reconstructing a sentence or “scale” like item.

So, both PCA and LSA are performing dimensionality reduction. LSA reduces singular words into collections that infer something analogous to a scale item or a sentence; PCA reduces scale scores, where a scale is more or less a sentence, revealing constructs.
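The distinction drawn above can be sketched numerically. The example below is a minimal illustration, assuming NumPy is available: it factors a small term-by-document count matrix with a truncated SVD (the LSA step) and, separately, factors a mean-centered respondent-by-scale matrix into scores and loadings (the PCA step). Both matrices and all variable names are hypothetical and are not the study's data or the authors' actual pipeline.

```python
import numpy as np

# Hypothetical term-by-document counts (rows: term roots, columns: responses).
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 3, 1],
              [0, 0, 1, 2]], dtype=float)

# LSA step: A = U Sigma V^T, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of A

# Hypothetical respondent-by-scale scores (rows: respondents, columns: scales).
X = np.array([[5, 4, 1],
              [4, 5, 2],
              [2, 1, 5],
              [1, 2, 4],
              [3, 3, 3]], dtype=float)

# PCA step: center each scale, then factor the centered data as T P^T.
Xc = X - X.mean(axis=0)
Uc, sc, Pt = np.linalg.svd(Xc, full_matrices=False)
T = Uc * sc  # scores: respondents projected onto the components
P = Pt.T     # loadings: how each scale contributes to each component

print("term loadings on the first LSA factor:", np.round(U[:, 0], 3))
print("scale loadings on the first component:", np.round(P[:, 0], 3))
```

In the first decomposition the rows being grouped are individual term roots; in the second, the columns being grouped are whole scale items, which is the difference in exposed dimensions described above.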

Comment 7:  Figure 2 in the Results section should be prepared in a clearer and more understandable way. The text is illegible, and there is no detailed explanation of the figure.

Response 7: Figures 1 and 2 are somewhat difficult to read. The AI that generates these images seems to intentionally distort text. In preparing a revision of Figure 2, and after conducting a little more research to support some of your other comments (in particular, comment 10), we began reexamining another version of the graphical frameworks that we had discarded. After drilling down into comment 10, it became apparent that the AI’s first attempt at drawing a framework was actually more accurate. The AI’s first attempt was dismissed because, at first impression, it was, to some degree, cartoonish. That, plus the intentional misspellings, detracted from its apparent accuracy. In addition, when asked to try again, the AI produced something that was more representative of a classic behavior research model, e.g., Figures 1 and 2.

Given what we have learned during the revision process, we are removing Figures 1 and 2 and installing a new Figure 1. It aligns far better with the new section 5.3 text.

Comment 8: The claim that the Donabedian model arose spontaneously should be supported, and it would be helpful to confirm that it is not a remnant of previous prompts or guidance.

Response 8: This is an excellent point. The following text has been added to section 5.2:

The Donabedian model naturally and independently emerged from the AIs' analysis. First, neither the raw corpus nor the X-matrix used in the section 3.2 analysis was ever exposed to the AI that performed the section 4 analysis. Second, the raw source data, and therefore the X-matrix as well, do not mention the term Donabedian. Third, the analysis in section 4 was not repeated – the results reported in section 4 were generated in the first and only run of the analysis and the AI's first exposure to the factor solutions. Finally, the researchers did not suggest any interpretation at any point. In fact, as reported in section 4.3, the AI stated, “With the addition of Factor 10, the study's scope on Quality Management is complete, covering the Structure, Process, and Outcome …”

After generating the graphics in 4.4, a final prompt was sent to the AI. It read,

[A]bove you offer a Final Framework content section in which you define the elements of the framework as consisting of structure/role, process, and outcome. Can you do a literature review of these using the academic literature?

It was at this point that the AI disclosed the Donabedian model. The full chat session may be accessed via footnote 3.

Comment 9: The article follows an approach of testing the classification done by artificial intelligence using artificial intelligence itself. The labeling process needs to be validated by humans using randomly selected samples.

Response 9: Thank you for your recommendation. A new paragraph has been added to section 5.1. It reads,

To validate the AI analysis results in Section 4, the researchers independently evaluated the keyword lists provided to the AI and compared them with the labels and descriptions generated by Gemini. Since two AI systems were used, LSA and Gemini, validation also included two researchers independently reviewing samples from the source interview transcripts. The results were then compared, with any disagreements reexamined jointly. The researchers are satisfied that LSA and Gemini 2.5 performed satisfactorily.  
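A spot-check of this kind can be organized as in the following minimal sketch, which draws a reproducible random sample of items for review and then lists the items on which two reviewers disagree. The helper names, the seed, the item identifiers, and the review labels are all hypothetical; the sketch reports only simple percent agreement and is not the authors' documented procedure.

```python
import random


def draw_sample(item_ids, k, seed=0):
    """Draw k distinct items at random; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(item_ids, k))


def compare_reviews(labels_a, labels_b):
    """Return (share of items with matching judgments, list of disagreements)."""
    shared = sorted(set(labels_a) & set(labels_b))
    if not shared:
        raise ValueError("the two reviewers have no items in common")
    disagreements = [i for i in shared if labels_a[i] != labels_b[i]]
    agreement = 1 - len(disagreements) / len(shared)
    return agreement, disagreements


# Hypothetical identifiers for 24 AI-labeled items; 6 are drawn for review.
sample = draw_sample(list(range(1, 25)), k=6, seed=42)

# Hypothetical judgments: does the AI-generated label fit the source text?
reviewer_1 = {item: "fits" for item in sample}
reviewer_2 = dict(reviewer_1)
reviewer_2[sample[0]] = "does not fit"  # one disagreement to reexamine jointly

agreement, to_reexamine = compare_reviews(reviewer_1, reviewer_2)
print(f"agreement = {agreement:.2f}, reexamine items: {to_reexamine}")
```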

Comment 10: The article states that it is a continuation of another study. If the exact same dataset was used, the Donabedian model is the only contribution and does not constitute sufficient novelty for the article.

Response 10:

  1. While the paper does reuse data previously reported on, we do not view this as, nor have we intended to imply this as, a continuation of the previous analysis. The previous analysis was manual. That process took many weeks, whereas this one was performed by AIs in an afternoon. The prior analysis and reporting had a different focus and, to some extent, a different audience. It was intended for consumption by healthcare system administrators, management, and nursing leadership, was constrained or collapsed into five factors, and was presented as leadership lessons from Chief Nursing Officers.
  2. In contrast, we view this paper as offering a much broader model, but also as illustrating an alternative analysis strategy that will require, dare I say, less analyst skill and less time, and is capable of producing more significant and impactful results. Given your comment, while we feel that we have certainly addressed the illustrative portion of the call, we had not addressed the portion of the call that asks for AI-driven discovery providing transformative insights. However, we have what you are looking for; we simply became engaged with the first aspect of the call and ended up overlooking the second. To address this second part of the call, we offer the following:
    • First, in a revision of the conclusion to the introduction, we have added a paragraph. It states, “from the interview data, an unexpected variant of the Donabedian model emerged, comprising 10 factors and 24 subtopics. This model essentially defines how successful medical centers operate on a daily basis. However, it is not a linear process that proceeds from structure → process → outcome, as the Donabedian model is generally described. Instead, it is a cyclical systems model, composed of several interrelated pieces.” Before this special issue opportunity on AI came along, we had no intention of reanalyzing this dataset. Because of the opportunity, we discovered this hidden content. And the model that emerges from the subject of chief nursing officers is unique.
    • Second, several new paragraphs have been added to the discussion, section 5.2, to address this previously missed portion of the special issue call. They are as follows:

The Donabedian model naturally and independently emerged from the AIs' analysis. First, neither the raw corpus nor the X-matrix used in the section 3.2 analysis was ever exposed to Gemini during the section 4 analysis. Second, the raw source data, and therefore the X-matrix as well, do not mention the term Donabedian. Third, the analysis in section 4 was not repeated – the results reported in section 4 were generated in the first and only run of the analysis and constitute the AI's first exposure to the factor solutions. Finally, the researchers did not suggest any interpretation at any point. In fact, as reported in section 4.3, the AI stated, “With the addition of Factor 10, the study's scope on Quality Management is complete, covering the Structure, Process, and Outcome …”

After generating the graphics for section 4.4, a final prompt was sent to the AI. It read,

[A]bove you offer a Final Framework content section in which you define the elements of the framework as consisting of structure/role, process, and outcome. Can you do a literature review of these using the academic literature?

It was at this point that the AI disclosed the Donabedian model. The full chat session may be accessed via footnote 3.

There appears to be a persistent tendency in the literature to oversimplify Donabedian’s conceptualization of healthcare quality by treating it as a reductionist, linear framework of structure, process, and outcomes rather than as an integrated system [46, 47, 48, 32, 49]. Evidence of this fragmented interpretation can be found in McCullough et al.’s 2023 [14] systematic review of primary healthcare nursing service evaluations using the Donabedian model. Their finding that most of the thirty-two reviewed studies focused primarily on outcomes to the exclusion of structures and processes suggests a tendency to fragment what Donabedian intended as an integrated conceptualization [14].

Berwick and Fox emphasize that “Donabedian was far from a reductionist” [50]. In his 1989 article, Berwick argued that measuring quality must include the interplay among structure, process, and outcomes [51]. Donabedian again articulated this integrated conceptualization when reflecting on the failure of American organizations to replicate Japanese quality improvement success, noting that quality is never the product of a single isolated intervention but emerges from “a whole constellation of factors” [52].

The present research provides empirical support for this more complex interpretation of the model. Importantly, the Donabedian model did not guide the study's design. No interview questions or interview responses explicitly mentioned the Donabedian model. Yet the fundamental dimensions of structure, process, and outcomes emerged naturally from participants’ narratives. This organic emergence strengthens the argument that the Donabedian model captures foundational dimensions of quality that manifest even when they are not explicitly sought.

Further, the statistical findings reinforce this interpretation. Cross-loadings observed in Table 3 among Factors 3, 4, and 6 reveal meaningful empirical overlap across what traditionally might be categorized as outcomes, processes, and structures.

    • Outcomes: Factor 3 – ‘Healthcare Quality Metrics and Data Infrastructure.’
    • Processes: Factor 4 – ‘Interprofessional Communication and Data-Driven Improvement.’
    • Structure: Factor 6 – ‘Systemic Learning and Knowledge Infrastructure.’

Rather than appearing as isolated constructs, these domains demonstrate statistical interdependence. This pattern directly reflects Donabedian’s insistence that quality should be understood as “…an unbroken chain of antecedent means followed by intermediate ends which are themselves the means to still further ends” [12 p. 694]. In these high-performing academic medical centers, the components of structure, process, and outcomes do not function as sequential stages; instead, they operate as simultaneously active, mutually reinforcing dimensions within a single integrated quality system.

Taken together, these findings suggest that quality improvement in successful healthcare organizations is not the product of optimizing isolated elements but instead emerges from dynamic interactions among structures, processes, and outcomes working in continuous concert. What makes this particularly compelling is that this ‘systems level’ interpretation emerged from chief nursing officers’ reflections without being theoretically imposed, indicating that such integration may represent how quality is actually enacted in practice rather than merely how it is theoretically idealized. Accordingly, this study not only challenges the persistent reductionist interpretation of the Donabedian model but also provides empirically grounded insight into how its components function as a living, recursive system within real-world, high-performing healthcare environments. The factors and themes identified here, therefore, offer a meaningful foundation for future work in healthcare quality management and for a renewed conceptualization of Donabedian’s framework as a complex system of feedback, adaptation, and mutual reinforcement rather than a linear evaluative tool.

Comment 11: The limitations of the article need to be identified and included. The risk of uploading data to the cloud should be mentioned as a limitation here.

Response 11: A new limitations section has been added. Its introduction and upload risks subsection read as follows:

6. Limitations

This work is not without limitations. The data only included observations from chief nursing officers. Input from nursing subdepartment leaders, floor charge nurses, and administrators outside the nursing area could reveal unexpected facets of the model. The data were also collected exclusively from teaching hospitals with COTH membership. Successful proprietary organizations might also contribute to the model.

Given that the data were collected in COTH member organizations and that the interview questions included specific COTH references, applying the results to other hospitals could prove challenging. With that said, the main model is grounded in quality improvement and might prove valuable.

In the analysis, the interview data were clustered with LSA. Sometimes, another text mining method (LDA, pLSA, NMF, k-means, etc.) might yield different results; however, given the narrow scope of the corpus (interview responses by CNOs), the likelihood of a radical difference is remote.

6.1 Upload Risks

The AI results (section 4) were obtained exclusively from Gemini 2.5. While the article was being written, Gemini 3.0 was deployed. No attempt was made to run the data in 3.0 or in another LLM. Since the Gemini models had not previously been exposed to the data, the AI solutions reported are as pristine as possible. In part, we were concerned about Google’s efforts to improve its models. Per Gemini, “in the free tier of Gemini, that data [uploaded data] is stored on Google’s secure servers. However, how it is handled depends entirely on your settings.” The default retention period is 18 months. Further, the data may be used to improve “machine learning technologies,” and samples from the uploaded data could be passed to a human for review. Note that each AI is different. Microsoft's Copilot reports, “I don’t personally store your data or keep a memory of it unless you explicitly ask me to remember something.” So, care must be exercised, particularly if the data is sensitive.

Comment 12: Furthermore, the ethical process regarding uploading data to commercial AI systems should be detailed.

Response 12: A new subsection is added to the limitations section. It reads as follows:

6.2 Ethical Considerations

There are ethical considerations with using publicly available AIs. Claude provides a nice outline. That full chat session can be reached using the URL in the footnote.

Core Ethical Principles

    • Consent and Privacy: Before uploading any data, ensure you have proper authorization. Consider whether the data contains sensitive information like health records, financial details, or personally identifiable information (PII).
    • Data Minimization: Only upload what's necessary for your specific purpose.
    • Understanding Terms of Service: Carefully read the AI provider's terms regarding data handling.

Organizational Considerations

    • Internal Policies: Many organizations have established protocols for what data can be shared with third-party services.
    • Risk Assessment: Evaluate potential harms if the data were to be exposed or misused.
    • Intellectual Property: Be cautious about uploading copyrighted materials, trade secrets, or proprietary code that could compromise competitive advantages or violate licensing agreements.

Comment 13: The abbreviations section table contains the same explanation for COTH and IRB. Bag0of0Words is misspelled. It needs to be reviewed and corrected.

Response 13: Thank you for catching this. It has been corrected.

Comment 14:  Tables 2, 3, and 4 should be prepared in a more readable format.

Response 14: The tables have been revised. Tables 2 and 4 are now set inside text boxes and rotated 90 degrees. The tables are not discussed in detail. Therefore, Table 2 has been reduced to display term roots for factors 1-10, 22, 23, and 24. The intermediate factors 11-21, which have been collapsed in this table, are fully shown in Table 4.

There is a lot here. The full reviewer response report is attached and might be more readable, as new manuscript text appears there in green characters, as it does in the manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

I appreciate the opportunity to review your manuscript, whose aim is to analyze how freely available large language model (LLM) artificial intelligence (AI) tools can efficiently and effectively analyze qualitative healthcare data and uncover aspects missed by traditional manual analysis.

I believe this research is of interest to the current scientific community, timely, and relevant.

Below, I will offer a series of comments/suggestions to try to improve the quality of your manuscript:

I think it would be interesting to include more information in the Introduction section about the benefits of using AI in data analysis in the healthcare sector or other disciplines, perhaps by showing other studies that have already performed qualitative analysis using these tools.

It is suggested that tables not be cut off when moving from one page to the next, so that they are easier for the reader to interpret.

In the study, you include as an objective to explore interviews with chief nursing officers at some of the most successful academic medical centers in the country. The objective of the interviews was to discover what made those hospitals so successful, as perceived by the nursing leaders. To address this objective, it would be interesting to include in the Results section literal extracts from the interviews, quoted verbatim, not just the topics discussed, to better understand what the interviewees meant when asked about the success of their hospitals. Furthermore, I believe it would be advisable to include in the Discussion section more bibliographic references to other studies that have analyzed these quality issues, in order to compare them with the existing literature.

Inclusion of limitations, as well as future lines of research, is recommended.

Thank you very much.

Author Response

Please see the attached, beginning on page 11. For reviewer 1, I loaded the comments and responses in this window and then ended up providing an attachment because there was so much material as to make the window difficult to review. So I did not intend to offend, but to be effective.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

After reviewing the revised article, I see that the authors have put in a great deal of effort, and the article has reached a level where it will make a significant contribution to the literature. I congratulate the authors for their hard work.
