Review Reports
- Weihao Huang 1,†,
- Yiyang Wu 1,† and
- Xueling Yang 1,3,*
- et al.
Reviewer 1: Ariana Cordos
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- Brief Summary (one short paragraph)
This manuscript reports a single-blind randomized controlled trial comparing AI-delivered online cognitive behavioral therapy (AI‑CBT), human-delivered online CBT (participants told it was AI), and a no-intervention control in young adults with anxiety. Using a mixed-methods approach, the authors assess changes in anxiety, sleep quality, exercise self-efficacy, perceived psychotherapy benefit, and qualitative user experiences over a four-week intervention. The study’s main strengths lie in its innovative deception-based comparison design, adherence to CONSORT-AI reporting, and the integration of quantitative and qualitative findings. The work contributes valuable empirical evidence on the current limitations of AI-delivered psychotherapy, particularly regarding emotional responsiveness and perceived benefit.
- General Comments on Scientific Content
Overall assessment
- The manuscript is clear, relevant, and well structured, following a logical IMRAD format.
- The topic is highly relevant to digital mental health and AI-assisted psychotherapy.
- The mixed-methods design is appropriate and strengthens interpretability.
Scientific soundness and hypothesis testing
- The hypotheses are clearly stated and testable.
- The randomized controlled design is appropriate; however, the exploratory nature of the study is not always sufficiently emphasized in the interpretation of outcomes, especially in the Results and Discussion.
References
- The majority of references are recent (last 5 years) and relevant.
- Self-citation does not appear excessive.
- Some foundational references (e.g., CBT mechanisms) are older but acceptable given their canonical status.
Reproducibility
- The Methods section is detailed overall.
- Reproducibility is limited by:
- Lack of a fully predefined intervention protocol (non-manualized CBT).
- Absence of intention-to-treat analysis.
- Limited transparency regarding chatbot prompt evolution during sessions.
Figures and tables
- Tables and figures are appropriate and generally well explained.
- Statistical methods are mostly appropriate and clearly reported.
- Some effect sizes are reported, but clinical relevance could be discussed more explicitly.
Ethics and data availability
- Ethical approval, informed consent, and debriefing are clearly described.
- Data availability statement is adequate and justified given sensitivity.
- Consolidated Scientific Comments and Chapter-Level Evaluation
Introduction
Strengths
- The rationale for comparing AI-delivered CBT with both human-delivered CBT and a no-intervention control is strong and well justified.
- The manuscript is well positioned within the current AI and CBT literature, with clearly articulated aims and hypotheses.
- Epidemiological framing and clinical relevance are clearly established.
Opportunities for improvement
- The conceptual distinction between AI-delivered CBT and human-delivered CBT presented as AI should be introduced more explicitly earlier in the Introduction, as it is central to the interpretation of results.
- The deception-based design could be highlighted sooner to improve conceptual clarity.
- The research gap could be sharpened by emphasizing mechanisms of action and user experience, not only comparative effectiveness.
Methodology
Strengths
- The study adheres to CONSORT and CONSORT-AI reporting guidelines.
- Recruitment procedures, instruments, and interventions are described in detail.
- Ethical safeguards, participant safety protocols, and debriefing procedures are thoughtfully implemented.
Opportunities for improvement
- Lines 215–217: The absence of a structured, session-by-session CBT protocol limits internal validity; clarification is needed on how treatment fidelity was monitored beyond supervision.
- Line 407: The lack of an intention-to-treat analysis should be more explicitly justified, ideally including a sensitivity discussion.
- AI intervention description (Lines 229–256): It remains unclear whether the prompt engineering framework was static or adaptive across sessions and participants, which affects reproducibility.
- The innovative participant blinding strategy raises ethical and interpretive considerations that could be more critically acknowledged.
- The implications of using non-manualized CBT should be discussed more explicitly.
Results
Strengths
- Primary and secondary outcomes are clearly reported.
- Statistical analyses are appropriate and transparently described.
- Quantitative and qualitative findings are effectively integrated.
Opportunities for improvement
- Table 2: The plateau effect observed in the AI group is a key finding and could be further supported by explicitly reporting interaction contrasts or slope comparisons.
- GEE analyses (Table 3): Weekly measurements are appropriate; however, a brief justification for choosing GEEs over mixed-effects models would strengthen methodological transparency.
- Greater emphasis on effect sizes and clinical relevance would complement statistical significance.
- Clarification is needed on whether observed changes meet minimal clinically important differences.
- The frequency counts used in qualitative analyses may give a quasi-quantitative impression; their interpretive purpose should be clarified.
- Redundancy between quantitative and qualitative result descriptions could be reduced.
Discussion
Strengths
- The discussion integrates quantitative and qualitative findings in a balanced and coherent manner.
- There is strong linkage between identified qualitative themes and quantitative outcomes.
- Study limitations are acknowledged transparently.
Opportunities for improvement
- Interpretations regarding AI limitations (e.g., empathy deficits) could be more cautiously framed as user-perceived limitations rather than inherent technological constraints.
- A clearer distinction between exploratory findings and confirmatory claims would strengthen interpretive rigor.
- Implications for clinical practice could be more clearly separated from speculative future developments.
- Discussion could be expanded to address implications for hybrid AI–human care models.
Conclusions
Strengths
- Conclusions are consistent with the presented evidence.
- The manuscript remains appropriately cautious regarding current AI capabilities.
Opportunities for improvement
- The claim that AI-CBT may serve as a “first-line support” should be framed more conservatively, given the lack of superiority over the control group.
- The exploratory nature of the study should be explicitly restated.
- Directions for methodologically stronger, prospectively registered confirmatory trials should be highlighted.
Author Response
Response to Reviewer #1
We are very grateful to Reviewer #1 for the positive, thoughtful, and highly constructive feedback. The suggestions have been invaluable in helping us enhance the clarity, rigor, and interpretative depth of our manuscript. We have carefully addressed all points as detailed below.
- Regarding the Study's Exploratory Nature and Methodological Rigor
Comment: The reviewer noted that "the exploratory nature of the study is not always sufficiently emphasized" and pointed out the "Absence of intention-to-treat analysis."
Response:
We thank the reviewer for these critical points. We completely agree and have made two fundamental revisions throughout the manuscript.
- First, we have thoroughly revised the manuscript to consistently and transparently frame it as an exploratory, hypothesis-generating study, as per the reviewer's suggestion. This is now explicitly stated in the Abstract (Page 1, Lines 35-41), Introduction (Page 3, Lines 125-127), Methods (Page 5, Lines 174-179), Discussion (Page 17, Lines 661-664), and Conclusions (Page 18, Lines 697-701).
- Second, we have now conducted a full intention-to-treat (ITT) analysis for our primary outcome and have made this the primary analysis, as suggested. The Data Analysis section (Page 10, Lines 417-422) has been updated to detail this approach, and the Results section (Page 11, Lines 456-470) now presents the ITT findings, along with a per-protocol sensitivity analysis. As a direct consequence of these more conservative ITT findings, we have also substantially rewritten our Discussion and Conclusions (Pages 15-16, Lines 582-595; Pages 17-18, Lines 690-701) to reflect a more cautious and balanced perspective, directly addressing the reviewer's later comment on this topic.
- Regarding the Clarity of the Introduction and Intervention Description
Comment: The reviewer suggested that the "deception-based design could be highlighted sooner," the "research gap could be sharpened by emphasizing mechanisms," and that there was "Limited transparency regarding chatbot prompt evolution."
Response:
We thank the reviewer for these excellent suggestions to improve the framing of our study.
- We have revised the final paragraph of the Introduction (Page 3, Lines 122-123) to more explicitly highlight our innovative deception-based design and to frame our research gap around "mechanisms of action and user experiences," not just comparative effectiveness.
- We have also clarified in the Methods section (Page 6, Lines 227-234) that the AI's prompt engineering framework was static, not adaptive, throughout the study.
- Regarding the Discussion of Clinical Relevance and Other Findings
Comment: The reviewer recommended that "clinical relevance could be discussed more explicitly" and that "Interpretations regarding AI limitations... could be more cautiously framed as user-perceived limitations."
Response:
These are invaluable points that have significantly improved our discussion.
- We have added a new paragraph in the Discussion section (Page 16, Lines 596-602) that is dedicated to analyzing the clinical relevance of our findings, using the concept of the Minimal Clinically Important Difference (MCID).
- We have carefully revised the Discussion section (Page 16, Lines 605-608, 610-611, 621-623) to ensure that all interpretations of the AI's limitations (e.g., empathy, formulaic responses) are framed as "participant-perceived" limitations, reflecting the subjective nature of our qualitative data.
- Regarding the Discussion and Conclusions Structure
Comment: The reviewer suggested that "Implications for clinical practice could be more clearly separated from speculative future developments," recommended expanding on "hybrid AI-human care models," and advised framing the "first-line support" claim more conservatively.
Response:
We thank the reviewer for this excellent suggestion on improving the structure of our discussion. To more clearly separate implications for current clinical practice from speculative future developments, we have refined the structure of our Discussion section.
- The discussion of immediate clinical implications, such as the potential role for AI within a "stepped-care model", is now presented alongside our main findings (Page 16, Lines 648-653).
- Speculative future developments and directions for future research are now primarily consolidated within the Limitations section (Page 16, Lines 654-689), where each limitation naturally points towards a future research avenue (e.g., the need for prospectively registered trials, exploring different AI models, etc.).
- Finally, we have completely rewritten the Conclusions section (Page 18, Lines 690-701) to adopt a much more conservative tone, reflecting the ITT results. We no longer frame AI as "first-line support" but rather as a "scalable support tool" whose claims of clinical efficacy require significant caution.
We believe this revised structure creates the clearer distinction that the reviewer suggested.
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript presents a timely and methodologically ambitious randomized controlled trial comparing AI-delivered CBT to human-delivered CBT (with deception) and a no-intervention control for young adults with anxiety. While the topic is highly relevant and the mixed-methods design is a strength, the study suffers from fundamental issues that undermine the validity and interpretability of its primary conclusions. The most serious concern is the discrepancy between the stated pre-registration and the post-hoc application of specific inclusion/exclusion criteria to a larger recruited sample. This introduces a high, unquantifiable risk of selection bias, effectively transforming the study from a confirmatory trial into an exploratory, hypothesis-generating analysis. Consequently, the authors’ framing of the results, particularly the conclusion that AI-CBT was not superior to control, must be substantially tempered. Without a prospectively registered protocol for this specific analysis, the reported effect sizes and p-values cannot be interpreted as confirmatory evidence. The manuscript requires major revisions to address this foundational limitation, along with several other methodological and reporting shortcomings.
Abstract and Introduction
The abstract accurately summarizes the study’s design and primary findings but overstates the clinical implications given the methodological limitations noted above. Specifically, the statement that “AI’s clinical efficacy was not superior to a no-intervention control group” should be qualified as an exploratory finding requiring replication. The introduction provides a competent review of the literature on CBT, barriers to care, and AI chatbots. However, it lacks a clear statement of the exploratory nature of this trial given the post-hoc selection procedure. The authors should explicitly acknowledge early in the introduction that this analysis was not pre-specified, reframing the study’s objectives as hypothesis-generating rather than hypothesis-testing. Additionally, the rationale for using peer counselors rather than licensed therapists in the human comparison group is mentioned but not critically examined—this choice substantially limits the generalizability of comparisons to human-delivered CBT as typically practiced.
Methods – Study Design and Participant Selection
The methods section reveals several critical flaws. The description of participant selection is particularly problematic: the authors state that participants were drawn from a larger pre-registered pool through “secondary screening” using criteria not prospectively recorded. This is not merely a minor deviation—it fundamentally alters the interpretability of all subsequent analyses. The authors must provide a detailed justification for why this anxiety-focused subsample was not pre-specified, including the date and method of determining the inclusion/exclusion criteria relative to data access. Without transparency on whether outcome knowledge influenced criterion selection, readers cannot rule out bias. The sample size calculation, while technically correct, is undermined by this selection issue; the authors should recalculate or justify the achieved power given the actual analytic sample. The inclusion cutoff of GAD-7 >5 (mild anxiety) is also questionable, as this threshold may capture subclinical distress rather than clinically significant anxiety, limiting the clinical relevance of findings.
Methods – Interventions and Blinding
The intervention design contains several elements that warrant critical scrutiny. The human group’s instruction that their therapist was AI, combined with peer counselors trained to simulate AI communication styles, introduces an unnatural therapeutic dynamic that likely constrained the very qualities (empathy, flexibility, personalized responsiveness) that differentiate human from AI therapy. The qualitative data confirm this, with participants noting mechanization of the human therapists’ responses. This design choice, while intended to control for expectancy effects, may have systematically attenuated the human intervention’s effectiveness, biasing the comparison against the AI group. The authors should discuss this as a major limitation rather than a methodological strength. Furthermore, the AI chatbot used ChatGPT-3.5, which by the time of manuscript submission had been superseded by more advanced models; the authors should specify the exact version and date of the model’s knowledge cutoff. The lack of a structured, session-by-session protocol for either intervention, while described as allowing flexibility, introduces substantial variability that reduces replicability.
Methods – Outcomes, Blinding, and Data Analysis
The blinding procedures are described but with important gaps. While participants were told both interventions were AI, no formal manipulation check was administered to all participants at study completion; the pilot check with six participants is insufficient to confirm that blinding was maintained throughout the main trial. The authors should report the proportion of main trial participants who, during debriefing, expressed suspicion about their therapist’s identity. The decision not to use intention-to-treat analysis is a serious flaw, as it assumes missing data are completely at random—an assumption violated if attrition was related to treatment dissatisfaction or worsening symptoms. Given the low attrition (3.3%), ITT should be feasible and must be performed. The use of GEEs for secondary outcomes while using repeated-measures ANOVA for the primary outcome is inconsistent; the authors should justify why a single analytic framework (e.g., mixed models or GEEs for all longitudinal outcomes) was not applied. The qualitative analysis, while informative, does not specify how many coders were used, whether inter-rater reliability was calculated, or how disagreements were resolved.
Results
The quantitative results are presented clearly, but their interpretation is compromised by the selection bias risk. The within-group reduction in the AI group (p=0.003) alongside non-significance compared to control (p=0.011 for the between-group effect at post-intervention, with the AI-control pairwise comparison not reaching significance) is a nuanced pattern that the authors handle reasonably. However, the claim that the AI group showed a “plateau effect” is not directly tested; a formal test of linear versus quadratic trend would strengthen this assertion. The GEE results for secondary outcomes show no group-by-time interactions, which is correctly interpreted as comparable effects. However, the authors do not report whether these GEE models were adjusted for baseline values or used an exchangeable correlation structure; these details should be added. The PBS difference between groups (t=-3.55, p<0.001) is substantial, but the scale’s psychometric properties in this sample are not reported.
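The linear-versus-quadratic trend test requested above can be illustrated with orthogonal polynomial contrasts. The following is a minimal sketch, not the authors' analysis: it assumes three equally spaced assessments (baseline, week 2, week 4), and the scores in `rows` are hypothetical values invented for illustration. A quadratic contrast that differs reliably from zero would indicate curvature, such as an early drop followed by a plateau.

```python
import math
from statistics import mean, stdev

# Orthogonal polynomial contrast weights for three equally spaced time points.
LINEAR = (-1, 0, 1)      # overall rise or fall across the intervention
QUADRATIC = (1, -2, 1)   # curvature, e.g. early change that then levels off


def contrast_scores(rows, weights):
    """Return one weighted sum per participant (row of repeated scores)."""
    return [sum(w * y for w, y in zip(weights, row)) for row in rows]


def one_sample_t(values):
    """t statistic for H0: mean contrast score = 0 (df = n - 1).

    Requires at least two values with non-zero variance.
    """
    n = len(values)
    return mean(values) / (stdev(values) / math.sqrt(n))


# Hypothetical GAD-7 scores (baseline, week 2, week 4) for five participants.
rows = [(12, 8, 8), (10, 7, 5), (14, 9, 10), (11, 8, 8), (13, 9, 8)]

t_linear = one_sample_t(contrast_scores(rows, LINEAR))
t_quadratic = one_sample_t(contrast_scores(rows, QUADRATIC))
print(f"linear t = {t_linear:.2f}, quadratic t = {t_quadratic:.2f}")
```

Each statistic would be compared against a t distribution with n - 1 degrees of freedom. The sketch covers a single group only; comparing curvature between groups would need a group-by-trend interaction term.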
Discussion and Conclusions
The discussion appropriately acknowledges many limitations, but the framing of the primary conclusion remains too strong given the pre-registration deviation. The authors state that “AI-CBT was not found to be superior to a no-intervention control”—this should be rephrased as “in this exploratory, post-hoc defined subsample, AI-CBT did not demonstrate statistically superior effects compared to control, a finding that requires prospective replication.” The discussion of the plateau effect and the AI’s limitations in empathy and personalization is well-supported by the qualitative data and represents the manuscript’s strongest contribution. However, the authors speculate about mechanisms (e.g., reduced motivation, formulaic questioning) that were not measured; they should either add process measures or temper these claims. The comparison of peer counselors to AI is also overgeneralized to “human-delivered therapy” throughout; the discussion should consistently refer to “human peer counselor-delivered CBT” rather than implying equivalence to licensed psychotherapists.
Specific Section-by-Section Recommendations for Revision
For the abstract, revise the conclusion to state explicitly that findings are exploratory and require prospective replication. In the introduction, add a paragraph acknowledging the post-hoc nature of this analysis and reframe study aims as hypothesis-generating. In the methods, provide a detailed chronology of when the secondary screening criteria were established relative to data access and analysis; report a formal manipulation check for the main trial; replace per-protocol analysis with ITT; and specify the exact AI model version. For the results, report sensitivity analyses comparing ITT to per-protocol findings; add trend tests for the plateau effect; and provide qualitative coding reliability metrics. In the discussion, consistently refer to “human peer counselors” rather than “human therapists”; add a paragraph on how the deception and AI-simulation training may have constrained the human intervention; and explicitly recommend a prospectively registered replication trial as the necessary next step. The conclusion should be rewritten to emphasize that AI-CBT shows within-group promise but that the current evidence does not support claims of non-inferiority or clinical equivalence to human-delivered care.
Author Response
Response to Reviewer #2
We sincerely thank Reviewer #2 for the thorough, critical, and exceptionally insightful feedback. The comments have pushed us to fundamentally re-evaluate our methodological framing and interpretation, and we believe the manuscript is substantially more rigorous and transparent as a result. We have addressed all major and minor points as detailed below.
Comment 1: "The most serious concern is the discrepancy between the stated pre-registration and the post-hoc application of specific inclusion/exclusion criteria to a larger recruited sample. This introduces a high, unquantifiable risk of selection bias, effectively transforming the study from a confirmatory trial into an exploratory, hypothesis-generating analysis... The authors should explicitly acknowledge early in the introduction that this analysis was not pre-specified, reframing the study's objectives as hypothesis-generating rather than hypothesis-testing."
Response to Comment 1:
We thank the reviewer for this crucial and insightful critique. We completely agree that the post-hoc nature of our sample selection is a fundamental limitation that positions our study as exploratory rather than confirmatory. We have taken this comment very seriously and have thoroughly revised the entire manuscript to consistently and transparently frame the study as an exploratory, hypothesis-generating investigation.
Specifically, we have made the following key revisions:
- In the Introduction: We have added a new, dedicated paragraph (Page 3, Lines 125-127) to declare the study's methodological context upfront. This paragraph now explicitly states: "However, it is essential to note that this trial constitutes a post-hoc, exploratory secondary analysis of a subsample from a larger pre-registered project. This fundamentally positions our objectives as hypothesis-generating."
- In the Methods section: To enhance transparency, we have provided a detailed chronology of when the secondary screening criteria were established relative to data analysis (Page 4, Lines 174-179). We clarify that these criteria were finalized prior to any formal data analysis or accessing of the outcome data.
- In the Discussion (Limitations section): We have strengthened the discussion of this issue, making it the first and foremost limitation (Page 16, Lines 661-664). We now use stronger language to emphasize the "high and unquantifiable risk of selection bias" and the need for "significant caution" in interpreting the findings.
- In the Abstract and Conclusions: We have consistently qualified our findings as "exploratory" and "preliminary," and have explicitly recommended the need for a "prospectively registered confirmatory trial" to validate our results (Page 1, Lines 35-41; Page 18, Lines 697-701).
We believe these comprehensive revisions fully address the reviewer's concerns regarding the study's framing.
Comment 2: "Additionally, the rationale for using peer counselors rather than licensed therapists in the human comparison group is mentioned but not critically examined—this choice substantially limits the generalizability of comparisons to human-delivered CBT as typically practiced."
Comment 3: "The human group's instruction that their therapist was AI, combined with peer counselors trained to simulate AI communication styles, introduces an unnatural therapeutic dynamic that likely constrained the very qualities... that differentiate human from AI therapy. ... The authors should discuss this as a major limitation rather than a methodological strength."
Response to Comments 2 & 3:
We thank the reviewer for these two related and highly insightful comments. We agree completely that our choice to use peer counselors and the deception-based design are significant methodological features that limit the generalizability of our findings and warrant critical discussion. We have made two major revisions to address these points.
First, to address the issue of generalizability (Comment 2), we have carefully revised the manuscript throughout to use more precise terminology.
- We have replaced general terms like "human therapists" or "human-delivered therapy" with the specific and accurate terms "peer counselors" or "peer-counselor-delivered CBT" wherever appropriate. This change has been systematically applied in the Abstract (e.g., Page 1, Line 19), Introduction (e.g., Page 3, Line 129), Discussion (e.g., Page 16, Lines 596, 601), and Conclusions (e.g., Page 18, Line 693).
Second, in direct response to the reviewer's point about the unnatural therapeutic dynamic (Comment 3), we have added a new, dedicated discussion of this issue as a major limitation.
- We now explicitly acknowledge in the Limitations section (Page 17, Lines 667-678) that the deception-based design and AI-simulation training likely constrained the effectiveness of the peer-counselor intervention. We discuss how this may have suppressed qualities like spontaneous empathy and rapport-building, and state that: "The qualitative data, where some participants in the peer-counselor group noted a 'mechanistic' style, supports this interpretation. Therefore, the observed efficacy difference between the two intervention groups might represent an underestimation of the true potential gap between genuine human therapy and current AI capabilities."
Comment 4: "The decision not to use intention-to-treat analysis is a serious flaw, as it assumes missing data are completely at random—an assumption violated if attrition was related to treatment dissatisfaction or worsening symptoms. Given the low attrition (3.3%), ITT should be feasible and must be performed. ... replace per-protocol analysis with ITT; and specify the exact AI model version. For the results, report sensitivity analyses comparing ITT to per-protocol findings..."
Response to Comment 4:
We thank the reviewer for this crucial methodological directive. We completely agree that an intention-to-treat (ITT) analysis is the most rigorous approach for analyzing RCT data and that our previous omission was a significant flaw. We have now conducted a full ITT analysis for our primary outcome (GAD-7) and have made this the primary analysis throughout the manuscript.
The following key changes have been made:
- In the Methods section (Data Analysis, Page 10, Lines 417-422): We now explicitly state our analytical approach based on the ITT principle, detailing the use of the Last Observation Carried Forward (LOCF) method and the role of the per-protocol (PP) analysis as a sensitivity analysis.
- In the Results section (Quantitative Analysis, Page 11, Lines 456-470): The main results reported for the primary outcome are now based on the ITT analysis. We clearly present the finding that in this primary analysis, neither intervention was statistically superior to the control group. A new subsection, "Sensitivity Analysis," has also been added to transparently report the PP results (Page 11, Lines 471-477).
- In the Discussion and Conclusions (Pages 15-16, Lines 582-595; Pages 17-18, Lines 690-701): As a direct consequence of these more conservative ITT findings, we have substantially rewritten our interpretation and conclusions to reflect a more cautious and balanced perspective.
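The difference between the ITT analysis with LOCF and the per-protocol sensitivity analysis described above can be sketched as follows. This is an illustrative sketch only, not the authors' code: the participant IDs and GAD-7 scores are hypothetical, and the three time points (baseline, week 2, week 4) are assumed from the response text.

```python
from typing import Dict, List, Optional, Sequence


def locf(scores: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing value (None) with the most recent observed value.

    Leading missing values stay None because nothing precedes them.
    """
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for score in scores:
        if score is not None:
            last = score
        filled.append(last)
    return filled


# Hypothetical GAD-7 scores at baseline, week 2, and week 4.
participants: Dict[str, List[Optional[float]]] = {
    "P01": [12, 9, 7],
    "P02": [10, None, None],  # withdrew after baseline
    "P03": [14, 11, None],    # withdrew after week 2
}

# ITT set: every randomized participant is kept, with gaps filled by LOCF.
itt = {pid: locf(scores) for pid, scores in participants.items()}

# Per-protocol set: only participants with complete data are kept.
per_protocol = {
    pid: scores for pid, scores in participants.items() if None not in scores
}

print(itt)
print(per_protocol)
```

LOCF treats a participant who withdraws as showing no further change, which is one reason it tends to give a more conservative estimate of treatment effects than an analysis restricted to completers.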
Comment 5: "The intervention design contains several elements that warrant critical scrutiny... the AI chatbot used ChatGPT-3.5, which by the time of manuscript submission had been superseded by more advanced models; the authors should specify the exact version and date of the model's knowledge cutoff. The lack of a structured, session-by-session protocol for either intervention... introduces substantial variability that reduces replicability."
Comment 6: "The blinding procedures are described but with important gaps... no formal manipulation check was administered... the pilot check with six participants is insufficient... The authors should report the proportion of main trial participants who... expressed suspicion... "
Comment 7: "The use of GEEs for secondary outcomes while using repeated-measures ANOVA for the primary outcome is inconsistent... The qualitative analysis, while informative, does not specify how many coders were used, whether inter-rater reliability was calculated, or how disagreements were resolved."
Response to Comments 5, 6, & 7:
We thank the reviewer for these specific and important questions regarding our methodological details. We have now revised the Methods section extensively to provide greater transparency on all these points.
- Regarding the AI Model and Intervention Protocol (Comment 5):
- AI Model Version: We have now specified the exact AI model details in the "AI Group" subsection (Page 6, Lines 237-240). The text now reads: "The chatbot was powered by Microsoft Copilot (formerly Bing Chat), which utilized the GPT-3.5-Turbo model. The specific version accessed during the study was the one publicly available in October 2023, with the underlying model having a knowledge cutoff of September 2021."
- Intervention Protocol: We acknowledge the non-manualized nature of our intervention. We have clarified the measures taken to ensure treatment fidelity in the "Intervention" section (Page 6, Lines 228-234), and have also discussed the limitations of this approach in the Limitations section (Page 17, Lines 667-669).
- Regarding the Blinding and Manipulation Check (Comment 6):
We agree that the manipulation check data from the main trial is essential. We have now added this information to the "Study Procedure" subsection (Page 9, Lines 403-406), reporting that none of the 27 participants in the Human group spontaneously expressed suspicion, suggesting the blinding was successfully maintained.
- Regarding Statistical and Qualitative Analysis Details (Comment 7):
- Rationale for Different Analytical Frameworks: We have now added a justification to the Data Analysis section (Page 10, Lines 424-431) for using two different analytical frameworks. We explain that ANOVA was chosen for the primary outcome's simpler structure (3 time points), while GEEs were chosen for the secondary outcomes' more complex structure (4 time points) and their robustness.
- GEE Model Specifications: We appreciate the reviewer's request for more detail on our GEE models.
- Regarding baseline adjustment: We would like to respectfully clarify that this was handled implicitly within our model structure. As noted in Table 3 (Note. b), the first time point (Week 1) was set as the reference category. This approach effectively adjusts for baseline levels by modeling all subsequent changes relative to the initial measurement point. We believe this is a standard and appropriate method for this type of longitudinal analysis.
- Regarding the correlation structure: We confirm that we specified an 'unstructured' working correlation matrix for all models, which is the most flexible approach.
- Regarding ITT for secondary outcomes: The ITT analysis was performed for the primary outcome only. As stated in the Data Analysis section (Page 10, Lines 424-426), the three participants who dropped out had no data whatsoever for the secondary outcomes (sleep and exercise), as their withdrawal occurred before the first weekly data collection point for these measures. Therefore, it was not feasible to include them in the GEE analyses, which were necessarily conducted on the available complete cases (n=87).
- Qualitative Analysis Rigor: We have added a detailed description of our qualitative coding process to the "Data Analysis" section (Page 10, Lines 434-439). This text specifies that coding was done by two independent researchers, and reports a Cohen's Kappa coefficient of 0.83, indicating substantial inter-rater reliability. We also describe the consensus process for resolving disagreements.
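For readers less familiar with the statistic, the following is a minimal sketch of how Cohen's Kappa is computed for two coders. It is an illustration only: the function name `cohens_kappa` and the theme labels are hypothetical, and the example data are invented for demonstration rather than taken from the study's transcripts.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters assigning one category per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed over categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1.")
    return (observed - expected) / (1 - expected)


# Hypothetical codes for ten interview excerpts (not study data).
coder_1 = ["empathy", "empathy", "generic", "generic", "access",
           "access", "empathy", "generic", "access", "empathy"]
coder_2 = ["empathy", "empathy", "generic", "generic", "access",
           "access", "empathy", "generic", "access", "generic"]

kappa = cohens_kappa(coder_1, coder_2)
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which is why it is usually reported instead of simple percent agreement.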
We hope that these additions and clarifications provide the necessary methodological transparency and fully address the reviewer's concerns.
Comment 8: "The quantitative results are presented clearly, but their interpretation is compromised by the selection bias risk. The within-group reduction in the AI group (p=0.003) alongside non-significance compared to control... is a nuanced pattern that the authors handle reasonably. However, the claim that the AI group showed a “plateau effect” is not directly tested; a formal test of linear versus quadratic trend would strengthen this assertion. ... The PBS difference between groups (t=-3.55, p<0.001) is substantial, but the scale's psychometric properties in this sample are not reported."
Response to Comment 8:
We thank the reviewer for these valuable suggestions on strengthening our results presentation. We have made revisions to address both points.
- Formal Test for the "Plateau Effect": We agree that our claim of a plateau effect in the AI group required direct statistical support. We have now provided this in the Results section (Page 11, Lines 480-483) by elaborating on our post-hoc pairwise comparisons. The revised text now highlights that while anxiety decreased significantly from baseline to week 2 (p = 0.003), there was no significant further reduction from week 2 to week 4 (p = 1.000). We explicitly state that this lack of improvement in the second half of the intervention provides direct statistical support for a plateau effect.
- Psychometric Properties of the PBS: We thank the reviewer for this comment. We would first like to respectfully clarify that the excellent internal consistency of the total PBS score (Cronbach's α = 0.962) was reported in the Methods section (Page 8, Lines 348-350), confirming the measure's reliability in our sample.
Regarding a potential subscale analysis, we appreciate the suggestion but have deliberately chosen to focus on the total score for two main reasons. First, this aligns with our study's exploratory goal of establishing an overall difference in perceived benefit, rather than delving into sub-dimensions. Second, conducting post-hoc analyses on three separate subscales would introduce issues of multiple comparisons and increase the risk of Type I errors, which we believe is not appropriate given our sample size. Therefore, we have retained the analysis of the robust total score as the most statistically sound approach. We hope the reviewer finds this reasoning acceptable.
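The multiple-comparisons concern can be made concrete with a short calculation. The sketch below assumes three independent tests at a nominal alpha of 0.05 (independence is a simplifying assumption; subscales of one instrument are typically correlated), and the function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Three subscale tests at the conventional 0.05 level (assumed for illustration).
fwer = familywise_error_rate(0.05, 3)
per_test_alpha = bonferroni_threshold(0.05, 3)
```

Under these assumptions, the chance of at least one spurious finding across three uncorrected tests is roughly 14%, and a Bonferroni correction would lower the per-test threshold to about 0.017.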
Reviewer 3 Report
Comments and Suggestions for Authors
Report attached.
Comments for author File:
Comments.pdf
Author Response
Response to Reviewer #3
We thank Reviewer #3 for the careful reading and the valuable, constructive comments, which have helped us improve the methodological rigor and interpretive depth of our manuscript. We have addressed all points as detailed below.
- Regarding the Study's Exploratory Nature due to Pre-registration Issues:
Comment: The reviewer pointed out that the study's methodological rigor is weakened by issues related to pre-registration and post-hoc sample selection, making the findings more exploratory.
Response:
We thank the reviewer for this critical feedback. We completely agree that this methodological context positions our study as exploratory. In response, we have thoroughly revised the manuscript to consistently and transparently frame it as an exploratory, hypothesis-generating investigation. This is now explicitly stated in the Introduction (Page 3, Lines 125-127), Methods (Page 5, Lines 174-179), Discussion (Page 17, Lines 661-664), Abstract and Conclusions (Page 1, Lines 35-41; Page 18, Lines 697-701). We believe this reframing provides a more accurate and rigorous context for interpreting our findings.
- Regarding the Operationalization of the "Human Therapy" Condition:
Comment: The reviewer raised conceptual concerns about the human therapy condition, asking for clarification on counselor training and the non-structured nature of the sessions.
Response:
We appreciate the reviewer's request for more detail.
Counselor Training and Terminology: We have clarified in the Methods section (Page 7, Lines 290-321) the rigorous three-month training and selection process for our peer counselors. To avoid overgeneralization, we have also revised the manuscript throughout to use the precise term "peer counselors" instead of "human therapists."
Non-structured Sessions: We have expanded the "Intervention" section (Page 6, Lines 223-234) to describe the semi-structured, flexible nature of the interventions and have detailed the specific measures we took to ensure treatment fidelity (e.g., monthly supervision, transcript review).
- Regarding the Interpretation of Secondary Outcomes:
Comment: The reviewer astutely noted that the improvements in secondary outcomes (sleep, exercise) without between-group differences might be due to non-specific factors and requested a more robust discussion, potentially including state-of-the-art literature.
Response:
This is an excellent point. We have now added a new, dedicated paragraph in the Discussion section (Page 16, Lines 627-640) to provide a multi-faceted interpretation of this finding. In this new section, we acknowledge the potential influence of non-specific factors (like the Hawthorne effect) and propose a complementary interpretation: that the improvements may be an indirect consequence of the primary reduction in anxiety. We have incorporated the literature suggested by the reviewer (e.g., Cox et al., 2015; Cox & Olatunji, 2016) to support the strong link between anxiety's cognitive components and sleep, thereby reinforcing our argument.
- Regarding the Rigor of the Qualitative Analysis:
Comment: The reviewer pointed out that the qualitative methodology lacked detail on rigor and justification for using frequency counts.
Response:
We thank the reviewer for this important feedback.
- We have added a detailed description of our coding process to the Data Analysis section (Page 10, Lines 434-439). This includes details on the two independent coders, the calculation of inter-rater reliability (Cohen's Kappa = 0.83), and the consensus process for resolving disagreements.
- We have also clarified the methodological purpose of the frequency counts in the note for Table 4 (Page 14, Lines 533-538). The note now explicitly states that these numbers serve as a "descriptive heuristic" to indicate the relative salience of a theme, and are not intended as a form of quasi-quantification, to address the reviewer's valid concern.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed every comment mentioned in the previous round. I recommend approval of the manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
All accomplished.