Review Reports - Patient Outcomes Under Varying Engagement Patterns on Real-World Lifestyle-Supported Pharmacological Weight-Loss Therapy

Round 1

Reviewer 1 Report (Previous Reviewer 3)

Comments and Suggestions for Authors

The study addresses an important and timely topic, making good use of a large real-world dataset; however, several methodological and interpretative issues currently limit the clarity and robustness of the findings. In particular, the operationalisation of the estimands, reliance on unvalidated self-reported data, and challenges in generalisability require more careful explanation. A substantial revision of the discussion, especially regarding causal language, engagement metrics, and population limitations, would greatly strengthen the manuscript.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

The manuscript is generally understandable, but the quality of English requires careful revision. Several sentences are overly complex, terminology is sometimes inconsistent, and there are numerous formatting artefacts and typographical errors that interrupt the flow. A thorough language and style edit would substantially improve readability.

Author Response

The manuscript presents a retrospective cohort study examining patient outcomes within a UK-based, unsubsidised digital weight-loss service that integrates lifestyle support with semaglutide therapy. The authors make good use of a large real-world dataset and adopt an estimand framework aligned with ICH E9(R1), which reflects a clear effort toward methodological rigour. The study is ambitious in scope and addresses a highly relevant and timely topic, especially in light of the growing reliance on digital platforms for obesity management. At the same time, there are several substantial issues that, in my view, limit the interpretability and robustness of the findings and therefore warrant major revision before the manuscript can be considered for publication.

Comment 1

A first major concern relates to how the estimands are conceptualised and implemented. While the distinction between efficacy and treatment estimands is theoretically appropriate, their operationalisation introduces important challenges. The efficacy estimand relies on a highly selected sub-cohort that excludes the majority of initiators, raising questions about selection bias and external validity.

Response 1: Thank you for this insightful observation regarding selection bias. We agree that the efficacy estimand represents a highly motivated subgroup. We have now added several lines to the limitations paragraph acknowledging that these results reflect an 'optimal' rather than 'average' initiator experience:

“Finally, the efficacy estimand, while providing insight into the drug’s potential under optimal conditions, was subject to selection bias. Patients who adhere to a 12-month regimen are likely to possess higher baseline motivation and health literacy than those who discontinue early. Consequently, these results may overstate the outcomes achievable by the general population of initiators.”

Comment 2

Conversely, the treatment estimand uses baseline observation carried forward as the primary strategy for handling missing 12-month weights. Although this approach is supplemented by sensitivity analyses, it risks generating overly conservative and potentially misleading estimates, particularly given the high attrition that is typical of commercial weight-loss services.

Response 2: Thank you for the suggestion. We have added a methodological justification for the use of multiple imputation in the methods (section 2.7) and expanded section 3.5 of the results to emphasize that the primary BOCF analysis likely underestimates the treatment strategy's effectiveness.

2.7: While BOCF was utilized as a conservative 'floor' for real-world effectiveness, MI was included to provide a more robust estimate by accounting for the potential weight loss trajectories of patients who discontinued treatment.

3.5: Consequently, the MI estimate (8.25%) likely provides a more plausible reflection of the real-world impact for the total cohort by adjusting for the conservative bias inherent in the BOCF method.
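To illustrate why BOCF acts as a conservative floor, the following is a minimal sketch rather than the authors' actual analysis. The patient records and function names are hypothetical. Under BOCF, a patient with a missing 12-month weight is assigned their baseline weight, which counts as 0% loss; the complete-case mean is shown only for contrast, and multiple imputation is not implemented here.

```python
from statistics import mean

def pct_loss(baseline_kg, followup_kg):
    """Percent of baseline weight lost (positive values mean weight loss)."""
    return 100.0 * (baseline_kg - followup_kg) / baseline_kg

def bocf_mean_loss(records):
    """Mean % loss with missing follow-up weights replaced by baseline (0% loss)."""
    losses = []
    for baseline, followup in records:
        if followup is None:
            followup = baseline  # baseline observation carried forward
        losses.append(pct_loss(baseline, followup))
    return mean(losses)

def complete_case_mean_loss(records):
    """Mean % loss among patients who reported a follow-up weight."""
    return mean(pct_loss(b, f) for b, f in records if f is not None)

# Hypothetical patients: (baseline kg, 12-month kg or None if missing)
records = [(100.0, 85.0), (90.0, None), (120.0, 108.0), (80.0, None)]

bocf = bocf_mean_loss(records)          # every dropout contributes 0% loss
cca = complete_case_mean_loss(records)  # dropouts are ignored entirely
print(f"BOCF mean loss: {bocf:.2f}%  complete-case mean loss: {cca:.2f}%")
```

In this toy data, half of the patients are missing a follow-up weight, so the BOCF mean is half the complete-case mean. A multiple-imputation estimate would typically fall between the two, depending on the imputation model.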

Comment 3:

The characteristics of the study population also limit the generalisability of the findings. Participants are predominantly women from a single ethnic group, and there is limited information on socioeconomic status or comorbidities. Given that the service is unsubsidised, financial factors are likely to influence both enrolment and adherence, introducing confounding that is only partially acknowledged in the current version of the manuscript. These limitations, while not undermining the value of the work, should be articulated more clearly and prominently in both the abstract and the discussion.

Response 3: Thank you for highlighting these demographic and clinical limitations. We have expanded our limitations section to explicitly acknowledge that the lack of diversity in ethnicity, sex, and socioeconomic status, combined with the absence of detailed comorbidity data, restricts the generalizability of our findings.

Firstly, the study’s sample was predominantly Caucasian (82.8%) and female (91.2%), which, combined with a lack of data on socioeconomic status and specific comorbidities, limits the generalizability of these findings across more diverse populations. Given that this is an unsubsidized service, it is likely that individuals from lower socioeconomic backgrounds are underrepresented, further narrowing the external validity of the results.

Comment 4:

Measurement issues further complicate the interpretation of the results. All weight outcomes are self-reported and unvalidated, and are therefore vulnerable to social desirability and reporting bias.

Response 4:
Thank you for this insightful recommendation. We agree that the self-reported nature of the data is a key limitation. We have updated the limitations paragraph to explicitly mention social desirability bias and the lack of clinical validation via calibrated scales, as suggested:

“Secondly, all weight and side effect data were self-reported; consequently, they may have been affected by social desirability and reporting biases. Unlike traditional clinical settings, these measurements lack external validation from calibrated scales or independent clinical verification.”

Comment 5:

Medication adherence is inferred solely from prescription orders, which is at best a proxy and cannot distinguish between purchasing medication and actually using injections as prescribed.

Response 5: Thank you for this important observation. We have now added the following sentences to the limitations paragraph of the discussion:
“Fifthly, medication adherence was proxied by order frequency. While this confirms the patient possessed the medication, it does not guarantee clinical administration or adherence to the injection schedule.”

Comment 6:

Engagement is operationalised through the frequency of weight entries, yet other behavioural metrics mentioned in the introduction are neither analysed nor explicitly reported as unavailable. Greater transparency about which engagement indicators were available, how they were selected, and what their limitations are would substantially strengthen the methodological narrative.

Response 6:

Thank you for the suggestion for greater transparency. We have added a sentence to methods (2.5) explaining that while other engagement metrics exist (e.g., coaching logs), weight tracking was selected as the primary metric due to its quantitative consistency across the 12-month period:

“While the Juniper platform includes coaching messages and meal logging, weight tracking frequency was selected as the primary engagement metric for this analysis because it provided the most consistent and quantifiable longitudinal data across the entire cohort.”

Comment 7:

The analytical strategy is largely appropriate, but some inconsistencies and ambiguities remain. Non-parametric tests are correctly used for skewed distributions, but the presentation sometimes blends means, medians, and model-based adjusted differences in a way that could confuse readers. In addition, several p-values and formatting elements contain errors and need careful revision.

Response 7: Thank you for this excellent suggestion. We have now corrected formatting errors on lines 346, 275 and 481. We have also changed all central tendency reporting for non-parametric tests to medians.

Comment 8:

The exploratory regression model explains only a modest proportion of the variance, which the authors rightly acknowledge; however, the interpretation of this model would benefit from further moderation, given the likelihood of unmeasured confounding and the risk of reverse causality between engagement and weight loss. The interpretation of the findings more broadly would benefit from a more cautious tone.

Response 8:

Thank you for this essential correction regarding the language used in our discussion. We have thoroughly revised Section 4 to replace causal assertions with associative framing. We have removed phrases such as 'causal inference' and 'attributable to' in favor of language that reflects the observational and associative nature of the data. Specifically, we have edited the following sentences (changes in bold):

“To more robustly account for confounding, we conducted a PSM analysis. The success of the PSM in balancing all measured covariates reduces the likelihood that baseline characteristics or adherence behaviors drove the observed difference, suggesting that the weight loss advantage for Wegovy may be related to the medication/dosage profile and not to patient characteristics or baseline adherence behaviors.”

‘The model's value lies in identifying the independent associations between specific program levers (such as medication count and weight tracker use) and weight loss outcomes rather than achieving high predictive power.’

‘They also suggest that the maximum semaglutide dose (Wegovy vs Ozempic) is significantly associated with weight loss for patients who progress through to their eighth order.’

‘This suggests that the observed variation in outcomes between the brands may be present even at equivalent low-to-mid doses, potentially driven by formulation differences or other subtle factors, rather than being solely dependent on the high-dose ceilings.’
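The responses do not specify which matching algorithm was used for the PSM analysis. As a hedged illustration of the general idea only, the sketch below performs greedy 1:1 nearest-neighbour matching without replacement on precomputed propensity scores. The patient identifiers, scores, caliper value, and the function name `greedy_match` are all hypothetical.

```python
def greedy_match(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, control: dicts mapping patient id -> propensity score.
    A treated patient stays unmatched if no remaining control lies
    within the caliper.
    """
    available = dict(control)  # controls not yet used in a pair
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control by absolute score difference
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs

# Hypothetical propensity scores for two medication groups
treated = {"w1": 0.30, "w2": 0.62, "w3": 0.90}
control = {"o1": 0.28, "o2": 0.60, "o3": 0.65, "o4": 0.40}

pairs = greedy_match(treated, control)
print(pairs)
```

Matching of this kind can only balance the covariates that entered the propensity model, which is why unmeasured factors remain a concern regardless of how good the balance looks.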

Comment 9:

The marked discrepancy between the efficacy and treatment estimates highlights the central importance of adherence and attrition, and this point could be more deeply integrated into the conceptual discussion.

Response 9: Thank you for this insightful observation. We agree that the gap between estimands warrants a deeper conceptual interpretation. We have added a new paragraph to the Discussion highlighting that the 7.79 percentage point discrepancy between efficacy and treatment outcomes underscores how the public health impact of pharmacotherapy is moderated by real-world attrition and barriers to persistence. This is the second paragraph of the discussion and reads as follows:

“The first discovery of interest was the marked discrepancy between the efficacy and treatment estimates (15.67% vs 7.88%), which highlights the central importance of adherence and attrition in real-world digital obesity care. This gap suggests that while the pharmacological intervention is highly effective under ideal conditions, its clinical impact is heavily moderated by the challenges of long-term program retention and the various financial or behavioral barriers to persistence inherent in an unsubsidized service.”
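For reference, the 7.79 percentage point figure follows directly from the two estimates quoted above. The ratio on the right is an added illustration, not a figure reported by the authors:

```latex
\[
15.67\% - 7.88\% = 7.79 \text{ percentage points},
\qquad
\frac{7.88}{15.67} \approx 0.50
\]
```

In other words, the treatment estimate is roughly half of the efficacy estimate.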

Comment 10:

The observed associations between higher weight tracking frequency or greater medication ordering and weight outcomes are interesting and clinically relevant, but the manuscript should avoid causal language given the observational design and the possibility of bi-directional or unmeasured motivational factors. Similarly, comparisons between Wegovy and Ozempic should be framed as associative rather than causal, particularly in light of limited adjustment for baseline characteristics and contextual factors such as supply and access.

Response 10: Thank you for highlighting these unmeasured factors. We have added a paragraph to the Discussion acknowledging that the relationship between engagement and weight loss may be bi-directional (where early success drives engagement). Furthermore, we have added a limitation regarding the role of medication supply and access in the UK as a contextual factor that may have influenced medication assignment. This paragraph (penultimate) reads as follows:

“It is also important to consider the potential for reverse causality or bi-directional motivation regarding engagement; while frequent weight tracking may facilitate weight loss through accountability, it is equally plausible that patients experiencing early success are more motivated to log their progress. Furthermore, while the PSM analysis adjusted for baseline patient characteristics, medication assignment was also influenced by contextual factors such as global supply constraints and product access in the UK during the study period. These unmeasured motivational and logistical factors should be considered when interpreting the associations between engagement, medication brand, and clinical outcomes.”

Comment 11:

From a presentational perspective, the manuscript would also benefit from substantial editorial refinement. There are several structural and stylistic issues, including stray formatting notes and inconsistently numbered tables and figures, which distract from the scientific content. The discussion is generally thoughtful and comprehensive, but at times somewhat repetitive. It could be streamlined to emphasise three key contributions: the gap between ideal-adherence and real-world outcomes, the gradient in outcomes associated with engagement, and the nuanced comparison of different medication profiles.

Response 11:

Thank you for this excellent observation. We appreciate this constructive feedback on the manuscript’s presentation. We have performed a thorough editorial refinement to improve clarity and professionalism. Specifically:

Formatting: We have removed all stray formatting notes (e.g., the bracket in the MICE section) and corrected the numbering of tables and figures to ensure a consistent sequence.
Discussion Streamlining: We have significantly restructured the Discussion to reduce repetition and focus on the three areas suggested: (1) the gap between efficacy and treatment estimands, (2) the engagement-outcome gradient, and (3) the comparison of medication profiles. This has resulted in a more concise and impactful narrative.
Removed redundant sentences:

“Arguably the most significant discoveries of this study were those pertaining to semaglutide type, medication count, and weight tracker use.”

“As expected, weight loss outcomes in the treatment estimand were inferior to the efficacy estimand. The fact that both the mean figure (7.88%; ±8.46) and proportion of patients reaching the ≥5% milestone (54.21%) in the treatment estimand were comparable to other real-world semaglutide studies [25, 20] suggests that, on the whole, there was nothing remarkable about the Juniper intervention.”

Comment 12:
Finally, the handling of conflicts of interest requires further clarity. Given the authors’ affiliations with the commercial provider, it is important to describe explicitly how analytic independence and the integrity of the publication process were safeguarded. As the manuscript relies heavily on patient-reported measures and behavioural indicators, the authors might also consider connecting their work more explicitly to recent research in bariatric candidates that has used validated psychometric instruments to capture psychological and psychosocial outcomes in a systematic way. This would reinforce the methodological rationale for incorporating structured psychological assessment into multidisciplinary obesity treatment pathways and help anchor the discussion of patient-reported outcomes within a broader biopsychosocial framework, highlighting the importance of integrating psychological and behavioural dimensions into obesity care.

In conclusion, the manuscript addresses an important topic with a rich dataset and clear methodological intent. However, substantial revisions are needed to clarify the estimand definitions, strengthen analytic transparency, improve interpretability, and refine the narrative. On this basis, I recommend a major revision, with particular attention to methodological justification, the careful moderation of causal language, and a more explicit articulation of the study’s conceptual and practical implications.

Response 12:

Thank you for this comment. We have added 6 lines to the Conflicts of Interest section to explicitly describe the methodological safeguards used, including the adoption of the ICH E9(R1) framework for objective reporting. We have also ensured that the manuscript remains transparent about the high attrition rates and the discrepancy between efficacy and treatment outcomes, demonstrating a commitment to an unbiased presentation of real-world data.

We also appreciate the suggestion to anchor our behavioral findings in a broader framework. We have added a paragraph to section 4.3 that connects our findings on digital engagement to research in bariatric candidates using validated psychometric instruments. This addition highlights the importance of integrating psychological and behavioral dimensions into digital obesity care to create a more robust, multidisciplinary treatment pathway.

Reviewer 2 Report (Previous Reviewer 2)

Comments and Suggestions for Authors

After thoroughly reviewing this manuscript, I find that it falls substantially short of the methodological and conceptual rigor required for publication, to the extent that the limitations undermine the credibility of nearly every major claim. The study relies heavily on retrospectively collected, self-reported data without adequately addressing the inherent biases this introduces, and key variables essential for interpreting weight-loss effectiveness, such as diet quality, physical activity, socioeconomic status, and adherence motivation, are completely unmeasured or dismissed, leaving major confounding unresolved. The analytical strategy is overly complicated yet poorly justified, with inappropriate reliance on non-parametric tests for large samples, insufficient transparency in the handling of missing data, and questionable use of propensity score matching given the profound structural differences between comparison groups. The presentation of results repeatedly overstates causal interpretation in a context where only highly limited observational inference is possible, and several of the conclusions extend far beyond what the data can defensibly support. Moreover, the manuscript appears to function partly as a promotional narrative for a commercial digital weight-loss service rather than as an independent scientific investigation, raising concerns about framing and neutrality.

Author Response

We thank the reviewer for their critical and rigorous evaluation of our manuscript. We acknowledge the concerns raised regarding the limitations of retrospective, real-world data and the potential for perceived bias in a commercially supported service. We have taken this feedback seriously and have implemented a major revision to improve the transparency, analytical clarity, and neutrality of the paper. Our point-by-point responses to the primary concerns are detailed below:

Commercial Framing and Neutrality: We take the concern regarding a "promotional narrative" very seriously. To ensure an objective and independent scientific investigation, the primary data analysis was conducted by independent academic researchers. Furthermore, we have intentionally highlighted the significant real-world challenges of the service, including the 77% attrition rate and the 7.79 percentage point gap between efficacy and treatment outcomes. We believe that by emphasizing these "real-world" failures alongside the successes, we provide a balanced and intellectually honest evaluation. The conflict of interest section has been expanded to explicitly describe these safeguards for analytic independence.
Unmeasured Confounders (Diet, Activity, SES): We fully agree that factors such as diet quality, physical activity, and socioeconomic status (SES) are critical unmeasured confounders in this study. As these data were not captured in the routine clinical service repository, we have significantly expanded our limitations paragraph (end of discussion) to clarify that our results reflect associations within a specific digital care context and cannot be interpreted as isolated pharmacological effects. We have explicitly noted that in an unsubsidized service, financial capacity (SES) is a major latent confounder for both adherence and outcomes.
Causal Interpretation: We have conducted a global revision of the manuscript to remove causal language. Terms such as "attributable to," "resulted in," or "causal inference" have been replaced with associative terminology like "associated with," "observed variation," or "suggested association." This ensures that our conclusions do not extend beyond what observational data can defensibly support.
Analytical Strategy (Non-parametric Tests & PSM): Regarding the use of non-parametric tests, while we acknowledge the reviewer’s point regarding large sample sizes, our decision was driven by the significant skewness and "long-tail" distribution common in weight loss data (where some patients achieve extreme loss while others gain weight). We believe prioritizing medians provides a more conservative and robust estimate that is less susceptible to outliers. Regarding the Propensity Score Matching (PSM), we have added a disclaimer to Section 4.3 acknowledging that PSM can only balance measured variables and that contextual factors (such as global medication supply and access) likely influenced medication assignment and outcomes.
Transparency in Missing Data: To address the concern regarding missing data handling, we have included a Sensitivity Analysis (Table 8). This table compares our primary, conservative Baseline Observation Carried Forward (BOCF) approach with Multiple Imputation (MI) and Complete Case Analysis (CCA). This allows readers to transparently evaluate how different missing data assumptions impact the treatment estimand results.
Conceptual Integration: Following your recommendation, we have restructured the discussion to follow a more rigorous three-pillar framework: (1) The Adherence Gap, (2) The Engagement-Outcome Gradient, and (3) Comparative Medication Profiles. We have also anchored our behavioral findings within a broader biopsychosocial framework, referencing established research in bariatric surgery candidates to emphasize the need for integrated psychological and behavioral assessments in obesity care.
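The argument above for prioritizing medians under a long-tailed distribution can be seen in a small sketch. The weight-loss values below are hypothetical and chosen only to include one patient who gained weight and one with extreme loss; they are not taken from the study.

```python
from statistics import mean, median

# Hypothetical 12-month % weight-loss values: one patient gained weight
# (negative value) and one achieved an extreme loss (long right tail).
losses = [-2.0, 4.0, 5.0, 6.0, 7.0, 40.0]

m = mean(losses)      # pulled upward by the single extreme value
med = median(losses)  # depends only on the middle of the distribution

print(f"mean = {m:.2f}%, median = {med:.2f}%")
```

Here the single extreme value lifts the mean to roughly twice the median, while the median is unchanged whether that patient lost 40% or 10%.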

In summary, we believe these substantial revisions have clarified the estimand definitions, strengthened the analytic transparency, and moderated the narrative to ensure the manuscript meets the standards of a rigorous clinical investigation.

Round 2

Reviewer 1 Report (Previous Reviewer 3)

Comments and Suggestions for Authors

The authors have provided transparent and constructive responses to all comments. The key methodological and interpretative concerns have been appropriately acknowledged and addressed, particularly with respect to estimand definition and implementation, handling of missing data, limitations in generalisability, moderation of causal language, and overall narrative clarity. The revisions substantially strengthen the methodological rigor, interpretability, and balance of the manuscript. In its current form, the study represents a meaningful contribution to the literature and is suitable to proceed in the review and editorial decision process.

Comments on the Quality of English Language

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear authors, thank you for the submission focusing on ‘Patient outcomes under varying engagement patterns on real world lifestyle supported pharmacological weight loss therapy’. Please see detailed feedback for consideration to improve scientific soundness.

Juniper health is a brand name and does not need to be included in the manuscript as Digital weight loss service (DWLS) is sufficient.

Authors

Why does author 2 include a gmail email address as an affiliation when they work at a university?

Does the 4th author have an academic affiliation also?

Key Word, please delete real world service.

Introduction

Line 61 please change stress to something else more scientific.

Materials and Methods

While a lower BMI was taken into account for non-Caucasian populations, the storage of adipose tissue in the visceral area was not considered and is an important factor for health outcomes, rather than looking at weight loss alone in obesity in humans.

Given the population and the large number of women in the data, were pregnancy, breastfeeding or miscarriage included as exclusion criteria or screened for? These were not included in the exclusion criteria in the manuscript.

Line 102 ethics approval number should be included in the manuscript.

How was weight measured in a DWLS? Do participants wear the same clothes, is there a protocol for weight measurement? Is it taken into account when women are going through menstruation? How is height measured to take BMI?

Statistical Analysis

There is no mention of ANOVA or Tukey’s post hoc analysis; however, both are included in the results section.

Results

There is no control group in the study design.

Tables can be improved to be more reader friendly.

BMI 2 should be in superscript.

The written text of the results section shouldn’t be in italics.

Table headings should be above the tables and figure headings below the figures.

12 month weight loss% by tracker usage should be deleted and incorporated in the title for the figure.

Body composition was not monitored, which is important for comparing different ethnicities; despite weight loss, there can be changes in adiposity, for example in Asian populations, which have higher VAT stores relating to disease outcome than Caucasian populations. Hip-to-waist ratio and waist circumference are valuable markers, and their absence can be a limitation when DWLS are utilised.

Discussion

Line 345 Please delete Authors

In the discussion, the authors discuss Asian individuals having less weight loss compared to Caucasian individuals. Please take into account the VAT distribution in specific ethnicities: https://pmc.ncbi.nlm.nih.gov/articles/PMC7803598/

References

Line 526 the journal isn’t in italics

Lines 562 and 563: the journal title is in italics and the name is not?

Lines 566 and 567: the journal title is in italics and the name is not?

Lines 568 and 569: the journal title is in italics and the name is not?

Lines 571 and 572: the journal title is in italics and the name is not?

Lines 574 and 576: the journal title is in italics and the name is not?

Lines 577 and 578: the journal title is in italics and the name is not?

Abbreviations are in different font sizes.

Author Response

Reviewer 1

Comment 1: Juniper health is a brand name and does not need to be included in the manuscript as Digital weight loss service (DWLS) is sufficient.

Response 1: Thank you for your attention to maintaining generalizability, but we must respectfully retain the name Juniper Health in the manuscript.

While we agree that the abbreviation DWLS provides helpful context, specifying the platform is essential for transparency and interpretation of the results for several key reasons:

Contextual Specificity: This study reports on the outcomes of a single, specific clinical program run by a named provider. As the effectiveness of DWLS is highly dependent on their unique clinical protocols (e.g., specific multidisciplinary team composition, coaching content, and pricing models), reporting the results without the source program would make the findings non-replicable and functionally meaningless to other researchers.
Conflict of Interest Acknowledgment: Since several authors are affiliated with Juniper Health, the naming of the platform is a necessary part of transparently declaring the source of the data and acknowledging the potential for conflict of interest.
Real-World Applicability: Our study contributes to the literature on unsubsidized DWLS. Naming the program grounds our findings in the context of its publicly known price and service model, providing crucial context for readers interested in the economic barriers we discuss in the Discussion section.

Comment 2: Why does author 2 include a gmail email address as an affiliation when they work at a university?

Response 2: Thank you for identifying this. We have now inserted author 2’s university email address.

Comment 3: Does the 4th author have an academic affiliation also?

Response 3: No, author 4 is not currently affiliated with an academic institution.

Comment 4: Key Word, please delete real world service.

Response 4: Thanks for this suggestion. The term has now been removed.

Introduction

Comment 5: Line 61 please change stress to something else more scientific.

Response 5: Thank you for this recommendation. We have now replaced the word ‘stress’ with ‘advise’.

Materials and Methods

Comment 6: While a lower BMI was taken into account for non-Caucasian populations, the storage of adipose tissue in the visceral area was not considered and is an important factor for health outcomes, rather than looking at weight loss alone in obesity in humans.

Response 6: Thank you for this excellent observation. Unfortunately, data on body composition were not available. We have now added this as a limitation (lines 1463-1466):
“Thirdly, quality of life and body composition data were not systematically collected by Juniper, which prevented investigators from extending upon the clinical relevance of the study’s findings.”

Comment 6: Given the population and the large number of women in the data, were pregnancy, breastfeeding or miscarriage included as exclusion criteria or screened for? These were not included in the exclusion criteria in the manuscript.

Response 6: Thank you for recognising this oversight. We have now added ‘and current or planned pregnancy’ to the list of exclusion criteria (line 173).

Comment 7: Line 102 ethics approval number should be included in the manuscript.

Response 7: Thanks for noticing this. We have now included the ethics approval number on line 134 (IREC015).

Comment 8: How was weight measured in a DWLS? Do participants wear the same clothes, is there a protocol for weight measurement? Is it taken into account when women are going through menstruation? How is height measured to take BMI?

Response 8: Thanks for requesting this specificity. We have now added the following sentence to the first paragraph of the ‘2.2: program overview’ subsection: ‘All patients are required to submit 2 photos (1 front-on view; 1 side-on view) of themselves in ‘activewear or swimwear’ where their abdomen, face and knees are visible.’

Statistical Analysis

Comment 9: There is no mention of ANOVA or Tukey’s post hoc analysis; however, both are included in the results section.

Response 9: Thank you for this important observation. We have now replaced the ANOVA and Tukey HSD with Kruskal-Wallis and Holm-Bonferroni post hoc tests (Table 5). These changes are now all aligned with the methods section.
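As a hedged illustration of the Holm-Bonferroni step mentioned in this response, the sketch below adjusts a set of raw p-values using the step-down procedure. The raw p-values are hypothetical stand-ins for pairwise comparisons following a Kruskal-Wallis omnibus test; they are not the values in the authors' Table 5, and `holm_adjust` is a name introduced here for illustration.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k + 1), capped at 1
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from three pairwise group comparisons
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
print(adj)
```

Each adjusted value is compared against the chosen significance level directly; with the hypothetical inputs above, only the first comparison would remain below 0.05.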

Results

Comment 10: There is no control group in the study design.

Response 10: Thank you for this comment. Investigators did not have the means to fund a control group in this study. This is a common limitation of research on real-world interventions. We have encouraged future researchers to explore a comparison between ‘unsubsidized and subsidized patients of the same medicated real-world DWLS’ in the penultimate paragraph of the discussion.

Comment: Tables can be improved to be more reader friendly.

Response: Thank you for this excellent suggestion. We have now removed two redundant columns from Table 3, realigned the columns in Table 5, and removed all borders in all 5 tables except for the horizontal borders that separate titles and variable themes.

Comment: BMI 2 should be in superscript.

Response: Thank you for this important comment. All BMI kg/m² references now use a superscripted ‘2’.

Comment: The written text of the results section shouldn’t be in italics.

Response: Thank you for noticing this error. We have removed italics as the main font for the results section.

Comment: Table headings should be above the tables and figure headings below the figures.

Response: Thank you for identifying this. We have now applied these changes.

Comment: 12-month weight loss% by tracker usage should be deleted and incorporated in the title for the figure.

Response: Thank you for this sharp observation. We have now removed the title from the visualization.

Comment: Body composition was not monitored, which is important for comparing different ethnicities. Despite weight loss there can be changes in adiposity, for example in Asian populations, which have higher VAT stores relating to disease outcome when compared with Caucasian populations. Hip to Waist Ratio and Waist circumference are valuable markers, and their absence can be a limitation when DWLS are utilised.

Response: Thank you for noticing this limitation. We have now added this to our list of limitations at the end of the discussion. The sentence (lines 1463-1466) now reads as follows:

“Thirdly, quality of life and body composition data were not systematically collected by Juniper, which prevented investigators from extending upon the clinical relevance of the study’s findings.”

Discussion

Comment: Line 345 Please delete Authors

Response: Thank you for noticing this. We have now deleted this typographical error.

Comment: In the discussion, the authors discuss Asian individuals having less weight loss compared to Caucasian individuals. Please take into account the VAT distribution in specific ethnicities https://pmc.ncbi.nlm.nih.gov/articles/PMC7803598/

Response: Thanks for this insightful recommendation. We have now revised the following sentences (lines 1278-1281):

“Similarly, the discovery that patients of Asian ethnicity lost, on average, less weight than Caucasian patients aligns with the results from a large clinical study [29]. The latter suggested that lower initial BMI may explain this trend, but body composition or visceral adipose tissue may also be important factors.”

Comment:

References

Line 526 the journal isn’t in italics

Line 562 and 563 the journal title is in italics and the name is not?

Line 566 and 567 the journal title is in italics and the name is not?

Line 568 and 569 the journal title is in italics and the name is not?

Line 571 and 572 the journal title is in italics and the name is not?

Line 574 and 576 the journal title is in italics and the name is not?

Line 577 and 578 the journal title is in italics and the name is not?

Response: Thank you very much for noticing these discrepancies. We have now put all study titles in normal text and journal names in italics.

Comment: Abbreviations are in different font sizes.

Response: Thank you for noticing this error. It has now been corrected.

Reviewer 2 Report

Comments and Suggestions for Authors

Major Revisions: This manuscript addresses an important topic—engagement and weight-loss outcomes in a real-world digital program supported by semaglutide—but it suffers from substantial methodological and statistical flaws that undermine the credibility and generalizability of its conclusions. The retrospective cohort design is weakened by exclusion of patients who paused treatment for more than 90 days or had more than 15 medication orders, introducing selection bias and overestimating effectiveness in real-world conditions. The division into efficacy and treatment estimands is insufficiently justified and not framed within a recognized framework such as ICH E9(R1), and the lack of sensitivity analyses—particularly intention-to-treat approaches—prevents understanding of the impact of the exclusion criteria on the outcomes. Reliance on self-reported weight data without validation introduces significant measurement bias, and the arbitrary definition of a 90-day pause is not supported by clinical or methodological rationale. The use of parametric tests (t-tests, ANOVA, linear regression) despite significant departures from normality in Shapiro–Wilk tests (p < 0.05), without evidence of variance homogeneity testing or use of robust alternatives (e.g., GLMs, quantile regression), undermines the validity of statistical inferences. The regression model’s weak explanatory power (adjusted R² = 0.17) suggests that critical predictors were omitted or that relationships are non-linear, yet this is neither discussed nor addressed. Furthermore, multiple post-hoc comparisons were conducted without adjustment for multiplicity (e.g., Bonferroni, FDR), inflating the risk of Type I errors. The study’s external validity is severely limited by the predominantly female (91%) and Caucasian (82%) sample and the omission of socioeconomic variables, despite the high cost of the program likely being a key determinant of adherence. 
Excluding patients with prolonged pauses may have inflated treatment effect estimates and underestimated adverse events, thus biasing results in favor of the intervention. The introduction lacks a strong conceptual framework on adherence and engagement in digital programs and does not reference the estimand approach or behavioral and economic barriers to adherence, which are central to the study’s claims; furthermore, the omission of patient-centered outcomes such as quality of life limits the clinical relevance of the findings.

Minor Revisions: The manuscript should report standardized effect sizes (e.g., Cohen’s d, partial η², odds ratios) alongside p-values to enhance clinical interpretability. Figures, especially those depicting subgroup differences (e.g., Figure 2), should include confidence intervals or error bars to convey precision. Missing data handling is not described and should be clarified, particularly given the retrospective nature of the study. Engagement, measured through medication orders and weight-tracker use, should be explored with mediation or moderation analyses to better understand the causal pathway linking engagement to outcomes. Finally, the introduction would benefit from integrating literature on digital engagement as a determinant of adherence and weight loss, as well as providing a more robust justification for the use of estimands in real-world observational research.

Author Response

Comment 1: The retrospective cohort design is weakened by exclusion of patients who paused treatment for more than 90 days or had more than 15 medication orders, introducing selection bias and overestimating effectiveness in real-world conditions.

Response 1: Thank you for this insightful comment. We have now included those patients in the treatment estimand. All figures have been updated in tables 2 and 3, and the corresponding text in the results section. This sentence has been removed from sub section 2.5 (Endpoints):

‘Patients who received more than 15 orders and/or paused treatment for over 90 days were excluded from all groups.'

We also clarified how missing 12-month weight data were imputed in the treatment estimand by inserting the following sentence into the final paragraph of subsection 2.5 (endpoints):

‘Missing 12-month weight data in the treatment estimand were imputed using the Last Observation Carried Forward (LOCF) method. For patients who had no post-baseline weight submission, the initial (baseline) weight was carried forward, resulting in 0% weight loss for these individuals.’
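As an illustration only, the LOCF rule described in the quoted sentence can be sketched in a few lines of Python. The function name and the use of None for a missing submission are assumptions made for this sketch; they are not taken from the manuscript, whose analyses were run in R.

```python
from typing import Optional, Sequence


def locf_weight_loss_pct(baseline_kg: float,
                         followups_kg: Sequence[Optional[float]]) -> float:
    """Percentage weight loss at the end of follow-up under LOCF.

    followups_kg holds post-baseline submissions in time order; None marks
    a missing submission (a hypothetical encoding for this sketch).
    """
    # With no post-baseline submission the baseline itself is carried
    # forward, which yields 0% weight loss.
    last_observed = baseline_kg
    for weight in followups_kg:
        if weight is not None:
            last_observed = weight  # most recent observed value so far
    return (baseline_kg - last_observed) * 100.0 / baseline_kg
```

For example, a patient with a 100 kg baseline whose last submitted weight was 90 kg is assigned 10% weight loss, while a patient with no post-baseline submission is assigned 0%.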

We have also updated the description of the different groups at the bottom of Table 2, so they read as follows:

Efficacy Estimand (Adherent plus patients): received between 8-15 medication orders, recorded their weight within 341-379 days post initiation, and did not pause treatment for any longer than 90 days.

Adherent patients: received at least 8 medication orders but did not record their weight within 341-379 days post program initiation (includes those who paused treatment for longer than 90 days); or received between 8-15 medication orders, recorded their weight within 341-379 days post initiation, but paused treatment for longer than 90 days.

Remaining patients: received fewer than 8 medication orders or more than 15 orders.

Treatment estimand: all three groups combined.
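For readers who find a decision rule easier to follow, the group definitions above could be expressed roughly as below. This is a sketch under stated assumptions: the Patient fields and the classify function are hypothetical names, and it assumes that any order count outside 8-15 places a patient in the remaining group, which is only one possible reading where the written definitions overlap (for example, more than 15 orders with no weight recorded in the window).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    orders: int                 # medication orders received
    weight_day: Optional[int]   # day of the 12-month weight record, None if absent
    longest_pause_days: int     # longest (single or cumulative) treatment pause


def classify(p: Patient) -> str:
    """Assign a patient to one of the three Table 2 groups (illustrative)."""
    in_window = p.weight_day is not None and 341 <= p.weight_day <= 379
    if not 8 <= p.orders <= 15:
        # Assumption: order count outside 8-15 takes precedence.
        return "remaining"
    if in_window and p.longest_pause_days <= 90:
        return "efficacy"
    # 8-15 orders, but no weight in the window or a pause longer than 90 days.
    return "adherent"
```

Under this reading the treatment estimand is simply every patient, whatever label classify returns.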

All commentary around this exclusion limitation has been removed from the discussion section. For example, we removed the sentence “A further 2858 (39.26%) patients were excluded because they paused the program for longer than 90 days.”

Comment 2: The division into efficacy and treatment estimands is insufficiently justified and not framed within a recognized framework such as ICH E9(R1), and the lack of sensitivity analyses, particularly intention-to-treat approaches, prevents understanding of the impact of the exclusion criteria on the outcomes.

Response 2: Thank you for this important recommendation. We have now added the following paragraph to the end of the introduction (with the appropriate citation):

In line with the International Conference on Harmonisation (ICH) E9(R1) addendum on estimands [18], this study aims to retrospectively assess 12-month weight loss and adherence from a cohort of patients from the Juniper UK DWLS. Specifically, we define two estimands: an efficacy estimand (representing efficacy under ideal adherence) and a treatment estimand (representing effectiveness in a real-world context, including non-adherence). The study will compare the outcomes of both the ideal adherence cohort and a modified full cohort, i.e., including those who deviated from the suggested clinical pathway or discontinued early. The study will also examine how weight loss is affected by demographic factors, program pauses, weight tracker engagement levels, and semaglutide brand (Wegovy and Ozempic).

We also frame our methods within the ICH E9 (R1) framework in subsection 2.5:

To generate meaningful and robust findings, patient outcomes were assessed based on two primary estimands, defined in accordance with the International Conference on Harmonisation E9(R1) framework:

Efficacy estimand: This estimand reflects the biological efficacy of the treatment under the hypothetical condition of full protocol adherence. It corresponded to patients who received between 8 and 15 medication orders, reported weight measurements within a 12-month post-initiation assessment window (341–379 days), and did not pause treatment for any longer than 90 days.
Treatment (Intention-to-Treat) estimand: This estimand reflects the treatment strategy's effectiveness in a real-world setting. It included all patients in the Efficacy Estimand plus those who demonstrated limited adherence (received fewer than 8 or more than 15 orders, paused for longer than 90 days, or received between 8-15 orders but did not track weight within a 12-month post-initiation assessment window (341-379 days)).

Comment 3: Reliance on self-reported weight data without validation introduces significant measurement bias, and the arbitrary definition of a 90-day pause is not supported by clinical or methodological rationale.

Response 3: We acknowledge both of these limitations:

Self-Reported Weight: We recognize the inherent measurement bias associated with self-reported weight. This is a common and unavoidable limitation in digital weight loss service (DWLS) studies, as objective, clinician-measured data is not routinely collected. This limitation is retained and explicitly discussed in the limitations paragraph at the end of our discussion section.
90-Day Pause Rationale: We agree that the rationale for the 90-day cut-off was previously missing. This has now been addressed in subsection 2.5 (lines 232-242):

“The estimand cohort was selected to reflect the optimal efficacy under conditions of sustained adherence, requiring exclusion criteria to mitigate the impact of major intercurrent events. Patients were required to have received between 8 and 15 medication orders over the 12-month period. The minimum of 8 orders ensured adequate treatment exposure, while the maximum of 15 orders accounted for logistical variations such as pre-ordering for travel, without including excessive ordering that would suggest protocol deviation. A single or cumulative pause in medication supply exceeding 90 days resulted in exclusion from this estimand. This 90-day cutoff serves as an established determinant of non-adherence over a 12-month assessment period in real-world studies [19], and was necessary to minimize the confounding risk associated with prolonged periods of non-exposure to the study medication.”

Comment 4: The use of parametric tests (t-tests, ANOVA, linear regression) despite significant departures from normality in Shapiro–Wilk tests (p < 0.05), without evidence of variance homogeneity testing or use of robust alternatives (e.g., GLMs, quantile regression), undermines the validity of statistical inferences. The regression model’s weak explanatory power (adjusted R² = 0.17) suggests that critical predictors were omitted or that relationships are non-linear, yet this is neither discussed nor addressed. Furthermore, multiple post-hoc comparisons were conducted without adjustment for multiplicity (e.g., Bonferroni, FDR), inflating the risk of Type I errors.

Response 4: Thank you for identifying this critical aspect of our statistical methodology. We fully concur that the initial reliance on parametric tests, despite significant departures from normality, undermined the validity of the inferences.

We have addressed this by implementing the following changes, as reflected in the revised Subsection 2.6 (Statistical Analysis):

Test Validity and Robustness: We have replaced all initial primary comparisons using t-tests and ANOVA with non-parametric alternatives (Mann–Whitney U test and Kruskal–Wallis test). This eliminates the reliance on the normality assumption. Furthermore, for the critical medication comparison, we implemented the highly robust Propensity Score Matching (PSM) analysis combined with the Mann-Whitney U test, strengthening the causal inference.
Multiplicity Correction: We have confirmed that all post-hoc pairwise comparisons now utilize the Holm-Bonferroni method to rigorously control the Family-wise Error Rate (FWER) and prevent the inflation of Type I errors.
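To make the multiplicity correction concrete, the Holm-Bonferroni step-down adjustment can be sketched as follows. This is a generic textbook version written for illustration; the function name is an assumption and it is not the authors' R code.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

A pairwise comparison is then declared significant when its adjusted p-value falls below the chosen alpha, which controls the family-wise error rate without assuming independence between tests.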

Subsection 2.6 now reads as follows:

“Data distribution normality was evaluated using quantile-quantile plots and Shapiro–Wilk tests. Given the significant departures from normality (p < 0.05) observed in the Shapiro–Wilk tests, non-parametric methods were employed for primary inference involving continuous outcomes, with means and standard deviations retained for descriptive purposes only. The effect of continuous independent variables on weight loss percentage was assessed using Spearman’s rank correlation test. Categorical independent variables (excluding the medication comparison) were analyzed using the Mann–Whitney U test (for binary variables) or the Kruskal–Wallis test (for multi-level variables). Post-hoc pairwise comparisons following the Kruskal–Wallis test were corrected using the Holm-Bonferroni method to control the Family-wise Error Rate (FWER). Categorical dependent and independent variables, such as the proportion of patients achieving weight loss milestones (≥5%, ≥10%, ≥15%) versus medication brand or patient group, were compared using Chi-Square tests.

To robustly compare the effect of Wegovy versus Ozempic on 12-month weight loss while adjusting for potential confounding variables (e.g., adherence behaviors and patient selection), Propensity Score Matching (PSM) was employed. Propensity scores were estimated using logistic regression, modeling the probability of receiving Wegovy (the treatment) based on the following covariates: age, initial BMI, initial weight, ethnicity, and pause incidence (yes/no). Nearest neighbor matching (1:1) with a caliper of 0.2 of the logit standard deviation was used to create the final matched cohort. The primary comparison of weight loss percentage within the matched cohort was then conducted using the Mann–Whitney U test, and the magnitude of the adjusted difference was reported using Cohen’s d and the 95% Confidence Interval of the mean difference from a linear model. All visualizations and statistical analyses were conducted using RStudio, version 2023.06.1+524 (RStudio: Integrated Development Environment for R, Boston, MA, USA).”
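The matching step described in the quoted paragraph can be illustrated with a minimal sketch. It assumes the propensity scores have already been estimated, performs greedy 1:1 nearest-neighbour matching without replacement on the logit scale, and takes the caliper as 0.2 times the standard deviation of the logit across all patients. The function names, the greedy ordering and the pooled standard deviation are assumptions of this sketch; the manuscript does not state these details and the authors' analysis was run in R.

```python
import math
from statistics import stdev
from typing import Dict, Sequence


def logit(p: float) -> float:
    """Log-odds of a propensity score strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def match_nearest(treated_ps: Sequence[float],
                  control_ps: Sequence[float],
                  caliper_sd: float = 0.2) -> Dict[int, int]:
    """Greedy 1:1 matching; returns {treated index: control index}."""
    treated_logit = [logit(p) for p in treated_ps]
    control_logit = [logit(p) for p in control_ps]
    # Caliper width: a fraction of the SD of the logit over all patients.
    caliper = caliper_sd * stdev(treated_logit + control_logit)
    available = set(range(len(control_logit)))
    pairs: Dict[int, int] = {}
    for t_idx, t_val in enumerate(treated_logit):
        if not available:
            break
        # Closest unused control; ties are broken by the lower index.
        best = min(available, key=lambda c: (abs(control_logit[c] - t_val), c))
        if abs(control_logit[best] - t_val) <= caliper:
            pairs[t_idx] = best
            available.remove(best)  # matching without replacement
    return pairs
```

Treated patients with no control inside the caliper are left unmatched and drop out of the matched cohort, which is why covariate balance and the size of the matched sample are usually reported alongside the result.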

Regarding your concern about the low explanatory power of our regression model, we have added a clarifying statement to the discussion section (lines 1353-1059) to contextualize this finding, arguing that the model's value lies in identifying key, significant predictors available in the service data (like medication count and weight tracker use), rather than achieving high predictive power:

“We recognize the modest explanatory power of the Multiple Linear Regression model (Adjusted R² = 0.171). This is expected in a retrospective, real-world cohort study utilizing routine service data. Unlike tightly controlled clinical trials, the model inherently omits critical, unmeasured confounders that drive weight loss, such as specific dietary intake, physical activity levels, and patient motivation. The model's value lies in isolating the unique, statistically significant effects of the available program levers (like medication count and weight tracker use) rather than achieving high predictive power.”

Comment 5: The study’s external validity is severely limited by the predominantly female (91%) and Caucasian (82%) sample and the omission of socioeconomic variables, despite the high cost of the program likely being a key determinant of adherence. Excluding patients with prolonged pauses may have inflated treatment effect estimates and underestimated adverse events, thus biasing results in favor of the intervention.

Response 5: Thank you for this important comment. As explained in response to comment 1, we have now included patients with extended pauses in the analysis. We have subsequently removed that limitation from the final paragraph of the discussion and replaced it with the following limitation:

“Finally, investigators did not have access to patient income data and therefore could not determine whether socioeconomic status influenced program outcomes or adherence. Despite the absence of these data, it is likely that the high program cost led to underrepresentation among lower socioeconomic groups.”

In addition to this, we have modified the third last paragraph of the discussion (lines 1385-1389) to stress the disparity between the outcomes in the new (complete) treatment estimand and efficacy estimand:

“The fact that both the mean figure (7.88%; ±8.46) and proportion of patients reaching the ≥5% milestone (54.21%) in the treatment estimand were comparable to other real-world semaglutide studies [25, 20] suggests that, on the whole, there was nothing remarkable about the Juniper intervention.”

Comment 6: The introduction lacks a strong conceptual framework on adherence and engagement in digital programs and does not reference the estimand approach or behavioral and economic barriers to adherence, which are central to the study’s claims;

Response 6: Thank you for this astute observation. We have now added three paragraphs to the introduction. The first discusses the importance of engagement and adherence in digital health interventions with 4 additional citations (lines 76-84):

‘While DWLSs improve access to care, their long-term effectiveness hinges on sustained digital engagement and behavioral adherence to in-app tools, rather than just medication uptake. Numerous studies within digital weight management have established that frequent interaction with the platform—such as regular weight tracking, goal logging, and message exchange with coaches—is strongly and positively correlated with superior weight loss outcomes [13, 14]. This active, ongoing engagement is vital as it directly facilitates the recommended lifestyle interventions that underpin successful GLP-1 RA therapy, effectively translating the passive receipt of medication into an adherent behavioral strategy for chronic disease management [15, 16].’

The other two paragraphs explain how an analysis of adherence and engagement in DWLSs benefits from a treatment/efficacy estimand division, referring to the ICH E9(R1) framework. These were inserted at the end of the section (lines 107-123):

“Given the inherent variability and non-randomized nature of observational studies, patients inevitably deviate from the suggested treatment pathway. To rigorously and transparently quantify the impact of such deviations, we frame our research question using the Estimand framework defined by the International Conference on Harmonisation (ICH) E9(R1) addendum on estimands [22]. This approach allows for the precise definition of two treatment effects: the efficacy estimand (representing efficacy under ideal adherence) and the treatment estimand (representing effectiveness of the strategy in the real-world, including non-adherence and dropout).

In line with this framework, this study aims to retrospectively assess 12-month weight loss and adherence from a cohort of patients from the Juniper UK DWLS. Specifically, we define two estimands: an efficacy estimand (representing efficacy under ideal adherence) and a treatment estimand (representing effectiveness in a real-world context, including non-adherence). The study will compare the outcomes of both the ideal adherence cohort and a modified full cohort, i.e., including those who deviated from the suggested clinical pathway or discontinued early. The study will also examine how weight loss is affected by demographic factors, program pauses, weight tracker engagement levels, and semaglutide brand (Wegovy and Ozempic).”

Comment 7: furthermore, the omission of patient-centered outcomes such as quality of life limits the clinical relevance of the findings.

Response 7: Thank you for noticing this key omission. We have now added this as a limitation to the final paragraph of the discussion section. The sentence reads as follows:

“Thirdly, quality of life data were not systematically collected by Juniper, which prevented investigators from extending upon the clinical relevance of the study’s findings.”

Minor Revisions:

Comment: The manuscript should report standardized effect sizes (e.g., Cohen’s d, partial η², odds ratios) alongside p-values to enhance clinical interpretability.

Response: Thank you for this insightful observation. We have now reported Cohen’s d and IQR in our PSM cohort at the bottom of subsection 3.2:

“In the PSM cohort, participants treated with Ozempic experienced a significantly lower median weight loss percentage (Median = 14.0%; IQR [9.6%; 19.0%]) than those who received Wegovy (Median = 17.0%; IQR [12.0%; 23.0%]). The adjusted median difference was 3.0 percentage points. The effect size was medium and favored Wegovy (Cohen's d = 0.38; 95% CI [0.26, 0.51]).”
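For reference, one common way to compute Cohen's d is the difference in group means divided by the pooled standard deviation. The sketch below uses that pooled-SD form, which is an assumption: the response does not say which variant was used, and the function name is illustrative.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its
    # degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
```

A positive value means the first group has the larger mean, so passing the Wegovy weight-loss percentages first would give a positive d when Wegovy is favored.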

We have also reported IQR and epsilon squared values throughout our revised Table 5.

Comment: Figures, especially those depicting subgroup differences (e.g., Figure 2), should include confidence intervals or error bars to convey precision.

Response: Thank you for your valuable feedback regarding figure presentation and statistical validation. We agree that figures must convey precision. We have updated our box-and-whisker plot (figure) so that it no longer contains mean figures. Our updated plot displays the median, IQR (the box), and range, making it the statistically appropriate figure for our non-parametric analysis (consistent with table 5).

Comment: Missing data handling is not described and should be clarified, particularly given the retrospective nature of the study.

Response: Thank you for this important comment. We have now clarified how missing 12-month weight data were imputed in the treatment estimand by inserting the following sentence into the final paragraph of subsection 2.5 (endpoints):

Comment: Engagement, measured through medication orders and weight-tracker use, should be explored with mediation or moderation analyses to better understand the causal pathway linking engagement to outcomes.

Response: Thank you for this excellent suggestion. We agree that these approaches are valuable tools for establishing causal mechanisms. We believe that our current methodology sufficiently explores the relationships in the following ways:

Confounding: We used Propensity Score Matching (PSM) to robustly handle the primary concern of confounding between medication brand and adherence.
Multivariable Adjustment: Our Multiple Linear Regression model (Table 4) already explores the unique, independent contributions of engagement (Weight Tracker Use) and exposure (Medication Count) simultaneously on the outcome. The model demonstrates that Weight Tracker Use is a highly significant predictor, independent of the dose-based exposure metric.
Medication count-Response: The Kruskal-Wallis post-hoc analysis (Table 5) clearly maps the non-linear response relationship between increasing adherence (orders and tracking frequency) and clinical outcomes.

Comment: Finally, the introduction would benefit from integrating literature on digital engagement as a determinant of adherence and weight loss, as well as providing a more robust justification for the use of estimands in real-world observational research.

Response:

Thank you for this important suggestion. We have now added three paragraphs to the introduction. The first discusses the importance of engagement and adherence in digital health interventions, referencing 4 additional studies (lines 76-83):

Reviewer 3 Report

Comments and Suggestions for Authors

First of all, I would like to thank the authors for the opportunity to review this manuscript.

This manuscript addresses an important and timely question on how patient engagement patterns influence weight-loss outcomes in real-world, lifestyle-supported pharmacological programs. The topic is clinically meaningful and the dataset is rich.

That said, several important methodological and reporting aspects need clarification or refinement before the findings can be interpreted with confidence. The following points are offered in a constructive spirit to strengthen the rigor, transparency, and reproducibility of the work.

Comments for author File: Comments.pdf

Author Response

Reviewer Report: The manuscript “Patient Outcomes Under Varying Engagement Patterns on Real-World Lifestyle-Supported Pharmacological Weight-Loss Therapy” addresses a timely and clinically relevant question: how engagement patterns influence weight-loss outcomes under real-world, lifestyle-supported pharmacological programs. The topic is important, the dataset is large, and the authors tackle a meaningful aspect of treatment implementation. However, several important analytical choices and definitions may limit the interpretability and external validity of the findings. The following comments are intended to be constructive and to help strengthen the methodological transparency and scientific robustness of the work. I recommend major revisions.

Comment 1: The efficacy estimand includes only patients with > 8 orders and available weight between 341–379 days. The treatment estimand further excludes those with > 90-day pauses or > 15 orders, removing nearly 3,000 participants from the initial cohort. This is no longer a treatment estimand in an ITT/real-world sense but rather a selected subgroup, introducing potential attrition and selection bias that may inflate or distort the average outcomes. I think that a clearer and more transparent analytical strategy would include:
- Presenting a true ITT analysis including all 7279 patients with an appropriate handling of missing data.
- Separating per-protocol and as-treated analyses without excluding participants with > 90-day gaps from the main models.
- Providing a strong a priori rationale for the > 90-day threshold (which currently appears post-hoc) and conducting sensitivity analyses that include these participants.

Response 1: Thank you for this insightful recommendation. We have now included all patients with extended pauses who were previously excluded. We have updated subsection 2.5 (endpoints) to reflect this, including an explanation as to why 90-day pause length was chosen as the threshold for one of the efficacy estimand criteria. The first two paragraphs of this subsection now read as follows:

“To generate meaningful and robust findings, patient outcomes were assessed based on two primary estimands, defined in accordance with the International Conference on Harmonisation E9(R1) framework:

Efficacy estimand: This estimand reflects the biological efficacy of the treatment under the hypothetical condition of full protocol adherence. It corresponded to patients who received between 8 and 15 medication orders, reported weight measurements within a 12-month post-initiation assessment window (341–379 days), and did not pause treatment for any longer than 90 days.
Treatment (Intention-to-Treat) estimand: This estimand reflects the treatment strategy's effectiveness in a real-world setting. It included all patients in the Efficacy Estimand plus those who demonstrated limited adherence (received fewer than 8 or more than 15 orders, paused for longer than 90 days, or received between 8-15 orders but did not track weight within a 12-month post-initiation assessment window (341-379 days)).

The estimand cohort was selected to reflect the optimal biological efficacy under conditions of sustained adherence, requiring exclusion criteria to mitigate the impact of major intercurrent events. Patients were required to have received between 8 and 15 medication orders over the 12-month period. The minimum of 8 orders ensured adequate treatment exposure, while the maximum of 15 orders accounted for logistical variations such as pre-ordering for travel, without including excessive ordering that would suggest protocol deviation. A single or cumulative pause in medication supply exceeding 90 days resulted in exclusion from this estimand. This 90-day cutoff serves as an established determinant of non-adherence over a 12-month assessment period in real-world studies [19], and was necessary to minimize the confounding risk associated with prolonged periods of non-exposure to the study medication.”

We have also updated the description of the groups at the bottom of table 2 of the results:

Efficacy Estimand: received between 8-15 medication orders, recorded their weight within 341-379 days post initiation, and did not pause treatment for any longer than 90 days.

Adherent patients: received at least 8 medication orders but did not record their weight within 341-379 days post program initiation (includes those who paused treatment for longer than 90 days); or received between 8-15 medication orders, recorded their weight within 341-379 days post initiation, but paused treatment for longer than 90 days.

Remaining patients: received fewer than 8 medication orders or more than 15 orders.

Treatment estimand: all three groups combined.

Given the size of the treatment estimand has now increased, we have updated all results concerning that cohort throughout the manuscript.

Comment 2: Differences observed in favor of Wegovy may be confounded by engagement-related variables such as pause frequency and pause duration, as well as by dose availability, cost, and product access. These factors are likely confounders and are not accounted for by a simple Student's t-test. In my opinion, a more robust approach would involve a model adjusting for confounders (DAG analysis recommended). Alternatively, propensity score techniques (matching or weighting, e.g., IPTW) could be applied for the brand comparison. Please also report effect sizes and 95% confidence intervals alongside p-values.

Response 2: We strongly agree that the unadjusted comparison was insufficient given the potential for confounding, particularly concerning adherence behaviors (pause incidence). We have implemented the recommended Propensity Score Matching technique, addressing the need for a more robust causal inference approach. This is now explained in detail in the second paragraph of subsection 2.6 (lines 304-313):

“To robustly compare the effect of Wegovy versus Ozempic on 12-month weight loss while adjusting for potential confounding variables (e.g., adherence behaviours and patient selection), Propensity Score Matching (PSM) was employed. Propensity scores were estimated using logistic regression, modelling the probability of receiving Wegovy (the treatment) based on the following covariates: age, initial BMI, initial weight, ethnicity, and pause incidence (yes/no). Nearest neighbour matching (1:1) with a caliper of 0.2 of the logit standard deviation was used to create the final matched cohort. The primary comparison of weight loss percentage within the matched cohort was then conducted using the Mann–Whitney U test, and the magnitude of the adjusted difference was reported using Cohen's d and the 95% Confidence Interval of the mean difference from a linear model.”
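The matching step in the paragraph above can be illustrated with a small sketch. It assumes propensity scores have already been estimated, and it uses a simple greedy 1:1 nearest-neighbour rule on the logit scale with a caliper of 0.2 standard deviations. The function names are hypothetical and the greedy order is an assumption; the study's actual R implementation may differ.

```python
import math
import statistics


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def match_nearest(treated_ps, control_ps, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Returns (treated_index, control_index) pairs whose logit propensity
    scores differ by no more than caliper_sd * SD(logit scores).
    """
    t_logit = [logit(p) for p in treated_ps]
    c_logit = [logit(p) for p in control_ps]
    caliper = caliper_sd * statistics.stdev(t_logit + c_logit)

    available = set(range(len(c_logit)))
    pairs = []
    for i, lt in enumerate(t_logit):
        if not available:
            break
        # Closest unused control; sorted() keeps tie-breaking deterministic.
        j = min(sorted(available), key=lambda k: abs(c_logit[k] - lt))
        if abs(c_logit[j] - lt) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs
```

Treated patients with no control inside the caliper stay unmatched, and unused controls are discarded, which is how a matched cohort can end up smaller than the original one.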

The PSM analysis successfully created 536 matched pairs by adjusting for key confounders (age, BMI, initial weight, pause incidence, cost proxy, and initial titration). This process eliminated baseline imbalances, with the Standardized Mean Difference (SMD) for all covariates dropping below the critical 0.1 threshold. The results are reported in subsection 3.2, as follows:

“Among patients in the efficacy estimand, 536 (31.94%) received Wegovy and 1142 (68.06%) received Ozempic (Table 1). Initial unadjusted analysis revealed a significant baseline imbalance in patient characteristics, notably in pause incidence (Wegovy 13.11% vs. Ozempic 40.76%; Std. Mean Diff. = -0.82). To account for this confounding, Propensity Score Matching (PSM) was performed, creating 536 matched pairs that were highly comparable across all measured covariates (all Std. Mean Diff. < 0.1).

The subsequent non-parametric (Mann-Whitney U) test on the matched cohort revealed that the difference in 12-month weight loss percentage between participants using Ozempic and those using Wegovy remained highly statistically significant (W = 111385, p = 1.893 × 10⁻¹⁰).

In the PSM-matched cohort, participants treated with Ozempic experienced a significantly lower median weight loss percentage (Median = 14.0%; IQR [9.6%; 19.0%]) than those who received Wegovy (Median = 17.0%; IQR [12.0%; 23.0%]). The adjusted median difference was 3.0 percentage points. The effect size was medium and favoured Wegovy (Cohen's d = 0.38; 95% CI [0.26, 0.51]).

Correspondingly, chi-square tests on the unadjusted efficacy cohort revealed that a statistically higher proportion of Wegovy users reached key weight loss thresholds (Lost > 5%, > 10%, > 15%, all p < 0.001). No significant difference was observed between the two groups in the frequency of patients who experienced weight stability or increase (W = 2.05% vs. O = 2.71%, p = 0.5).”
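As a worked check of the standardized mean difference quoted for pause incidence (Wegovy 13.11% vs. Ozempic 40.76%), the sketch below computes the SMD for a binary covariate under two common conventions. Which convention the analysis used is an assumption here; the function names are illustrative only.

```python
import math


def smd_binary_treated_sd(p_treated: float, p_control: float) -> float:
    """Difference in proportions divided by the treated-group SD."""
    sd_treated = math.sqrt(p_treated * (1.0 - p_treated))
    return (p_treated - p_control) / sd_treated


def smd_binary_pooled_sd(p_treated: float, p_control: float) -> float:
    """Difference in proportions divided by the pooled SD of both groups."""
    pooled_var = (p_treated * (1.0 - p_treated) + p_control * (1.0 - p_control)) / 2.0
    return (p_treated - p_control) / math.sqrt(pooled_var)


treated_sd_value = round(smd_binary_treated_sd(0.1311, 0.4076), 2)  # -0.82
pooled_sd_value = round(smd_binary_pooled_sd(0.1311, 0.4076), 2)    # -0.66
```

The reported value of -0.82 is reproduced when the difference is standardized by the Wegovy (treated) group's standard deviation, whereas pooling the two groups' variances gives -0.66. Either way, the absolute value is far above the 0.1 threshold.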

Comment 3: Although the Shapiro–Wilk test indicated non-normality, the manuscript continues to rely on t-tests for robustness and reports means (and SD) while using Mann–Whitney tests for non-independent groups. If non-parametric tests are used, medians and IQRs should be reported (means may be retained descriptively). Please also clarify whether multiple-comparison corrections (such as Holm, for example) were applied.

Response 3: Thank you for your critical assessment of our statistical methodology. We agree that reliance on parametric tests (like t-tests and ANOVA) was inappropriate given the significant non-normality demonstrated by the Shapiro–Wilk tests, and we acknowledge the necessity of controlling the Family-wise Error Rate (FWER) for multiple post-hoc comparisons.

We have now addressed these issues by implementing the following changes throughout the manuscript, particularly in sub-section 2.6 and the Results section:

Non-Parametric Inference: All primary comparisons involving the continuous outcome of weight loss percentage now utilize robust non-parametric methods (Mann–Whitney U test and Kruskal–Wallis test).
Reporting: All continuous results derived from non-parametric tests now report Medians and Interquartile Ranges (e.g., in Table 5 and the PSM analysis in Section 3.2), with means retained only for descriptive context.
Multiplicity Correction: We have clarified in the Methods (Section 2.6) that all subsequent post-hoc comparisons (e.g., comparing age and order count groups) were subjected to the Holm-Bonferroni method to rigorously control the risk of Type I errors.
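For readers unfamiliar with the Holm-Bonferroni procedure named above, the following sketch shows how adjusted p-values are obtained. It is a generic illustration in Python with a hypothetical function name, not the code used for the manuscript.

```python
def holm_adjust(pvalues):
    """Return Holm-Bonferroni adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each value
    is capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

A comparison is declared significant at level alpha when its adjusted p-value is at or below alpha, which controls the family-wise error rate across the set of post-hoc tests.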

Comment 4: A transparent description of missingness patterns and the chosen strategy to address them is essential. The current inclusion requirement of “weight within 341–379 days” risks informative censoring, as adherent participants are precisely those more likely to report their weights. Please include sensitivity analyses with a larger time window, multiple imputation or pattern-based approaches, and conservative scenarios (worst-case imputations).

Response 4: Thank you for this important comment. We have now added five lines to the final paragraph of sub-section 2.5 (endpoints) to explain how we managed missing data in the updated treatment estimand. These lines (236-240) read as follows:

“Missing 12-month weight data in the treatment estimand were imputed using the Last Observation Carried Forward (LOCF) method. For patients who had no post-baseline weight submission, the initial (baseline) weight was carried forward, resulting in 0% weight loss for these individuals.”
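The imputation rule quoted above can be sketched as follows. This is an illustrative Python version with hypothetical function and argument names; it assumes each patient has at most one weight report inside the 12-month window and a list of earlier post-baseline reports given as (day, kg) pairs.

```python
from typing import List, Optional, Tuple


def weight_loss_pct(baseline_kg: float, final_kg: float) -> float:
    """Percentage weight loss relative to baseline."""
    return 100.0 * (baseline_kg - final_kg) / baseline_kg


def impute_final_weight(
    baseline_kg: float,
    window_weight_kg: Optional[float],
    earlier_reports: List[Tuple[int, float]],
) -> float:
    """Use the observed 12-month weight when present; otherwise carry the
    last post-baseline observation forward; with no post-baseline report,
    carry the baseline weight forward (which yields 0% weight loss)."""
    if window_weight_kg is not None:
        return window_weight_kg
    if earlier_reports:
        last_day, last_kg = max(earlier_reports, key=lambda report: report[0])
        return last_kg
    return baseline_kg
```

For example, a patient with a 100 kg baseline and no post-baseline submissions is assigned 100 kg at 12 months, so `weight_loss_pct` returns 0.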

Comment 5: The rationale behind using 8 orders as a threshold for reasonable adherence and 3 months as an exclusion cut-off is unclear to me. A pre-specified clinical or organizational justification or references are needed. Alternatively, consider modelling number of orders and pause length as continuous exposures.

Response 5: Thank you for requesting this crucial clarification regarding the exclusion criteria used to define our efficacy estimand. We agree that a formal justification is vital for methodological rigor.

We have addressed this in two ways:

We have provided the requested formal rationale by adding a new paragraph to subsection 2.5 of the Methods. This rationale grounds the 90-day pause limit in established adherence literature and provides the organizational/clinical justification for the order count parameters.

“The estimand cohort was selected to reflect the optimal biological efficacy under conditions of sustained adherence, requiring exclusion criteria to mitigate the impact of major intercurrent events. Patients were required to have received between 8 and 15 medication orders over the 12-month period. The minimum of 8 orders ensured adequate treatment exposure, while the maximum of 15 orders accounted for logistical variations such as pre-ordering for travel, without including excessive ordering that would suggest protocol deviation. A single or cumulative pause in medication supply exceeding 90 days resulted in exclusion from this estimand. This 90-day cutoff serves as an established determinant of non-adherence over a 12-month assessment period in real-world studies [19], and was necessary to minimize the confounding risk associated with prolonged periods of non-exposure to the study medication.”
We confirm that we already modelled both factors as continuous variables, satisfying the reviewer's suggested alternative analysis. In our Multiple Linear Regression model (Table 4) and supplementary Kruskal-Wallis analysis (Table 5), we explicitly assessed the impact of Medication Order Count and Longest Pause Length as continuous/categorical exposures on weight loss percentage across the efficacy estimand.

Comment 6: The observed association between higher side-effect incidence and greater weight loss likely reflects exposure intensity (I think that patients who escalate doses experience more aes and also lose more weight). This warrants a confounding analysis.

Response 6: Thanks for this important comment. We agree with the premise that the association between side-effect incidence and weight loss is often confounded by exposure intensity (cumulative dose/treatment duration). The potential confounding relationship between side effects and exposure intensity was addressed in the Multiple Linear Regression model (Table 4). The model explicitly included both Side Effect Incidence and Medication Count (our primary proxy for exposure intensity) as independent predictors.

The fact that Side Effect Incidence (p < 0.001) remains a significant predictor after controlling for the effect of Medication Count (p < 0.001) confirms that the association is not solely due to treatment duration or dose escalation. Instead, it suggests that patients who experience side effects may be a distinct subgroup (perhaps more metabolically responsive) or are more compliant and engaged, leading to greater weight loss irrespective of the number of orders they received.

Comment 7: For example, Table 5 reports a mean loss of 11.83% with 10 orders, while the Conclusions mention 10.83%. A systematic audit of tables and figures is needed to ensure numerical accuracy throughout.

Response 7: Thank you for identifying the critical numerical error. We have completed a systematic audit across all figures, tables, and the text and confirm that the correct value for the median (following the methodological change) weight loss for the 10 Orders group is 11.87%.

This correction has been applied universally across all relevant sections, including the Results (Section 3), Table 5, and the Conclusions, ensuring complete numerical consistency in the final manuscript.

Comment 8: The sample may limit generalizability. More importantly, the stated BMI threshold adjustment for non-Caucasian participants requires clarification and appropriate referencing in order to avoid over-adjustment.

Response 8: Thank you for raising this critical point regarding the potential for selection bias and the need for clarification on the BMI threshold.

We acknowledge that the sample's lack of diversity limits generalizability, which is noted in the limitations section. More importantly, we agree that the adjustment of the BMI threshold for non-Caucasian participants requires explicit justification, as it is a key methodological decision.

To address this concern, we have added the following 3 sentences to subsection 2.2 (with a citation):
“Inclusion criteria included a BMI of ≥30 kg/m² for the general population. However, reflecting clinical guidelines that account for heterogeneous disease risk, the threshold was lowered to ≥27 kg/m² for patients with at least one weight-related comorbidity (e.g., symptomatic cardiovascular disease, disruptive sleep apnoea) or patients of non-Caucasian ethnicity. The lower BMI threshold for non-Caucasian individuals is justified by evidence indicating that certain ethnic groups (e.g., South Asian populations) have an elevated risk for cardiometabolic complications at lower BMI values [24].”

Comment 9: Ethics approval postdates the recruitment period. Please clarify the timing and retrospective nature of data analysis and confirm that data use was covered by pre-existing ethical policies. Given that several authors are employees or consultants of the service provider, additional safeguards are advisable, if feasible please share R pipelines and analysis plans.

Response 9: Thank you for your detailed attention to our ethical and methodological framework. We agree that these points are critical for maintaining the integrity and interpretability of our findings in a real-world, service-provider setting.

Ethics Approval and Retrospective Design

You accurately note that the Ethics Approval date (11 August 2025) postdates the patient recruitment period (1 January 2023 – 1 May 2024). This is a function of the study's retrospective design.

Data Collection Policy: All patient data were collected for the purpose of clinical auditing, care provision, and continuous quality improvement under the service provider's pre-existing standard operating procedures and privacy policy. The service provider’s procedures relating to data protection and information security management have been accredited to the ISO27001 standard. Patient consent for the use of de-identified, aggregated data for service improvement and research was obtained at program initiation.
Retrospective Approval: The retrospective study protocol was submitted to and approved by the Just Reasonable Independent Research Ethics Committee (IREC015) to ensure the specific use of the de-identified and aggregated dataset for this academic publication met all necessary ethical standards. This ex-post facto review and approval is the required standard practice for retrospective studies utilizing existing clinical registry data.

Conflict of Interest (COI) and Safeguards

We confirm the affiliation of several authors with the service provider (Juniper Health) and fully acknowledge the potential for perceived conflict of interest. To mitigate any bias and ensure the scientific integrity of the analysis, the following safeguards were implemented:

IREC Oversight: The study design, objectives, and ethical use of the data were reviewed and approved by the Independent Research Ethics Committee.
Data Aggregation and Anonymization: The dataset used for analysis was fully de-identified and aggregated prior to being made available to the authors, ensuring patient confidentiality throughout the research process.
Objective Endpoints: The primary and secondary endpoints (e.g., mean percentage weight loss, weight loss milestones, orders received) are objective, quantitative metrics minimizing subjective interpretation.

R Pipelines: In the spirit of open science and to ensure full reproducibility, the R code and analysis pipelines used for the descriptive statistics, ANOVA, T-tests, and the Multiple Linear Regression model will be made available upon manuscript acceptance, via a publicly accessible data repository (e.g., GitHub or OSF).

Minor Revisions (editorial or methodological refinements)

Add histograms or density plots for %WL, number of orders, tracking frequency, and pause duration to justify non-parametric methods.

Response: We appreciate the request for methodological transparency and agree that providing visual evidence is the best way to justify our switch to non-parametric methods. We have created a Supplementary Figure (Figure S1) that includes density plots and histograms for the four requested variables: 12-month % weight loss, number of orders, tracking frequency, and pause duration.

These plots (especially the non-normal distributions of the primary outcome and the heavy skewness of the engagement metrics) serve as the formal justification for moving away from parametric tests (ANOVA, t-tests) and utilizing Kruskal-Wallis and Propensity Score Matching (PSM) in the final analysis.

Consider time-to-event analyses for reaching study objectives.

Response: Thanks for this comment. Time-to-event methods are indeed valuable tools for understanding the velocity of weight loss. However, we respectfully believe these analyses are not central to the primary objective of this retrospective study.

Define side-effect severity, reporting window, and denominator (e.g., per person-month).

Response: Thank you for this excellent request. We have now added the following text to the end of subsection 2.2:
“Patients are instructed to report side effects and associated severity levels to their MDT whenever they arise. Side effect severity was determined by patients, who were given the following matrix to guide their assessment:

Mild: Side effect is tolerable and easy to manage
Moderate: Side effect is noticeable but ok to manage
Severe: Side effect is uncomfortable and hard to manage”

And these 2 sentences to the end of subsection 2.5:

“Side effect incidence was assessed as a binary variable (yes/no), recording whether a patient experienced any side effects during the 12-month program. Investigators also analyzed the distribution of side effect severity based on each patient’s single highest severity rating reported during that time.”

If feasible please explore cost as a potential dropout determinant.

Response:

Thanks for highlighting the critical importance of cost as a determinant of dropout in unsubsidized DWLSs. We have revised our final sentence of the discussion section, which now reads as follows:

“investigators did not have access to patient income data and therefore could not determine whether socioeconomic status influenced program outcomes or adherence. Despite the absence of these data, it is likely that the high program cost led to underrepresentation among lower socioeconomic groups.”

Provide de-identified scripts and data dictionaries to enhance reproducibility; “available upon reasonable request” is insufficient given the conflicts of interest.

Response: Thank you for this critical recommendation. We hereby confirm that we will not use the phrase "available upon request." Instead, upon acceptance of the manuscript, we commit to making the following resources publicly available in a reputable, open-access repository (e.g., OSF or GitHub) to ensure full reproducibility:

Statistical Analysis Plan (SAP): The final, pre-specified plan governing the analyses.
De-identified Code: The complete R pipeline used for data cleaning, Propensity Score Matching (PSM), non-parametric testing, and the Multiple Linear Regression model will be shared.
Data Dictionary: A detailed dictionary defining all variables used in the final analysis (e.g., max_order_count, ethnicity_binary, weight_loss_perc_at_12), along with their origin and transformation method.

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you for the revision. My recommendation remains Major Revision. Please address point-by-point in the attached file.

Comments for author File: Comments.pdf

Author Response

Comment 1: Your current text defines the treatment estimand as including patients with prolonged pauses and those missing 12-month weights, while the efficacy estimand excludes pauses > 90 days. In the limitations section you state that patients with extended pauses were also excluded from the treatment estimand. These positions are contradictory and create confusion throughout the Results and tables. Please provide a single unambiguous definition for each estimand and ensure it is applied consistently in eligibility, the flow diagram and all figures and tables.

Response 1: Thank you for identifying this confusion in the definitions of our two estimands. To address this, we have now removed the two insignificant columns from table 2 (now table 4), which deconstructed patients into ‘adherent’ and ‘remaining’ (ultimately subgroups of the treatment estimand). We have updated the descriptors underneath the table to read as follows (for some reason Word wouldn’t allow us to track the change of removing the 2 columns):

Efficacy Estimand: received between 8-15 medication orders, recorded their weight within 341-379 days post initiation, and did not pause treatment for any longer than 90 days.

Treatment Estimand: any patient who commenced the Juniper DWLS within the study period, irrespective of medication order count, pause duration, and weight data submissions. Missing data were imputed via BOCF.

We believe these are now consistent with the definitions in subsection 2.5 of our methods:
2.5 Endpoints

To generate meaningful and robust findings, patient outcomes were assessed based on two primary estimands, defined in accordance with the International Conference on Harmonisation E9(R1) framework:

Efficacy estimand: This estimand reflects the effectiveness of the treatment under the hypothetical condition of full protocol adherence. It corresponded to patients who received between 8 and 15 medication orders, reported weight measurements within a 12-month post-initiation assessment window (341–379 days), and did not pause treatment for any longer than 90 days.
Treatment (Intention-to-Treat) estimand: This estimand reflects the treatment strategy's effectiveness in a real-world setting. It included all patients in the efficacy estimand plus those who demonstrated limited adherence (received fewer than 8 or more than 15 orders, paused for longer than 90 days, or received between 8 and 15 orders but did not track weight within the 12-month post-initiation assessment window (341–379 days)).

We confirm that our previously revised limitations section made no claim about the exclusion of patients with prolonged pauses from the treatment estimand. We believe the contradiction arose from the complex definitions originally stated in table 2 and the corresponding data on ‘adherent’ and ‘remaining’ patients (which has now been removed).

“The study contained several limitations. Firstly, the study’s sample was predominantly Caucasian (82.8%) and female (91.2%), which may limit the generalizability of findings across diverse populations. Secondly, all weight and side effect data were self-reported, and may have been affected by various biases. Thirdly, quality of life and body composition data were not systematically collected by Juniper, which prevented investigators from extending upon the clinical relevance of the study’s findings. Fourthly, discontinuation reason data was not available for this study and thus investigators were not able to enrich treatment estimand findings with discontinuation categories. Finally, investigators did not have access to patient income data and therefore could not determine whether socioeconomic status influenced program outcomes or adherence. Despite the absence of these data, it is likely that the high program cost led to underrepresentation among lower socioeconomic groups.”

Comment 2: Multiple places still contain impossible values or corrupted numbers (79,124 109.8795% excluded for missing weight data; garbled entries and extreme percentages in Table 2 and related narrative). Please perform a line-by-line numerical audit of the manuscript and supplements and submit a clean, corrected set of tables with a reconciliation appendix.

Response 2: Thank you for detecting some inconsistencies in the numbers throughout our manuscript. We have now changed multiple figures in the section underneath table 2 (now table 4). They are in bold text below:

‘The average number of medication orders was notably higher in the efficacy group compared to the treatment group (E 12.98; T 6.97), a difference that was statistically significant (W = 5618985, p < 0.001). Treatment pause prevalence was significantly lower in the efficacy group (E 31.94%; T 57.02%), with a mean pause length of 14.17 days versus 106.57 days in the treatment group. Side effects were reported in 75.98% of patients in the efficacy group and 54.07% in the treatment group. However, no statistically significant difference was observed in the distribution of time to first side effect between groups, with a mean of 43.87 days for the efficacy group and 36.18 days for the treatment group.’

Regarding your comment “…corrupted numbers (79,124 109.8795% excluded for missing weight data”, we are sorry but we could not find these numbers anywhere in our manuscript.

Comment 3: Your Methods state that normality tests were significant, so nonparametric methods were the primary approach. Yet Section 3.2 runs an independent-samples t-test on the same contrast, arguing robustness with large n. Later you also state the t-test is inappropriate when groups are not independent (treatment vs efficacy). This is inconsistent and risks the appearance of fishing for significance. Please declare one primary test per contrast (parametric or non-parametric), justified a priori, and treat any alternative as a sensitivity analysis.

Response 3:

Thank you for identifying this inconsistency. We agree that the manuscript lacked a clear, singular declaration of the primary statistical method and appreciate the opportunity to clarify our approach and eliminate any potential perception of "fishing" for significance. We have now clarified in section 2.6 that:

‘non-parametric methods were declared a priori as the primary approach for inference involving continuous outcomes’
We have also added the following sentence to the middle of the section to elucidate the statistic used for our Mann Whitney results:
‘The Mann-Whitney U test results are reported using the Wilcoxon Rank-Sum statistic (W), where W is the sum of the ranks for the smaller group.’
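To make the rank-based statistic concrete, the sketch below computes both the rank sum of a group and the corresponding U-type statistic (rank sum minus n(n+1)/2), using midranks for ties. It is a generic Python illustration with hypothetical function names. Which of the two quantities a given package labels W varies (R's `wilcox.test`, for instance, reports the U-type value), so the convention is an assumption that should be stated alongside the statistic.

```python
def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


def rank_sum_and_u(x, y):
    """Return (rank sum of x in the pooled sample, U-type statistic for x)."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2.0
    return rank_sum_x, u_x
```

The U-type value ranges from 0 to n_x × n_y, while the rank sum can never fall below n_x(n_x + 1)/2, so the two conventions give visibly different numbers for the same data.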

In section 3.3, we have removed the confusing sentence on independent t-tests (‘Although quantile-quantile plots suggested approximate normality in both efficacy and…’) and replaced the parenthesized mean and SD figures with median and IQR statistics:

‘A Mann-Whitney U test revealed that mean 12-month weight loss percentage was significantly higher in the efficacy group (Median = 15.0%; IQR [10.0%; 21.0%]) compared to the treatment group (Median = 5.9%; IQR [0.6%;13.0%]) (W = 4902466, p < 0.001).’

Comment 4: You note that all covariates were balanced and formed 1:1 matched pairs (n=536) with significant differences favoring Wegovy. To assess the credibility of this causal contrast, please provide a love plot and a table of SMDs pre- and post-match for all covariates; distribution/overlap plots of the propensity scores; and the matching caliper and any discarded units (% and reasons). Please also provide a clear statement of missing-data handling for covariates used in the PS model, and clarify whether your primary effect is on the median (Mann–Whitney / Hodges–Lehmann) or mean (Cohen’s d). Reporting both without prespecification is confusing.

Response 4: Thank you for this important observation. We concur that full diagnostics are essential for validating our Propensity Score Matching (PSM) analysis and the causal inference drawn from the subgroup comparison. We have now added a new dedicated section, 3.2.1. Propensity Score Matching Diagnostics, which reads as follows:

3.2.1 Propensity Score Matching Diagnostics

The PSM successfully achieved covariate balance. Table 2 presents the Standardized Mean Differences (SMD) for all covariates before and after matching; all post-match SMDs were below 0.1. The visual improvement in balance is demonstrated by the Love Plot (Figure 1), where covariate balance is achieved across all measured variables. Furthermore, the overlap of propensity score distributions, shown in the PS Distribution Plot (Figure 2), indicates satisfactory common support across the treatment and control groups post-matching.

We have also now clarified in Section 2.6 that all covariates used in the PSM model had complete data, removing the need for imputation prior to PS estimation (line 258: ‘All covariates used in the PSM model had complete data.’) And we have added that:

“The d statistic is presented to provide a standardized effect size for the difference in means, complementing the primary inference drawn from the non-parametric test (Mann-Whitney U) on medians.” (lines 263-265)
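For reference, the standardized effect size mentioned above is conventionally computed as the difference in group means divided by the pooled standard deviation. The sketch below is a generic Python illustration with a hypothetical function name; it is not taken from the study's R code.

```python
import math
import statistics


def cohens_d(x, y):
    """Cohen's d: (mean of x - mean of y) / pooled sample SD."""
    n_x, n_y = len(x), len(y)
    pooled_var = (
        (n_x - 1) * statistics.variance(x) + (n_y - 1) * statistics.variance(y)
    ) / (n_x + n_y - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)
```

Because d summarizes a difference in means, it complements rather than replaces a rank-based test such as the Mann-Whitney U, which is why the two are reported side by side.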

Furthermore, we have added a reference to a new Table 2 (SMDs pre/post-match) and a new Love Plot (Figure 2), confirming post-match SMD < 0.1.

At the bottom of section 3.2, we clarified the use of the 0.2 logit caliper and reported the exclusion of patients:

“The matching process resulted in the exclusion of 606 patients (36.1%) who were outside the common support region or could not be matched within the 0.2 caliper.”
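The exclusion figure can be reconciled with the group sizes reported for the efficacy estimand (536 Wegovy, 1142 Ozempic, 536 matched pairs). The short check below is only arithmetic on those reported counts; the variable names are illustrative.

```python
wegovy_n = 536
ozempic_n = 1142
matched_pairs = 536

cohort_n = wegovy_n + ozempic_n              # patients in the efficacy estimand
excluded_n = cohort_n - 2 * matched_pairs    # patients left without a match
excluded_pct = round(100 * excluded_n / cohort_n, 1)
```

This gives 606 excluded patients out of 1678 (36.1%). Since the number of matched pairs equals the number of Wegovy patients, all excluded patients in this reconciliation come from the Ozempic group.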

Comment 5: The treatment estimand imputes missing 12-month weights via LOCF, carrying baseline forward (at zero percent loss) for participants without any post-baseline weight. This approach can bias treatment effects and inflate variance. Please add prespecified sensitivity analyses, for example multiple imputation under plausible mechanisms and/or mixed-effects models. Also quantify how conclusions change across these scenarios.

Response 5: Thank you for your critical assessment of our missing data handling. We fully agree that the primary analysis using only LOCF for the treatment estimand was overly conservative and potentially biased. We have now changed LOCF to baseline observation carried forward (BOCF – which is what we had actually used to calculate the mean weight loss figure for the treatment estimand – 7.88%) and removed the now redundant sentence:
‘For patients who had no post-baseline weight submission, the initial (baseline) weight was carried forward, resulting in 0% weight loss for these individuals.’

Moreover, we have now added a sensitivity analysis to the manuscript, including the following methods section 2.7:

2.7: Sensitivity Analysis for Missing Outcome Data

To assess the robustness of the primary treatment estimand findings, which employed the conservative BOCF method, a prespecified sensitivity analysis was performed comparing the BOCF results to two alternative scenarios: Complete Case Analysis (CCA) and Multiple Imputation (MI). CCA included only those patients with observed (non-missing) 12-month weight loss data. MI was conducted under the assumption of Missing at Random (MAR) using the Multiple Imputation by Chained Equations (MICE) framework [26]. Twenty imputed datasets were generated, and pooled results were calculated using Rubin's rules [27]. The imputation model included the primary outcome variable (12-month weight loss percentage) and the following prognostic factors and auxiliary variables: age, initial BMI, initial weight, ethnicity (binary), pause incidence (binary), product type, and total medication orders. The analysis compared the median weight loss percentage and the proportion achieving clinical milestones (≥5%) across the three scenarios (BOCF, CCA and MI) to quantify how conclusions change under different missing data assumptions.
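The pooling step referred to above (Rubin's rules) combines the estimates from the imputed datasets into one point estimate and one total variance. The sketch below is a generic Python illustration with a hypothetical function name. It assumes each imputed dataset yields an estimate with an approximately normal sampling distribution and a within-imputation variance; it does not reproduce the MICE imputation itself.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their variances.

    Returns (pooled estimate, total variance), where
    total variance = within-imputation variance
                     + (1 + 1/m) * between-imputation variance.
    """
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    within_var = statistics.mean(variances)
    between_var = statistics.variance(estimates)  # uses the m - 1 denominator
    total_var = within_var + (1.0 + 1.0 / m) * between_var
    return pooled_estimate, total_var
```

With twenty imputations, as described, the between-imputation component is inflated by a factor of 1.05, reflecting the extra uncertainty from using a finite number of imputed datasets.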

And an additional subsection (3.5) and table (4) in the results section:

3.5 Sensitivity Analysis of the Treatment Estimand

The sensitivity analysis was conducted to assess the robustness of the primary treatment estimand outcomes, which employed the conservative BOCF method. Results were compared against CCA and MI under the MAR assumption (Table 4). The analysis confirms that the primary BOCF method was highly conservative. The median weight loss percentage increased substantially from the BOCF estimate of 5.90% to 7.40% (CCA) and 8.25% (MI). Similarly, the proportion of patients achieving the 5% milestone increased from 54.21% (BOCF) to 62.40% (CCA) and 57.59% (MI). The increase in both median weight loss and milestone achievement observed under the MI scenario suggests that the primary (BOCF) analysis substantially underestimated the effectiveness of the treatment strategy. Nevertheless, the overall finding that outcomes in the treatment estimand are significantly inferior to the efficacy estimand remains consistent across all scenarios.

Table 4: Sensitivity Analysis of the treatment estimand outcome

Missing Data Handling Method	Median Weight Loss (%)	Achieving ≥5% Milestone (%)
BOCF	5.90	54.21
CCA	7.40	62.40
MI (MAR Assumption)	8.25	57.59

Comment 6: Because dose ceilings differ between products in routine care, the brand contrast is susceptible to confounding by indication/adherence. Please: a) Explicitly describe how brand was assigned in practice (clinical criteria, availability, patient preference, cost); b) State whether Ozempic use in non-diabetic patients in your setting constituted off-label prescribing and, if so, how that influenced eligibility, dose limits, and monitoring (context only; no external justification needed here); and c) Provide a sensitivity analysis restricted to comparable dose ranges using dose as a covariate/stratification variable, in addition to your PSM.

Response 6: Thank you for raising these crucial points regarding confounding by indication and dose in our medication brand contrast. We agree that rigorous disclosure of indication/dosing and a dose-restricted sensitivity analysis are essential.

We have addressed your requests through explicit disclosures in the Methods and the addition of a new sensitivity analysis.

We have now added the following three-sentence paragraph to Section 2.3, clarifying that brand assignment was determined by product availability and patient preference, not clinical indication:

“The choice between Wegovy and Ozempic was primarily determined by product availability at the time of prescribing. When both products were available, patient preference served as the assignment criterion. Neither medication was prescribed off-label.”

In Section 2.2 (line 147), we state that patients with “type 1 or type 2 diabetes” were excluded from the program.

Regarding your request for a dose-restricted sensitivity analysis, we have now added this subsection to the Methods:

2.8 Dose-Restricted Sensitivity Analysis

To assess the robustness of the medication comparison against potential confounding by dose, a prespecified sensitivity analysis was conducted. This analysis was restricted to the largest comparable dose range used by both products. The cohort was filtered to include only those patients in the Efficacy Estimand who never received a weekly dose exceeding 1.0 mg (i.e., they remained at or below the maximum dose shared by Wegovy and Ozempic). The primary outcome (12-month weight loss percentage) in this dose-restricted cohort was compared between brands using the Mann-Whitney U test.
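
The restriction and test described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions rather than the analysis code used in the study: the patient records, the field names (`brand`, `max_weekly_dose_mg`, `wl_pct`) and the helper names are hypothetical, and only the U statistic is computed (no p-value), using the direct pairwise definition with ties counted as one half.

```python
DOSE_CEILING_MG = 1.0  # maximum weekly dose kept in the restricted cohort


def restrict_by_dose(patients, ceiling_mg=DOSE_CEILING_MG):
    """Keep patients whose highest weekly dose never exceeded the ceiling."""
    return [p for p in patients if p["max_weekly_dose_mg"] <= ceiling_mg]


def mann_whitney_u(x, y):
    """U statistic for sample x versus y: count pairs with x > y, ties as 0.5."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in x
        for b in y
    )


# Hypothetical records; field names are assumptions made for this example.
patients = [
    {"brand": "Wegovy", "max_weekly_dose_mg": 1.0, "wl_pct": 17.0},
    {"brand": "Wegovy", "max_weekly_dose_mg": 2.4, "wl_pct": 22.0},
    {"brand": "Wegovy", "max_weekly_dose_mg": 0.5, "wl_pct": 13.0},
    {"brand": "Ozempic", "max_weekly_dose_mg": 1.0, "wl_pct": 14.0},
    {"brand": "Ozempic", "max_weekly_dose_mg": 0.5, "wl_pct": 9.0},
    {"brand": "Ozempic", "max_weekly_dose_mg": 1.0, "wl_pct": 17.0},
]

restricted = restrict_by_dose(patients)
wegovy = [p["wl_pct"] for p in restricted if p["brand"] == "Wegovy"]
ozempic = [p["wl_pct"] for p in restricted if p["brand"] == "Ozempic"]

u_wegovy = mann_whitney_u(wegovy, ozempic)
u_max = len(wegovy) * len(ozempic)  # largest possible value of U
```

In practice a statistics package would supply the p-value; the pairwise form is shown here only because it makes the definition of the statistic explicit.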

We have also added this subsection and table to the Results:

3.2.2 Dose-Restricted Sensitivity Analysis

The sensitivity analysis restricted the cohort to 567 patients who met all efficacy estimand criteria but never received a weekly dose exceeding 1.0 mg (Table 3). The analysis confirmed that the weight loss difference between the brands remained highly statistically significant (W = 20954, p < 0.001) even when the confounding effect of high doses was removed. In this restricted cohort, patients treated with Wegovy achieved a median weight loss of 17.0% (IQR: 13.0% to 23.0%), and those treated with Ozempic achieved a median weight loss of 14.0% (IQR: 8.52% to 21.0%). The adjusted mean difference favouring Wegovy was 3.55 percentage points (95% CI: 1.86 to 5.24), representing a medium effect size (Cohen's d = 0.42).
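
For reference, a standardised effect size of this kind is conventionally computed as the difference in group means divided by the pooled standard deviation. The sketch below assumes that pooled-SD form of Cohen's d; the response does not state which variant was used, and the input values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


# Hypothetical weight-loss percentages for two small groups.
d_example = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```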

Table 3: Dose-Restricted Sensitivity Analysis and PSM Medication Type Comparison

Outcome Metric	Full PSM Cohort	Restricted Dose Cohort
Cohort Size	1,072 matched pairs	567 patients
Wegovy Median WL % (IQR)	17.0% (12.0%-23.0%)	17.0% (13.0%-23.0%)
Ozempic Median WL % (IQR)	14.0% (9.6%-19.0%)	14.0% (8.52%-21.0%)
Adjusted Mean Difference (Wegovy – Ozempic)	3.0 percentage points	3.55 percentage points
Cohen’s d (Magnitude)	0.38	0.42
Mann-Whitney U Test p-value	<0.001	<0.001