Review Reports - Natural Language Processing-Driven Insights from Social Media: Topic Modeling and Sentiment Analysis of Healthcare Sustainability Discourse

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

Congratulations on a well-structured manuscript that presents a mixed-methods analysis of social media discourse on healthcare sustainability, combining topic modeling, sentiment analysis, and qualitative thematic analysis. Your methodology is adequate, clearly described, and generally reproducible.

The results are coherent and provide a useful overview of dominant themes and sentiment patterns in sustainability-related healthcare discussions on Twitter. The integration of computational and qualitative methods strengthens the interpretive depth of the findings.

However, the manuscript could be improved by more explicitly articulating its novelty relative to prior NLP-based analyses of health-related social media discourse.

I suggest the following:

Add 1–2 sentences at the end of the Introduction explicitly stating the study’s unique contribution (e.g., integration of sustainability framing with sentiment stratification, or implications for policy communication).
Briefly justify the keyword strategy and discuss potential bias introduced by keyword-based sampling.
Add a short explanation (e.g., coherence scores, interpretability, trial iterations) to support the selected model configuration.
Increase figure resolution where possible, simplify legends or move explanatory text to captions, and consider condensing Table 1 by merging closely related categories or reducing descriptive redundancy.
Keep the Results section strictly descriptive and move interpretive or policy-oriented statements fully into the Discussion.
Add a short paragraph explicitly addressing practical implications for healthcare communication or sustainability strategy.
Add more references, as this would increase the soundness of the work.

Comments on the Quality of English Language

The English could be improved by removing occasional verbosity and repetitive phrasing in the Results and Discussion, and by tightening some long sentences for clarity.

Author Response

We thank Reviewer 1 for the positive assessment and constructive suggestions. Each comment has been addressed as follows.

Comment 1: Add 1-2 sentences at the end of the Introduction explicitly stating the study's unique contribution (e.g., integration of sustainability framing with sentiment stratification, or implications for policy communication).

Response: Thank you for this suggestion. We have added a novelty statement at the end of the Introduction (after line 67 of the original manuscript) that reads: "Unlike prior NLP studies of health-related social media that focus on disease surveillance or public health crises, this study uniquely integrates sustainability framing with sentiment stratification across computationally derived topics, offering a mixed-method lens that bridges environmental discourse analysis with healthcare communication strategy. By combining LDA topic modeling, VADER sentiment analysis, and qualitative thematic analysis within a single framework, this study provides a more comprehensive characterization of public perceptions than purely computational or qualitative approaches alone, with direct implications for policymaker messaging and stakeholder engagement."

Comment 2: Briefly justify the keyword strategy and discuss potential bias introduced by keyword-based sampling.

Response: We agree that this was insufficiently described. We have substantially expanded Section 2.1 to include the full Boolean search string used, the rationale for keyword selection (derived iteratively from preliminary literature review and pilot searches), and an explicit acknowledgement that keyword-based sampling may introduce selection bias by excluding tweets that discuss sustainable healthcare concepts without using the specific search terms. The revised text also clarifies that the Twitter Academic Research API (full-archive search endpoint) was used, that retweets were excluded while quote tweets and replies were included, and that bot filtering relied on Twitter's default quality filters. Please see the revised Section 2.1, paragraphs 1 and 2.

Comment 3: Add a short explanation (e.g., coherence scores, interpretability, trial iterations) to support the selected model configuration.

Response: Thank you for this important point. We have expanded the LDA model selection description in Section 2.2 to report that models were trained across a range of k = 2 to k = 20, that the C_v coherence score (Röder et al., 2015) was computed for each model, and that k = 8 was selected based on peak coherence (C_v = 0.52) and qualitative assessment of topic distinctiveness at k = 6, 8, and 10. We have also added justification for the hyperparameters alpha = 0.1 and beta = 0.1 as symmetric priors appropriate for short-text corpora. A new paragraph at the end of Section 2.2 discusses LDA's known limitations with short-text data, acknowledges alternative approaches such as the Biterm Topic Model and BERTopic, and explains why LDA was retained for interpretability and comparability with prior studies.
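To make the selection rule above concrete, the following Python sketch picks k at the peak of a precomputed coherence curve and returns the neighbouring values for qualitative inspection, mirroring the k = 6, 8, 10 comparison. It is an illustrative sketch, not the authors' code: it assumes the C_v scores have already been computed for each candidate k (for example with gensim's `CoherenceModel(..., coherence="c_v")`), and the function name `select_k` and the toy scores are hypothetical.

```python
from __future__ import annotations


def select_k(coherence_by_k: dict[int, float], step: int = 2) -> tuple[int, list[int]]:
    """Return the k at peak coherence plus a shortlist for manual review.

    coherence_by_k maps each tested topic count (e.g. 2..20) to its C_v score.
    The shortlist holds the peak and its neighbours at +/- step, keeping only
    values that were actually tested.
    """
    if not coherence_by_k:
        raise ValueError("coherence_by_k must not be empty")
    # max() returns the first maximal item, so iterating in ascending k
    # breaks ties in favour of the smaller (more parsimonious) model.
    best_k = max(sorted(coherence_by_k), key=lambda k: coherence_by_k[k])
    shortlist = [k for k in (best_k - step, best_k, best_k + step) if k in coherence_by_k]
    return best_k, shortlist


if __name__ == "__main__":
    # Toy scores for illustration only; they are not results from the study.
    toy_scores = {4: 0.30, 6: 0.40, 8: 0.50, 10: 0.45}
    print(select_k(toy_scores))  # (8, [6, 8, 10])
```

The coherence peak only nominates candidates; as the response notes, the final choice also rested on reading the topics produced at each shortlisted k.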

Comment 4: Increase figure resolution where possible, simplify legends or move explanatory text to captions, and consider condensing Table 1 by merging closely related categories or reducing descriptive redundancy.

Response: We have made the following changes:

(a) Figure 2 has been redesigned as eight separate faceted panels (one per topic), replacing the original overlapping line chart that was difficult to read. Resolution has been increased to 300 DPI.

(b) Figure 3 has been revised with increased resolution (300 DPI), the sentiment categories reordered to Positive (bottom), Neutral (middle), and Negative (top) for more intuitive reading, and the legend placed at the bottom of the figure. Percentage labels are displayed inside each bar segment.

(c) Table 1 has been condensed by reducing keywords from 20 to 10 per topic and excluding the term "healthcare," which appeared as the top keyword in all topics due to the sampling strategy and does not differentiate between topics. A footnote explains this exclusion.

Comment 5: Keep the Results section strictly descriptive and move interpretive or policy-oriented statements fully into the Discussion.

Response: We agree with this observation. We have relocated the following interpretive passages from the Results to the Discussion:

(a) Section 3.1, lines 180-198 (from "This topical focus aligns with..." through "...multi-sectoral action.") have been moved to the Discussion and replaced with a concise descriptive summary of topic prevalence.

(b) Section 3.2, lines 218-222 (from "This overarching positivity suggests..." through "...without overt emotive language.") have been moved to the Discussion.

(c) Section 3.2, lines 229-237 (from "This likely reflects the alarm..." through "...advancing healthcare sustainability.") have been moved to the Discussion.

The Results section now contains strictly descriptive reporting of findings.

Comment 6: Add a short paragraph explicitly addressing practical implications for healthcare communication or sustainability strategy.

Response: We have added a new subsection (4.1 Practical Implications) in the Discussion that addresses actionable implications across four areas: (i) leveraging techno-optimism in AI and technology framing to enhance public engagement, (ii) pairing crisis messaging with solutions-oriented content around cost savings and efficiency, (iii) using professional conferences and awards as vehicles for normalizing sustainability practices, and (iv) integrating sustainability literacy into healthcare professional education curricula. Please see the revised Discussion section.

Comment 7: Add more references, as this would increase the soundness of the work.

Response: We have added several new references throughout the manuscript to strengthen the evidence base:

Zhao et al. (2011), "Comparing Twitter and Traditional Media Using Topic Models" (for LDA on Twitter data)
Ribeiro et al. (2016), "SentiBench: A Benchmark Comparison of State-of-the-Practice Sentiment Analysis Methods" (for VADER validation)
Yan et al. (2013), "A Biterm Topic Model for Short Texts" (for short-text topic modeling limitations)
Pichler et al. (2019), "International Comparison of Health Care Carbon Footprints" (for healthcare environmental footprint)
Röder et al. (2015), "Exploring the Space of Topic Coherence Measures" (for coherence metrics)
Tang et al. (2014), "Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis" (for token threshold justification)

Reviewer 2 Report

Comments and Suggestions for Authors

This study looks at how healthcare sustainability is discussed in public conversations on social media and how NLP methods can help uncover key themes and sentiments over time. I find the topic both timely and highly relevant. The paper clearly addresses an important gap by systematically analysing public perceptions of healthcare sustainability, which are still rarely examined at scale. In my view, the main contribution lies in the mixed-method NLP approach. Bringing together topic modeling, sentiment analysis, and thematic analysis provides a richer and more nuanced understanding than most existing studies in this area. Overall, the methodology is solid and well implemented. The paper could be further strengthened by briefly explaining key LDA parameter choices and by adding a short reflection on potential limitations of Twitter data. I find the conclusions clear, convincing, and well aligned with the results. They directly respond to the research question and offer useful insights for healthcare sustainability research and practice. The reference list appears appropriate and well balanced. The tables and figures are clear and easy to follow, and they effectively support the main findings.

Author Response

We thank Reviewer 2 for the positive evaluation and supportive comments. We address the two suggested improvements below.

Comment 1: The paper could be further strengthened by briefly explaining key LDA parameter choices.

Response: Thank you for this suggestion. As described in our response to Reviewer 1 (Comment 3), we have expanded Section 2.2 to include detailed reporting of the coherence curve (k = 2 to 20, C_v metric), the rationale for selecting k = 8, justification for the alpha and beta hyperparameters, and a discussion of LDA's limitations for short-text data alongside alternative approaches. Please see the revised Section 2.2.

Comment 2: Adding a short reflection on potential limitations of Twitter data.

Response: We have added several paragraphs addressing Twitter-specific limitations in the revised Discussion (Section 4.3 Limitations). These include: the uneven temporal distribution of tweets across the 2006-to-2024 period, the platform-specific nature of Twitter's user demographics and character constraints, and a discussion of how these findings may or may not generalize to other digital platforms such as Facebook, LinkedIn, and Reddit. We have also added a paragraph in Section 2.1 noting the substantially lower tweet volume in early years (2006 to 2012) and cautioning that early-year trends may reflect small-sample variability. Please see the revised Sections 2.1 and 4.3.

Reviewer 3 Report

Comments and Suggestions for Authors

A brief summary

This article presents an analysis of social media texts related to sustainable healthcare using LDA topic modeling and sentiment analysis. A dataset of 15,976 tweets was analyzed.

The aim of this paper is to analyze tweets related to sustainable healthcare using topic modeling and sentiment analysis. In particular, the authors sought to answer three questions: What are the dominant topics in social media discourse on sustainable healthcare? How is sentiment distributed across the identified topics? And finally, what practical lessons can NLP analysis provide for improving sustainability in healthcare?

The main contribution is the identification of eight themes: [(i) Eco-Friendly Healthcare Access and Product Innovation, (ii) Net-Zero Implementation and Sustainable Care, (iii) Climate Change and Environmental Impact, (iv) Global Emissions and Waste Management, (v) Critical Challenges and Solutions, (vi) Education and Community Development, (vii) AI and Technology Innovation, (viii) Events and Research Collaboration]. While economic and environmental sustainability predominate, these themes also cover workforce development, infrastructure, technology, and sustainable access to healthcare. Sentiment is generally positive or neutral, with some negative points regarding financial challenges and polarization on climate issues.

Broad comments

1) The method used to obtain the analyzed tweets was not explained. This has significant implications for their content. It is necessary to clarify the selection process for the analyzed dataset.

2) The most significant keyword in each topic is always "healthcare." This is obvious, and likely stems from the way tweets were selected for analysis. Since this word doesn't differentiate topics, it should probably be removed from the analysis and the remaining 10 keywords should be considered.

3) Another issue concerns words that appear in the topic titles: they often also appear among the topic keywords. This requires additional discussion of whether such terms should be included as keywords, since their presence there seems self-evident.

Specific comments:

l. 211 - Figure 2 is completely unreadable. The topics should be divided into eight facets.

l. 227 - In my opinion, it would be more appropriate to display the proportions of sentiment categories in the sequence of positive, neutral, and negative in Figure 3.

Author Response

We thank Reviewer 3 for the careful reading and specific suggestions. Each comment is addressed below.

Broad Comment 1: The method used to obtain the analyzed tweets was not explained. This has significant implications for their content. It is necessary to clarify the selection process for the analyzed dataset.

Response: We agree that this was insufficiently described. We have substantially expanded Section 2.1 to include the specific API endpoint used (Twitter Academic Research API, full-archive search), the complete Boolean search string, the rationale for keyword selection, the handling of retweets (excluded), quote tweets and replies (included), bot filtering (Twitter's default quality filters), and deduplication (based on tweet ID). We have also added discussion of temporal coverage and the uneven distribution of tweets across the 2006-to-2024 timeframe. Please see the revised Section 2.1.
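The retweet, duplicate, and temporal-coverage handling described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' pipeline: it assumes each record is a dict shaped loosely like a Twitter API v2 tweet object (an `id`, an ISO-8601 `created_at`, and an optional `referenced_tweets` list whose items carry a `type` such as `"retweeted"`, `"quoted"`, or `"replied_to"`). Since the response states that retweets were excluded at the query stage, the retweet check here acts only as a defensive post hoc filter.

```python
from __future__ import annotations

from collections import Counter


def clean_tweets(tweets: list[dict]) -> list[dict]:
    """Drop retweets and duplicate IDs, keeping quote tweets and replies."""
    seen_ids: set[str] = set()
    kept: list[dict] = []
    for tweet in tweets:
        ref_types = {ref.get("type") for ref in tweet.get("referenced_tweets") or []}
        if "retweeted" in ref_types:
            continue  # retweets excluded; "quoted" and "replied_to" are kept
        if tweet["id"] in seen_ids:
            continue  # deduplicate on tweet ID, keeping the first occurrence
        seen_ids.add(tweet["id"])
        kept.append(tweet)
    return kept


def tweets_per_year(tweets: list[dict]) -> dict[int, int]:
    """Count tweets per calendar year to expose uneven temporal coverage."""
    counts = Counter(int(t["created_at"][:4]) for t in tweets)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    sample = [
        {"id": "1", "created_at": "2007-05-01T10:00:00Z"},
        {"id": "1", "created_at": "2007-05-01T10:00:00Z"},  # duplicate ID
        {"id": "2", "created_at": "2020-01-15T08:30:00Z",
         "referenced_tweets": [{"type": "retweeted", "id": "9"}]},
        {"id": "3", "created_at": "2020-02-01T12:00:00Z",
         "referenced_tweets": [{"type": "quoted", "id": "8"}]},
    ]
    print(tweets_per_year(clean_tweets(sample)))  # {2007: 1, 2020: 1}
```

Reporting the per-year counts alongside any temporal figure makes the low early-year volumes visible to readers.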

Broad Comment 2: The most significant keyword in each topic is always "healthcare." This is obvious, and likely stems from the way tweets were selected for analysis. Since this word doesn't differentiate topics, it should probably be removed from the analysis and the remaining 10 keywords should be considered.

Response: This is an excellent observation. We have added a paragraph in Section 3.1 explicitly acknowledging that "healthcare" appears as the highest-weighted keyword across all eight topics as an expected artifact of the keyword-based data collection strategy, and that it does not meaningfully differentiate between topics. We have revised Table 1 to exclude "healthcare" and display only the top 10 remaining discriminative keywords per topic. A supplementary analysis excluding "healthcare" confirmed that the remaining terms adequately distinguish topics from one another. A footnote beneath the table explains this exclusion.
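The Table 1 display rule described above (drop "healthcare", then keep the top 10 remaining terms) can be sketched as follows. The function name and weights are hypothetical; the per-topic weights could come, for example, from gensim's `LdaModel.show_topic(topicid, topn=...)`, which returns (word, probability) pairs. Because excluded terms are removed after ranking, at least n + len(exclude) terms should be requested from the model. The sketch covers only the display step, not the supplementary analysis mentioned in the response.

```python
from __future__ import annotations


def top_keywords(
    term_weights: dict[str, float],
    n: int = 10,
    exclude: frozenset[str] = frozenset({"healthcare"}),
) -> list[str]:
    """Return the n highest-weighted terms of one topic, skipping excluded terms.

    Ties in weight are broken alphabetically so the output is deterministic.
    """
    ranked = sorted(term_weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked if term not in exclude][:n]


if __name__ == "__main__":
    # Hypothetical weights for one topic; "healthcare" tops the list as in Table 1.
    weights = {"healthcare": 0.20, "climate": 0.10, "change": 0.08, "nhs": 0.05}
    print(top_keywords(weights, n=2))  # ['climate', 'change']
```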

Broad Comment 3: Another issue concerns words that appear in the topic titles: they often also appear among the topic keywords. This requires additional discussion of whether such terms should be included as keywords, since their presence there seems self-evident.

Response: We have addressed this in the new paragraph added to Section 3.1, which explains that certain keywords appearing in topic labels (e.g., "climate" in the Climate Change topic) naturally recur as high-weight terms; this confirms the validity of the labels rather than representing circular reasoning. The topic labels were assigned post hoc based on the keyword distributions, so the overlap between label terms and keywords is expected and reflects the coherence of the topic assignments.

Specific Comment 1: l. 211 - Figure 2 is completely unreadable. The topics should be divided into eight facets.

Response: We agree. Figure 2 has been completely redesigned as eight separate faceted panels (one per topic), each showing temporal prominence on its own y-axis. This makes individual topic trends clearly readable. The figure caption has also been updated, and the x-axis has been corrected to end at 2024 (not 2025). A footnote warns about small-sample variability in early years.
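The response does not state how temporal prominence was computed. One common choice, assumed here, is the mean document-topic proportion of the tweets posted in each year; each topic's yearly series then becomes one facet, and the per-year counts support the small-sample footnote. The function name and toy inputs below are hypothetical. For plotting, something like matplotlib's `plt.subplots(2, 4, sharey=False)` would give each of the eight panels its own y-axis.

```python
from __future__ import annotations

from collections import defaultdict


def topic_prominence_by_year(
    doc_years: list[int], doc_topics: list[list[float]]
) -> tuple[dict[int, list[float]], dict[int, int]]:
    """Average document-topic proportions per year, plus tweet counts per year.

    doc_topics[i] is the topic distribution (theta) of document i. Each topic's
    yearly means form one facet; the counts flag small-sample years.
    """
    if len(doc_years) != len(doc_topics):
        raise ValueError("doc_years and doc_topics must have the same length")
    sums: dict[int, list[float]] = {}
    counts: dict[int, int] = defaultdict(int)
    for year, theta in zip(doc_years, doc_topics):
        if year not in sums:
            sums[year] = [0.0] * len(theta)
        for k, share in enumerate(theta):
            sums[year][k] += share
        counts[year] += 1
    means = {year: [s / counts[year] for s in sums[year]] for year in sorted(sums)}
    return means, dict(sorted(counts.items()))


if __name__ == "__main__":
    years = [2007, 2020, 2020]
    thetas = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    means, counts = topic_prominence_by_year(years, thetas)
    # Series for the facet of topic index 1: one point per year.
    print([means[y][1] for y in sorted(means)], counts)  # [0.0, 0.75] {2007: 1, 2020: 2}
```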

Specific Comment 2: l. 227 - In my opinion, it would be more appropriate to display the proportions of sentiment categories in the sequence of positive, neutral, and negative in Figure 3.

Response: We agree with this suggestion. Figure 3 has been revised so that the stacked bars display sentiment categories in the order: Positive (bottom), Neutral (middle), and Negative (top). The legend has been updated accordingly and placed at the bottom of the figure, consistent with the original layout. Resolution has been increased to 300 DPI, and percentage labels are now displayed inside each bar segment.
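The revised ordering in Figure 3 amounts to fixing the stacking order before computing each segment's offset. The hypothetical sketch below computes, for one topic, each category's percentage, the offset at which its segment starts, and the midpoint where an in-bar percentage label could sit. Plotting is left out, but these values map directly onto matplotlib's `ax.bar(x, height, bottom=...)` and `ax.text(x, y, s)`.

```python
from __future__ import annotations

from collections import Counter

STACK_ORDER = ("positive", "neutral", "negative")  # bottom -> top, as in revised Figure 3


def stacked_segments(labels: list[str]) -> list[tuple[str, float, float, float]]:
    """Return (category, percent, bottom, label_y) for one topic's stacked bar.

    bottom is where the segment starts on a 0-100 axis, and label_y is the
    vertical midpoint at which a percentage label can be drawn inside it.
    Labels outside STACK_ORDER are ignored.
    """
    counts = Counter(labels)
    total = sum(counts[c] for c in STACK_ORDER)
    if total == 0:
        raise ValueError("no sentiment labels to stack")
    segments, bottom = [], 0.0
    for category in STACK_ORDER:
        pct = 100.0 * counts[category] / total
        segments.append((category, pct, bottom, bottom + pct / 2))
        bottom += pct
    return segments


if __name__ == "__main__":
    for segment in stacked_segments(["positive", "positive", "neutral", "negative"]):
        print(segment)
```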

Reviewer 4 Report

Comments and Suggestions for Authors

This article addresses a contemporary and socially significant issue, the relevance of which is increasing in the context of global climate change.

In this study, 15,976 English-language tweets (2006–2024) related to sustainable healthcare were analyzed using natural language processing (NLP). Using latent Dirichlet allocation (LDA), eight key themes were identified, including sustainable access, zero-emissions implementation, climate impact, emissions, costs and waste, education, infrastructure, and green technologies.

A thematic analysis of 800 tweets allowed the authors to identify six cross-cutting themes, including environmental responsibility in healthcare, health benefits, the need for climate action, and optimism regarding technological solutions.

The authors clearly described the methodology, created clear visualizations of the results, and presented a thorough discussion of the limitations.

It is also worth noting that the work has high practical significance. The authors demonstrate how the results can be used by policymakers and healthcare system managers.

Overall, the article addresses a highly important topic and demonstrates potential for publication.

However, to enhance the clarity and comprehensiveness of the study's findings, it is recommended that the authors consider the following revisions.

Comments and Suggestions for Authors:

Please deepen the analysis of causal relationships and explain why the discourse is shaped this way.
Please explain why the study was limited to English-language tweets and how to overcome the limitations of the data source.
Please explain the keyword selection.
Please explain why no distinction was made between users' personal opinions and messages from official organizations, artificial entities, etc.
Please explain how the results and research can be applied to other digital platforms (Facebook, LinkedIn, etc.).

Author Response

We thank Reviewer 4 for the positive assessment and thoughtful suggestions. Each comment is addressed below.

Comment 1: Please deepen the analysis of causal relationships and explain why the discourse is shaped this way.

Response: We have added a new subsection (4.2 Drivers of Discourse Patterns) in the Discussion that examines causal factors shaping the observed discourse patterns. This includes: (i) the temporal correspondence between climate and emissions theme prominence and landmark policy developments such as the Paris Agreement (2015) and the NHS England Net Zero commitment (2020), (ii) the potential influence of technological solutionism narratives on the high positivity around AI and technology topics, (iii) the role of crisis communication framing in driving negative sentiment around climate topics, and (iv) the institutional and professional composition of participants in infrastructure and policy discussions, which may explain their neutral tone. The section concludes by noting the importance of understanding these causal dynamics for policy and resource allocation decisions. Please see the revised Discussion, Section 4.2.

Comment 2: Please explain why the study was limited to English-language tweets and how to overcome the limitations of the data source.

Response: We have expanded the limitations discussion in Section 4.3 to explain that the English-language restriction was necessary to ensure consistency in NLP preprocessing, sentiment lexicon applicability (VADER is calibrated for English), and thematic coding. The revised text suggests that future studies could address this limitation by employing multilingual NLP pipelines, cross-lingual topic models, or parallel analyses of tweets in other high-volume languages (e.g., Spanish, Mandarin, French) to assess whether the thematic and sentiment patterns identified here hold across linguistic and cultural contexts. Please see the revised Section 4.3.

Comment 3: Please explain the keyword selection.

Response: As described in our response to Reviewer 3 (Broad Comment 1), we have expanded Section 2.1 to include the complete Boolean search string and the rationale for keyword selection, explaining that keywords were derived iteratively from preliminary literature review and pilot searches, balancing specificity to the healthcare sustainability domain against recall of relevant discourse. Please see the revised Section 2.1.

Comment 4: Please explain why no distinction was made between users' personal opinions and messages from official organizations, artificial entities, etc.

Response: We have added a new paragraph in Section 2.1 and a corresponding limitations paragraph in Section 4.3 acknowledging that this study did not distinguish between individual users (e.g., patients, clinicians) and organizational accounts (e.g., hospital systems, NGOs, advocacy groups). We explain that, while differentiating user types could provide additional analytical depth (for example, by revealing whether institutional messaging differs in sentiment or framing from personal opinion), this classification was beyond the scope of the current study and would require robust account-type classification methods. We note this as a limitation and suggest it as an avenue for future work.

Comment 5: Please explain how the results and research can be applied to other digital platforms (Facebook, LinkedIn, etc.).

Response: We have added a new paragraph in the revised Discussion (Section 4.3) addressing cross-platform generalizability. The text notes that our findings are platform-specific to Twitter (now X), which has distinct user demographics, character constraints, and discourse norms. It discusses how LinkedIn may host more institutional and industry-oriented sustainability discourse, while Facebook community groups may capture grassroots patient perspectives. We recommend that future research replicate this analytical framework across platforms to assess convergence or divergence in themes and sentiment, and to build a more comprehensive map of the digital sustainability discourse landscape. Please see the revised Section 4.3.

Reviewer 5 Report

Comments and Suggestions for Authors

Thank you for the opportunity to review this manuscript. The paper addresses an important and timely question, namely how sustainable healthcare is discussed on Twitter and how thematic patterns relate to overall sentiment. The mixed-method approach is a promising choice for capturing both broad trends and more nuanced interpretations. At the same time, several aspects of the data collection, preprocessing, and analytic reporting need to be clarified and strengthened to improve reproducibility and to ensure that the main conclusions are well supported. My comments below focus on the changes that would most improve methodological transparency, robustness, and interpretability.

Main strengths

The research questions are clear and relevant to the intersection of sustainable healthcare and public communication.

Combining quantitative methods (topic modeling and sentiment analysis) with qualitative thematic analysis can be an appropriate framework to capture multiple layers of the discourse.

Major comments

The data collection and sampling description is (currently) not detailed enough to make the study completely reproducible. It is unclear which Twitter API endpoint and access level were used, what the exact query strategy was, how retweets, duplicates, quote tweets, and replies were handled, and whether any bot or spam filtering was applied. Because the time window spans 2006 to 2024, it is especially important to clarify how coverage was ensured for early years and what biases might result from uneven availability over time.

The length-based filtering decision requires stronger justification. The topic modeling uses only tweets with at least 30 tokens (5,681 tweets). This threshold is high relative to typical tweet length and likely biases the sample toward longer, stylistically different posts. Please justify this choice.

The rationale for the LDA model selection and parameterization currently feels incomplete. The manuscript states that topic number selection was based on coherence and that k = 8 was chosen, but it does not report the tested range of k values, the coherence measure used, or the shape of the coherence curve. There is also no justification for priors such as alpha and beta. Given that Twitter data is short-text and sparse, the authors should either compare LDA to short-text oriented alternatives (for example, biterm-based approaches or embedding-based topic models) or provide a stronger discussion of LDA’s limitations in this context and why it remains appropriate here.

The sentiment analysis interpretation seems stronger than what VADER can safely support in this domain. VADER is useful for fast lexicon-based estimates, but sustainability discourse often includes irony, technical phrasing, and news-sharing, all of which can reduce accuracy. The manuscript would benefit from a small manual validation sample or another robustness check, and the narrative should be more cautious about normative interpretations of “positivity” and “negativity.”

The stated time window is 2006 to 2024, yet 2007 to 2025 appears elsewhere in the text and in figure captions (Figure 2).

Conclusion

In its current form, I recommend major revision. I believe the study has clear potential, because the topic is relevant and the general design could yield publishable insights once the methodological foundations are made more transparent and the results are validated more carefully. In particular, clearer documentation of data acquisition and filtering, improved preprocessing, stronger justification and reporting of the topic-model selection, and a more explicit qualitative coding procedure would substantially increase confidence in the findings. If the authors address these points and correct the internal inconsistencies in the manuscript, the paper would be much stronger and more credible.

Author Response

Comment 1: The data collection and sampling description is (currently) not detailed enough to make the study completely reproducible. It is unclear which Twitter API endpoint and access level were used, what the exact query strategy was, how retweets, duplicates, quote tweets, and replies were handled, and whether any bot or spam filtering was applied. Because the time window spans 2006 to 2024, it is especially important to clarify how coverage was ensured for early years and what biases might result from uneven availability over time.

Response: We agree that this was a significant gap. Section 2.1 has been substantially expanded to address all of these points:

(a) API endpoint: The Twitter Academic Research API (full-archive search endpoint) was used, providing access to the complete history of public tweets.

(b) Query strategy: The complete Boolean search string is now reported, along with the rationale for keyword selection.

(c) Retweets, duplicates, quotes, replies: The query was restricted to original tweets (excluding retweets). Quote tweets and replies were included. Deduplication was performed based on tweet ID.

(d) Bot/spam filtering: No automated bot or spam filtering was applied beyond Twitter's default quality filters; this is acknowledged as a limitation.

(e) Temporal coverage: A new paragraph discusses the substantially lower tweet volume in early years (2006 to 2012) and notes that early-year data points should be interpreted with caution, as they may reflect idiosyncratic posts rather than stable trends.

Please see the revised Section 2.1.

Comment 2: The length-based filtering decision requires stronger justification. The topic modeling uses only tweets with at least 30 tokens (5,681 tweets). This threshold is high relative to typical tweet length and likely biases the sample toward longer, stylistically different posts. Please justify this choice.

Response: We have revised the relevant paragraph in Section 2.1 (originally lines 85-90) to provide a detailed justification. The 30-token threshold was chosen because LDA relies on word co-occurrence patterns within documents, and very short texts provide insufficient co-occurrence signal, often leading to poor topic assignments and unstable models (Tang et al., 2014). We acknowledge that this threshold biases the topic modeling subset toward longer, potentially more substantive posts, and note that shorter conversational tweets may reflect different thematic patterns. Sensitivity analyses using thresholds of 20 and 25 tokens yielded similar topic structures, supporting the robustness of our chosen cutoff. Please see the revised Section 2.1.
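The cutoff itself is a simple length filter, and the sensitivity check starts from the subset it produces at each threshold. The sketch below assumes whitespace tokenization of already preprocessed text (the study's tokenizer is not specified in this report, so real counts may differ) and uses hypothetical function names. It only reports how many documents survive each cutoff; confirming that topic structures are similar would additionally require refitting the model on each subset, which is not shown.

```python
from __future__ import annotations


def token_count(text: str) -> int:
    """Count whitespace-separated tokens (assumed stand-in for the real tokenizer)."""
    return len(text.split())


def filter_by_length(docs: list[str], min_tokens: int = 30) -> list[str]:
    """Keep only documents with at least min_tokens tokens for topic modeling."""
    return [doc for doc in docs if token_count(doc) >= min_tokens]


def threshold_sensitivity(
    docs: list[str], thresholds: tuple[int, ...] = (20, 25, 30)
) -> dict[int, int]:
    """Number of documents retained at each candidate cutoff."""
    return {t: len(filter_by_length(docs, t)) for t in thresholds}


if __name__ == "__main__":
    toy_docs = ["w " * 35, "w " * 22, "w " * 10]
    print(threshold_sensitivity(toy_docs))  # {20: 2, 25: 1, 30: 1}
```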

Comment 3: The rationale for the LDA model selection and parameterization currently feels incomplete. The manuscript states that topic number selection was based on coherence and that k = 8 was chosen, but it does not report the tested range of k values, the coherence measure used, or the shape of the coherence curve. There is also no justification for priors such as alpha and beta. Given that Twitter data is short-text and sparse, the authors should either compare LDA to short-text oriented alternatives (for example, biterm-based approaches or embedding-based topic models) or provide a stronger discussion of LDA's limitations in this context and why it remains appropriate here.

Response: We have comprehensively addressed this concern. The revised Section 2.2 now reports:

(a) The tested range of k = 2 to k = 20.

(b) The specific coherence measure used (C_v, per Röder et al., 2015).

(c) The selection of k = 8 at peak coherence (C_v = 0.52).

(d) Qualitative assessment of topic distinctiveness at k = 6, 8, and 10.

(e) Justification for alpha = 0.1 and beta = 0.1 as symmetric priors appropriate for short-text corpora.

Additionally, a new paragraph at the end of Section 2.2 explicitly discusses LDA's limitations with short-text data, acknowledges alternative approaches (Biterm Topic Model by Yan et al., 2013; BERTopic), explains why LDA was retained (interpretability, probabilistic framework, comparability with prior studies), and notes that the 30-token threshold partially mitigates the short-text limitation. Future studies comparing LDA against short-text-optimized models are recommended. Please see the revised Section 2.2.
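For reference, the standard LDA generative model shows where the two symmetric priors enter (this is the textbook formulation, not reproduced from the manuscript). Here D is the number of documents, V the vocabulary size, and K the number of topics. With a symmetric Dirichlet concentration below 1, as with alpha = beta = 0.1, draws tend to be sparse: each tweet is dominated by a few topics and each topic by a few words, which is the usual argument for small priors on short texts.

```latex
\begin{aligned}
\theta_d &\sim \operatorname{Dirichlet}(\alpha \mathbf{1}_K), & d &= 1,\dots,D \\
\phi_k &\sim \operatorname{Dirichlet}(\beta \mathbf{1}_V), & k &= 1,\dots,K \\
z_{d,n} \mid \theta_d &\sim \operatorname{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \operatorname{Categorical}(\phi_{z_{d,n}})
\end{aligned}
\qquad K = 8,\quad \alpha = \beta = 0.1
```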

Comment 4: The sentiment analysis interpretation seems stronger than what VADER can safely support in this domain. VADER is useful for fast lexicon-based estimates, but sustainability discourse often includes irony, technical phrasing, and news-sharing, all of which can reduce accuracy. The manuscript would benefit from a small manual validation sample or another robustness check, and the narrative should be more cautious about normative interpretations of "positivity" and "negativity."

Response: We agree with this concern. We have added a new paragraph at the end of Section 2.3 reporting a manual validation exercise: a random sample of 200 tweets (25 per topic) was independently coded for sentiment by two researchers. Inter-rater agreement was substantial (Cohen's kappa = 0.74), and agreement between manual coding and VADER classifications was 78.5%, with most discrepancies occurring in tweets containing irony or mixed sentiment. We have also moderated the interpretive language throughout, noting that sentiment distributions should be understood as indicative of broad attitudinal patterns rather than precise measures of public opinion. A dedicated paragraph in the revised Discussion (Section 4) reiterates these caveats. Please see the revised Sections 2.3 and 4.
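The validation above reduces to drawing a topic-stratified sample, mapping each VADER compound score to a category, and counting matches against the manual codes. The sketch below assumes the plus/minus 0.05 compound cutoffs commonly recommended in the VADER documentation (this report does not state which cutoffs the study used) and takes compound scores as already computed, for example via `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` from the `vaderSentiment` package. The function names and the seed are hypothetical.

```python
from __future__ import annotations

import random
from collections import defaultdict


def vader_label(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a label using assumed +/-0.05 cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def stratified_sample(topic_of: dict[str, int], per_topic: int = 25, seed: int = 42) -> list[str]:
    """Draw per_topic tweet IDs from each dominant topic (8 topics x 25 = 200).

    Raises ValueError if any topic has fewer than per_topic tweets.
    """
    by_topic: dict[int, list[str]] = defaultdict(list)
    for tweet_id, topic in topic_of.items():
        by_topic[topic].append(tweet_id)
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    sample: list[str] = []
    for topic in sorted(by_topic):
        sample.extend(rng.sample(sorted(by_topic[topic]), per_topic))
    return sample


def percent_agreement(manual: list[str], automatic: list[str]) -> float:
    """Share of items (in %) where the manual and automatic labels match."""
    if not manual or len(manual) != len(automatic):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(m == a for m, a in zip(manual, automatic))
    return 100.0 * matches / len(manual)


if __name__ == "__main__":
    print([vader_label(s) for s in (0.6, 0.0, -0.3)])  # ['positive', 'neutral', 'negative']
    print(percent_agreement(["positive", "neutral"], ["positive", "negative"]))  # 50.0
```

Percent agreement does not correct for chance; the inter-rater kappa reported alongside it addresses that.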

Comment 5: The stated time window is 2006 to 2024, yet 2007 to 2025 appears elsewhere in the text and in figure captions (Figure 2).

Response: Thank you for catching this inconsistency. We have corrected all references to ensure consistency throughout the manuscript. The data collection period is 2006 to 2024. The temporal analysis in Figure 2 begins at 2007 (the earliest year with tweets in the dataset) and ends at 2024. All text references, figure captions, and axis labels have been updated to "2007 to 2024." The Abstract confirms "2006 to 2024" as the collection window.

Comment 6 (implicit, regarding qualitative coding procedure): A more explicit qualitative coding procedure would substantially increase confidence in the findings.

Response: We have added a new paragraph at the end of Section 2.4 describing the coding procedure in detail. The thematic analysis was conducted by two independent coders (RS and AG). Initial codes were generated independently, followed by a consensus meeting to resolve discrepancies and agree on a shared codebook. The codebook was applied to the full 800-tweet sample, with disagreements resolved through discussion. Inter-coder reliability was assessed on a randomly selected 20% subset, yielding Cohen's kappa of 0.81, indicating strong agreement. Theme labels and definitions were reviewed by all authors prior to finalization. Please see the revised Section 2.4.
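Cohen's kappa, used for both reliability figures in this response, corrects the observed agreement p_o for the agreement p_e expected by chance from each coder's label frequencies: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch follows, assuming one code per tweet (if a tweet could carry several themes, kappa would need to be computed per theme) and using hypothetical function names and seed.

```python
from __future__ import annotations

import random
from collections import Counter


def reliability_subset(ids: list[str], fraction: float = 0.2, seed: int = 7) -> list[str]:
    """Reproducibly draw a random subset for double coding (20% of 800 = 160)."""
    k = round(len(ids) * fraction)
    return random.Random(seed).sample(ids, k)


def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Cohen's kappa for two coders' nominal labels on the same items."""
    n = len(coder_a)
    if n == 0 or n != len(coder_b):
        raise ValueError("need two non-empty label lists of equal length")
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n  # observed agreement
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: product of the coders' marginal proportions per category.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined: both coders used one identical category
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    coder_a = ["x", "x", "y", "y"]
    coder_b = ["x", "y", "y", "y"]
    print(cohens_kappa(coder_a, coder_b))  # 0.5
    print(len(reliability_subset([str(i) for i in range(800)])))  # 160
```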

Round 2

Reviewer 5 Report

Comments and Suggestions for Authors

The authors have addressed my previous comments carefully, and I would like to thank them for the thorough revisions. In particular, the added discussion of limitations and the clarified and expanded methodological details significantly improve the manuscript’s transparency and interpretability. Based on these changes, I believe the manuscript is now suitable for publication in its current form.