From Posts to Knowledge: Annotating a Pandemic-Era Reddit Dataset to Navigate Mental Health Narratives

Rani, Saima; Ahmed, Khandakar; Subramani, Sudha

doi:10.3390/app14041547

by Saima Rani *, Khandakar Ahmed and Sudha Subramani

Intelligent Technology Innovation Lab (ITIL), Institute for Sustainable Industries & Liveable Cities (ISILC), Victoria University, Melbourne 3011, Australia

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(4), 1547; https://doi.org/10.3390/app14041547

Submission received: 18 December 2023 / Revised: 11 February 2024 / Accepted: 12 February 2024 / Published: 15 February 2024

(This article belongs to the Special Issue Artificial Intelligence Applications in Healthcare System)

Abstract

Mental illness is increasingly recognized as a substantial public health challenge worldwide. With the advent of social media, these platforms have become pivotal for individuals to express their emotions, thoughts, and experiences, thereby serving as a rich resource for mental health research. This paper is devoted to the creation of a comprehensive dataset and an innovative data annotation methodology to explore the underlying causes of these mental health issues. Our approach included the extraction of over one million Reddit posts from five different subreddits, spanning the pre-pandemic, during-pandemic, and post-pandemic periods. These posts were methodically annotated using a set of specific criteria, aimed at identifying various root causes. This rigorous process produced a richly categorized dataset, invaluable for detailed analysis. The complete unlabelled dataset, along with a subset that has been expertly annotated, is prepared for public release, as outlined in the data availability section. This dataset is a critical resource for training and fine-tuning machine learning models to identify the foundational triggers of individual mental health issues, offering valuable insights for practical interventions and future research in this domain.

Keywords:

social media; mental health; natural language processing; dataset; machine learning; COVID-19

1. Introduction

Mental health disorders are among the most prevalent illnesses globally and are closely linked to an increased risk of suicide [1,2]. According to the World Health Organization (WHO), mental health issues cost approximately USD 2.5 trillion in 2010, a figure projected to rise to USD 6.0 trillion by 2030, and more than 350 million people are affected by depression [3]. In Australia alone, half of the population faces mental health challenges, with about 3000 individuals tragically ending their lives each year [4]. From 2011 to 2021, suicide rates in males and females increased from 16.2 to 18.6 and from 5.1 to 5.8 deaths per one hundred thousand, respectively [5]. Considering these statistics, it is evident that mental health problems have a worldwide impact and that only a fraction of people have received timely and adequate treatment. This points to a growing economic strain on governments and calls for new intervention and prevention strategies to reduce mental illness and suicide.

The early detection of root causes is a crucial step in employing preventive measures [6]. Traditional clinical settings have primarily relied on designed questionnaires and conventional techniques [7]. However, this approach often captures data only after the full development of the illness, at which point many individuals may be reluctant to seek treatment [8]. The limitations of these traditional methods, along with the rising frequency of undetected cases despite decades of research, emphasise the need for more proactive strategies in mental health diagnostics and treatment [9].

The advent of social media has opened new avenues for analyzing mental health disorders. In our interconnected world, where personal narratives are shared openly, social media platforms have become a valuable resource for understanding the daily stresses and concerns of people globally. Particularly during the COVID-19 pandemic, social media has provided unprecedented insights into mental health trends.

Globally, approximately one in eight individuals grapple with some form of mental disorder, including but not limited to anxiety, depression, Post-Traumatic Stress Disorder (PTSD), and bipolar disorder [10]. These disorders often have deep-rooted causes, extending beyond immediate or obvious factors. Recognizing the complexity of these disorders, Health Direct Australia identifies several key root causes [11]. These include:

Personality factors;
Drug and alcohol abuse;
Trauma and stress factors;
Early life environment;
Biological factors;
Genetic factors.

Our core hypothesis posits that social media platforms offer a rich source for understanding the root causes of mental health issues. The format of social media posts enables in-depth textual sharing of mental health narratives, thereby creating a dataset not attainable through traditional surveys. This dataset provides the opportunity to train natural language processing (NLP) models for the effective and accurate identification of mental health root causes from regular conversations.

A secondary aspect of our hypothesis involves the potential shifts in mental health root causes across different time periods. By extracting data from pre-pandemic, pandemic, and post-pandemic timelines, we aim to investigate any significant changes in factors affecting mental health. The exploration of data across various time periods will enrich the analysis, offering a broader perspective on how these root causes may have evolved due to significant global events like the COVID-19 pandemic.

To explore this hypothesis, our research poses the following key questions:

RQ1: How can a semantically comprehensive corpus be created that captures factors or sentiments related to mental health during pre-pandemic, pandemic, and post-pandemic time periods?
RQ2: What methodologies can be developed to accurately annotate and categorize mental health-related discussions on social media?

Our research presents a novel approach, utilizing the vast repository of personal narratives on Reddit to identify the root causes of mental health disorders as expressed in social media posts. While existing research has significantly advanced in detecting mental health issues using machine learning (ML) on social media [12,13], there is a gap in providing actionable information about root cause identification. Our study aims to bridge this gap by presenting a corpus of social media texts aimed at identifying the root causes of mental health issues. We anticipate that this corpus will accelerate the development of ML models to tackle this complex problem. The potential applications of this research are diverse, ranging from diagnosing mental health disorders through conversations to monitoring the psychological effects of disastrous events.

2. Background

Previous research in the field of mental health has predominantly focused on detecting issues rather than exposing their underlying causes. Early detection and intervention are vital. Recent studies have recognised the wealth of data available on social media as crucial for thoroughly analyzing mental health conditions. This development has brought NLP and ML to the forefront as essential tools for policymakers and healthcare providers, offering strategies to protect those most at risk [14,15]. Mental health research demands rigorous data collection and analysis to identify accurate indicators and enhanced outcomes [16].

2.1. Mental Health Detection on Social Media

The COVID-19 pandemic has intensified the reliance on social media as a support system, highlighting the medium’s potential for public health surveillance [17]. The significance of social media transcends personal communication and has become essential for research. As of April 2023, the global social media active user base has reached 4.8 billion people [18]. Platforms such as Reddit, Twitter, and Facebook are now integral to mental health research, facilitating the discovery of behavioral patterns that might elude conventional methods [19].

The rise of social media has created new avenues for supporting individuals with mental health issues [20]. These individuals utilize social media at comparable rates to the broader population, often seeking anonymity and refuge from social isolation [21,22,23]. The accessible and perceived safe space of online environments provides a platform for unjudged expression and connections often avoided offline [24,25,26]. This is particularly relevant to those with limited real-life social interactions [27].

Younger generations have voiced that social media helps alleviate feelings of isolation, with a preference for sharing their experiences online over traditional clinical settings [28]. Early research has shown that individuals with schizophrenia are more inclined to share their experiences on social media [29,30], with others indicating a general trend towards seeking online support over in-person help for mental health concerns [31]. The aggregation of user-generated content on forums such as Reddit has further expanded the scope for detecting a range of mental health issues, from depression to self-harm, through various analytical methodologies [32,33].

Collectively, these insights establish social media as a critical resource for understanding and supporting mental health, encouraging researchers to leverage this data to pre-emptively identify and address mental health risks, potentially averting more severe outcomes [34,35,36].

2.2. Natural Language Processing (NLP) Application for Mental Health Issues

With the prevalent use of social media among individuals with mental health disorders, these platforms present an opportunity to enhance public health research and improve the provision of mental health services. NLP and ML are increasingly applied to detect mental health issues at an early stage, revolutionizing the potential for timely and effective intervention and alleviating the burden of mental disorders. The confluence of ML and NLP with social media data has proven invaluable for identifying behavioral patterns indicative of mental health disorders [37,38]. This synergy has propelled proactive mental healthcare forward, facilitating early diagnosis and treatment [39]. Numerous studies have successfully employed classification algorithms, such as support vector machines and random forest, to identify conditions like depression, stress, and suicidal ideation from social media content, examining the use of ML and NLP in social media datasets to extract profound insights [40,41].

The period from 2012 to 2021 has seen a consistent increase in the application of NLP for mental health detection, with deep learning techniques becoming particularly prominent [42]. The majority of these studies have leveraged the power of supervised learning, utilizing a variety of models, including support vector machines [43], decision trees [44], and complex ensemble methods like AdaBoost [45], to accurately label data and predict mental health outcomes.

Advancements in deep learning have inaugurated a new era of NLP applications in mental health research [46]. Innovative models using embeddings such as GloVe and Word2Vec have been crafted to probe the linguistic intricacies of social media discourse more deeply [47]. Deep learning architectures like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) with long short-term memory (LSTM) units have overcome some limitations inherent in traditional RNNs, enhancing the models’ ability to capture long-term dependencies [48,49].

In summary, the deployment of NLP in mental health research has established these technologies as essential tools within the mental healthcare sector. Their application extends beyond diagnosis to encompass support and treatment, marking them as crucial in reducing the prevalence of mental health issues [50].

3. Materials and Methods

3.1. Dataset

The rise of social media data in recent years is indisputable, offering an unprecedented volume of real-time insights into human behaviour. This data, when effectively harnessed, empowers researchers to discern patterns, trends, and sentiments, leading to pioneering discoveries across various fields, including mental health. Our study leverages this data to identify trends and patterns in mental health discourse.

3.1.1. Platform Selection

A cross-platform analysis was performed to ascertain the most suitable social media platform for our research objectives. Facebook and Twitter, while popular, have inherent limitations that can restrict access to comprehensive data [51]. Facebook’s predominantly private content limits accessible public data, while Twitter’s user engagement patterns, often skewed towards certain demographics, hinder the acquisition of a representative dataset for a diverse population [52,53].

In contrast, Reddit’s structure, fostering anonymous and in-depth discussions, is particularly conducive to mental health discourse, allowing users to share experiences and perspectives openly, which is essential for comprehensive mental health research. Its ‘subreddits’ allow for the efficient gathering of relevant data. Reddit’s upvote and downvote system further ensures the quality of content, making it an ideal platform for our research. Our cross-platform analysis evaluated these platforms based on data accessibility, user demographics, content nature, and relevance to mental health topics.

3.1.2. Reddit as Data Source

There is a growing body of evidence pointing to the increasing use of Reddit as a data source across a variety of research disciplines, signifying an upward trend in academic publications utilizing this platform [54]. In the context of our proposed study, we aim to create our dataset using Reddit. This method has been proven effective in prior research, providing a solid foundation for our study [55,56].

Our rationale for choosing Reddit as our data collection medium is anchored on the following considerations:

Access to High Volumes of Mental Health Data

Reddit, as a widely utilized social media site with approximately 430 million users, hosts numerous dedicated, topic-specific forums known as ‘subreddits’. These subreddits provide focused information on a plethora of subjects, including a significant number related to mental health. Prominent examples include r/mentalhealth, r/suicidewatch, r/lonely, r/depression, and r/anxiety, all of which offer deep insights into users’ mental health issues. These forums have gained recognition due to their size, consistency, and active user participation, and have been deemed beneficial and prevalent by domain experts.
Reddit serves as a platform where advice on a variety of mental health issues is sought and given. The r/IAMA subreddit, with its 22.4 million members, frequently features discussions on mental health topics, particularly under its affiliate r/Iama Health (Mental Health AMA), managed by professional psychotherapists [57].
Mental health-related subreddits host large memberships, as illustrated in Table 1. These users candidly post about their mental health issues, and these subreddits have already been used for research purposes [58,59]. Table 2 showcases the volume of mental health-related posts across selected months, illustrating the platform’s active engagement.
Reddit allows for extensive textual submissions, with a generous character limit of 40,000 characters per post. This enables users to express their sentiments in detail, seek advice, or provide support. The potential for such detailed posts can provide a richer and more nuanced dataset for mental health studies.

Impacts of Anonymity on Content Quality and Relevance

A unique advantage of Reddit is the pseudonymity it affords its users. This feature encourages individuals to express themselves freely without fear of social stigma. Subreddit moderators play a critical role in maintaining the anonymity of users by enforcing site rules that prohibit the disclosure of personal identities. Consequently, this results in high-quality, subject-specific content that is likely less biased than data collected through questionnaires and surveys [60].

The authenticity and relevance of the subreddits are also maintained by the moderators, who delete off-topic posts (‘submissions’). Unlike many other social media platforms, Reddit allows lengthy submissions. This permits users to express their sentiments in detail and either seek or provide support, making Reddit an ideal source for researchers seeking profound insights into sensitive aspects of the human psyche.

Reddit’s Role in Mental Health Research

Reddit has become an increasingly recognized platform in mental health research due to its facilitation of open and detailed user discussions, offering a valuable environment for the study of mental health conditions, such as depression and anxiety. Moreover, its utility extends to capturing real-time mental health data, which proved to be especially relevant during the COVID-19 pandemic [61].

The affirmation of Reddit as a rich source of authentic mental health narratives, providing both depth and confidentiality, further validates our selection of this platform for comprehensive mental health research. Capitalizing on these qualities, our study delves into the underlying patterns and sentiments that characterize mental health discourse, aiming to uncover more profound insights.

3.1.3. Time Frame Selection

Our study’s time frame, extending from January 2019 to August 2022, was strategically chosen to encompass the pre-pandemic, during-pandemic, and emerging post-pandemic phases. This deliberate choice was informed by research findings, which highlighted significant psychological distress and increased social media use during the pandemic [62]. These studies provided insights into changing themes and sentiments in social media discussions during the pandemic, including heightened levels of anxiety and depression [63,64]. By capturing this broad spectrum of mental health discourse on Reddit, our study aims to provide a comprehensive analysis that reflects the dynamic shifts in public sentiment and discussion patterns during these critical periods.

Pre-Pandemic (January 2019–December 2019): This period serves as a baseline, offering a perspective on mental health discussions before the global awareness and impact of COVID-19.
Mid-Pandemic (January 2020–December 2021): This phase captures the height of the pandemic’s impact on mental health. It is a critical period for understanding the immediate reactions, coping mechanisms, and the evolving nature of mental health discussions influenced by the pandemic’s progression, lockdowns, and social distancing measures.
Emerging Post-Pandemic (January 2022–August 2022): Chosen to explore the early transition into the post-pandemic phase, this time frame provides insights into the shifts towards a new normal in mental health discourse.
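
The three periods above can be expressed as a small lookup that maps a post's creation time to its period. The following is a minimal sketch, not the authors' code: the function name `period_for` and the period label strings are illustrative assumptions, and the boundaries are taken directly from the date ranges listed above.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Period boundaries from Section 3.1.3 (inclusive start, exclusive end).
# The label strings are illustrative, not taken from the released dataset.
PERIODS = [
    ("pre-pandemic", datetime(2019, 1, 1, tzinfo=UTC), datetime(2020, 1, 1, tzinfo=UTC)),
    ("mid-pandemic", datetime(2020, 1, 1, tzinfo=UTC), datetime(2022, 1, 1, tzinfo=UTC)),
    ("emerging post-pandemic", datetime(2022, 1, 1, tzinfo=UTC), datetime(2022, 9, 1, tzinfo=UTC)),
]


def period_for(created_utc):
    """Return the study period for a Unix timestamp, or None if out of range."""
    moment = datetime.fromtimestamp(created_utc, tz=UTC)
    for name, start, end in PERIODS:
        if start <= moment < end:
            return name
    return None
```

Using half-open intervals avoids assigning a post created exactly at midnight on a boundary date to two periods.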

3.1.4. Application Programming Interface (API) for Data Extraction

Following our selection of Reddit as the preferred platform, we focused on utilizing the most effective data extraction tools, primarily Pushshift and PRAW. Pushshift, in particular, offers enhanced capabilities over Reddit’s official API (which PRAW, the Python Reddit API Wrapper, accesses) by allowing access to extensive historical data and larger query limits. Jason Baumgartner’s development of Pushshift has significantly facilitated the collection and archiving of Reddit data, making it a valuable tool for large-scale research projects [65].

The effectiveness of Pushshift as a tool for academic research is established through its capability to compile complete and unbiased datasets, offering a more thorough data collection process than alternatives [66]. Its application in mental health research, particularly in extracting Reddit data for studies on suicide ideation detection, has demonstrated Pushshift’s accuracy and efficiency in managing complex data requirements [67]. These qualities render Pushshift an indispensable component of our research methodology, allowing for the effective utilization of Reddit’s extensive data while upholding user privacy and data integrity.

3.1.5. Data Collection

We used the Pushshift API (https://github.com/pushshift/api) to collect posts and corresponding metadata from chosen mental health subreddits (data accessed periodically between May and September 2022). To select subreddits that focus on mental health issues, we utilized Reddit’s search feature and identified the top five mental health disorder subreddits with the largest memberships, as detailed in Table 1. Our data extraction was targeted towards Reddit posts from these subreddits across three distinct periods:

Pre-Pandemic;
Mid-Pandemic;
Post-Pandemic.

For an initial assessment of content relevancy and volume, we extracted posts from five subreddits for the month of January in 2020, 2021, and 2022 as the sample. Our preliminary analysis showed that 99 percent of the sampled subreddit content directly addressed mental health concerns. A more detailed breakdown of these posts can be found in Table 2 and Figure 1.

Table 3 reports an average post length of 103 words for the mental health posts. This figure is the number of tokens remaining after pre-processing.

Data were downloaded from five subreddits, namely r/depression, r/SuicideWatch, r/mentalhealth, r/anxiety, and r/lonely, as indicated in Table 1. The scope of the study was limited to original posts, thereby excluding comments and images. The posts were then filtered to retain only those written in English. In total, we extracted 1,494,019 posts from 1 January 2019 to 31 August 2022, as shown in Figure 2.
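
A collection run over this date range can be planned as one request window per subreddit per month. The sketch below only generates those windows as Unix-epoch bounds; it makes no network calls and does not reproduce the authors' extraction script. The function name `month_windows` and the job tuple layout are assumptions, and how the bounds are passed to an archive query interface is left open.

```python
from datetime import datetime, timezone


def month_windows(start_year, start_month, end_year, end_month):
    """Yield (year, month, after_epoch, before_epoch) for each month, ends inclusive."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        after = int(datetime(year, month, 1, tzinfo=timezone.utc).timestamp())
        before = int(datetime(next_year, next_month, 1, tzinfo=timezone.utc).timestamp())
        yield year, month, after, before
        year, month = next_year, next_month


# The five subreddits named in the text.
SUBREDDITS = ["depression", "SuicideWatch", "mentalhealth", "anxiety", "lonely"]

# January 2019 through August 2022, as stated in the text.
windows = list(month_windows(2019, 1, 2022, 8))

# One hypothetical collection job per subreddit per month.
jobs = [(sub, y, m, a, b) for sub in SUBREDDITS for (y, m, a, b) in windows]
```

Splitting the range by month also matches the monthly folder layout described in the dataset organisation.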

3.1.6. Dataset Organisation

Our dataset is an organized collection of Reddit posts, stemming from five key subreddits related to mental health: r/anxiety, r/depression, r/mentalhealth, r/SuicideWatch, and r/lonely. These subreddits were selected for their focused discussions on mental health, providing a valuable pool of data for relevant research.

We have organized the dataset chronologically, dividing it into annual folders spanning from January 2019 to August 2022. Each year is further segmented into months, with corresponding monthly folders containing subreddit-specific CSV files. Our complete dataset is divided into two major sections. The first section (Part A) contains the raw Reddit posts in their entirety. The second section (Part B) serves as the cornerstone of our dataset, containing a carefully annotated subset of posts.

Structure of Part A—Raw Data

Each CSV file in our dataset includes the following columns, providing a detailed view of the Reddit posts along with essential metadata:

Author: The username of the Reddit post’s author.
Created UTC (Coordinated Universal Time): The UTC timestamp of when the post was created.
Score: The net score (upvotes minus downvotes) of the post.
Selftext: The main text content of the post.
Subreddit: The subreddit from which the post was sourced.
Title: The title of the Reddit post.
Timestamp: The local date and time when the post was created, converted from the UTC timestamp.

This structured approach allows researchers to conduct detailed, time-based analyses and to easily access data from specific subreddits.
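
As an illustration of how the Part A files might be read, the sketch below builds a path for one monthly file and parses its rows. The exact folder names and CSV header spellings are not given in the text, so the `<root>/<year>/<month>/<subreddit>.csv` layout and the lower-case headers (`author`, `created_utc`, `score`, `selftext`, `subreddit`, `title`) are assumptions that should be checked against the released files.

```python
import csv
import io
from datetime import datetime, timezone
from pathlib import Path


def monthly_csv_path(root, year, month, subreddit):
    """Build a path under the assumed <root>/<year>/<month>/<subreddit>.csv layout."""
    return Path(root) / str(year) / f"{month:02d}" / f"{subreddit}.csv"


def read_posts(handle):
    """Read rows from an open CSV handle and attach typed fields."""
    posts = []
    for row in csv.DictReader(handle):
        row["score"] = int(row["score"])
        # Convert the Unix timestamp to a timezone-aware UTC datetime.
        row["created_dt"] = datetime.fromtimestamp(int(row["created_utc"]), tz=timezone.utc)
        posts.append(row)
    return posts


# A one-row in-memory stand-in for a real file; the values are made up.
sample = io.StringIO(
    "author,created_utc,score,selftext,subreddit,title\n"
    "user_a,1546300800,3,Example body text,anxiety,Example title\n"
)
posts = read_posts(sample)
```

For a file on disk, the same function can be called as `read_posts(open(path, newline="", encoding="utf-8"))`.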

Structure of Part B—Labelled Data

Part B of our dataset, which includes a subset of 800 manually annotated posts, is structured differently to provide focused insights into the mental health discussions. The columns in Part B are as follows:

Score
Selftext
Subreddit
Title
Label: The assigned label indicating the identified root cause of mental health issues, based on our annotation process.

The core strength of our dataset lies in this labeled section (Part B), where we have manually labeled 800 posts, according to the following four critical root cause categories:

Drug and alcohol;
Early life;
Personality factors;
Trauma and stress.

These categories represent key factors that contribute to mental health issues. This labelled subset of posts is of significant value, allowing researchers to delve deeper into the root causes of mental health concerns and to gain richer insights.

In compliance with privacy and ethical standards, any mentions of ages 1–18 in posts have been redacted. The posts themselves have been retained in the dataset, but any reference to this age range has been carefully removed. This step was taken to uphold the highest standards of research ethics and data privacy, ensuring that our dataset is not only informative but also respectful and responsible in its representation of individuals’ experiences as shared on social media. The accessible link is provided in the Dataset Availability Statement of this paper.
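
The text does not describe how the age redaction was carried out. One possible approach is a pattern-based pass, sketched below under stated assumptions: the pattern, the placeholder string, and the function name `redact_minor_ages` are all hypothetical, and the pattern only covers a number from 1 to 18 followed by a common age suffix. A real redaction pass would need broader patterns and manual review.

```python
import re

# Matches ages 1-18 followed by a common suffix, e.g. "16 years old", "9yo", "17 y/o".
# This is an illustrative pattern, not the rule set used for the released dataset.
AGE_PATTERN = re.compile(
    r"\b(?:1[0-8]|[1-9])\s*(?:years?\s*old|yrs?\s*old|y/?o)\b",
    flags=re.IGNORECASE,
)


def redact_minor_ages(text, placeholder="[AGE REDACTED]"):
    """Replace matched age mentions with a placeholder, leaving the rest of the post intact."""
    return AGE_PATTERN.sub(placeholder, text)
```

The post itself is kept and only the matched span is replaced, which mirrors the statement that posts were retained while age references were removed.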

3.2. Data Annotation

To effectively leverage machine learning for text classification, the critical first step is data annotation, a process that provides meaningful labels to our unstructured text data. In constructing a dataset that aids in the development of machine learning models for identifying root causes of mental health issues from social media content, we draw inspiration from innovative approaches in the field. One such study [68] aligns with our goal, as it exemplifies the categorization of complex mental health concepts through manual labeling, providing a framework that we can adapt for identifying and understanding the root causes of mental health discussions in social media posts.

3.2.1. Initial Sub Dataset Assembly and Stratification

Our data annotation phase commenced with the careful assembly of an initial sub dataset, encompassing a month’s worth of posts from five targeted subreddits: r/anxiety, r/lonely, r/SuicideWatch, r/depression, and r/mentalhealth. This deliberate selection aimed to encapsulate a broad range of mental health discussions, ensuring a comprehensive and varied dataset for analysis.

Stratified Sampling Methodology

Recognizing the labor-intensive nature of manual labeling, our strategy focuses on achieving a manageable yet representative subset of 800 posts after completing the annotation process. We adopted a stratified sampling approach, similar to the method proposed for studying social media communities [69]. This process involves dividing the larger dataset into distinct strata, each aligning with one of the five subreddits. Posts within these strata are randomly selected based on keyword analysis, a method that ensures the proportional and fair representation of each subreddit in our eventual annotated sample.
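
The idea of proportional stratified sampling can be sketched as follows. This is a simplified illustration rather than the authors' procedure: it draws uniformly at random within each subreddit and omits the keyword-based selection step mentioned above. The function name, the `seed` parameter, and the use of largest-remainder rounding are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(posts, total, key="subreddit", seed=0):
    """Draw `total` posts with per-stratum counts proportional to stratum size."""
    if total > len(posts):
        raise ValueError("cannot sample more posts than are available")
    strata = defaultdict(list)
    for post in posts:
        strata[post[key]].append(post)
    population = len(posts)
    # Largest-remainder allocation so the quotas sum exactly to `total`.
    exact = {name: total * len(items) / population for name, items in strata.items()}
    quotas = {name: int(share) for name, share in exact.items()}
    leftover = total - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - quotas[name], reverse=True)
    for name in by_remainder[:leftover]:
        quotas[name] += 1
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sample = []
    for name, items in strata.items():
        sample.extend(rng.sample(items, quotas[name]))
    return sample
```

Fixing the seed makes the subset reproducible, which is useful when the same posts must be handed to more than one annotator.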

Balanced Distribution across Labels

Upon completion of the annotation process, we intended to apply a further level of rigor in distributing the annotated posts. Our goal was to evenly allocate these 800 posts across four predetermined labels, with each label corresponding to a key mental health root cause. The plan was for each label to encompass an equal share of the dataset, amounting to 200 posts per label. This approach is vital to maintain balance and diversity in our dataset, which is crucial for minimizing potential biases and ensuring the equitable representation of various mental health issues.

3.2.2. Annotator Selection and Training

In line with practices observed in mental health research, we carefully selected our annotators to ensure a blend of domain expertise and technical skills, vital for the annotation process [70,71]. The first annotator holds a Master’s degree in Computer Science and has experience as a business analyst and resolution specialist, providing a strong analytical and problem-solving skill set. The second annotator, holding a Ph.D. in Machine Learning, contributed extensive experience from previous projects, emphasizing the significance of annotated datasets for machine learning applications.

Both team members received comprehensive training from a psychiatry domain expert. This approach is akin to studies where domain experts, such as clinicians or psychologists, often collaborate with or guide researchers in the annotation process [72,73]. Our domain expert’s extensive experience, especially in mental health across diverse cultural contexts, was instrumental in shaping our research methodology. The expert’s proficiency in navigating mental health complexities provided our annotators with the necessary context and depth, aligning with practices where collaborations between domain experts and computer science researchers are common [74].

The training program was carefully structured to be iterative, allowing the annotators to progressively refine their understanding and application of the guidelines. Training focused on identifying keywords and interpreting their context within the posts, ensuring that the root cause categories were not just mechanically applied but were truly reflective of the discussions’ essence. Table 4 provides examples of posts with their associated root cause labels, along with the rationale for each annotation. The domain expert guided the annotators through multiple rounds of practice annotations, followed by feedback sessions to calibrate their interpretations and ensure alignment with the study’s objectives. The objective of this rigorous training was to minimize subjective bias and enhance annotation consistency, laying a reliable foundation for our machine learning analysis.

3.2.3. Annotation Guidelines

The complexity of mental health issues demands a precise and nuanced approach to data annotation. To address this, we developed a multi-class label framework for our annotations, drawing on established root cause categories from recognized mental health resources like Health Direct Australia and Beyond Blue [75].

From these resources, six predominant root causes were discerned, as delineated in Figure 3. This serves as a guide for root cause identification. For the purposes of our study, we selected four out of six root causes: personality factors, drug and alcohol abuse, early life environment, and trauma and stress. These categories were chosen due to their broad representation of areas of concern in mental health, their frequent discussion in both academic literature and public discourse, and their comprehensive encapsulation of the potential root causes of mental health issues [76,77,78,79,80,81,82].

Annotators were instructed on how to detect the presence of keywords within the posts and, more critically, on interpreting the broader context in which these keywords were used. This approach ensured that the root cause categories were applied accurately, reflecting the true essence of the users’ discussions. For a subset of these posts, Table 4 provides a detailed rationale, linking individual posts to their corresponding root cause labels, illustrating the application of our annotation framework in practice.

To maintain consistency and reliability, our guidelines also included protocols for addressing ambiguous cases by employing a consensus-based approach or through consultation with a domain expert if the consensus failed. Quality control was upheld through measures like inter-annotator agreement checks and periodic guideline reviews by the domain expert.
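
The text does not name the agreement statistic used. For two annotators assigning one categorical label per post, Cohen's kappa is a common choice, so the sketch below uses it as an assumption rather than as the authors' reported metric. It computes kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: sum over labels of the product of marginal proportions.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Posts on which the two labels differ are the natural candidates for the consensus discussion or expert consultation described above.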

The design of our annotation guidelines was a critical juncture in our study, necessitating a profound understanding of the subject matter, an exhaustive review of the literature, and a thoughtful consideration of the practical aspects of the annotation process. The outcome was a set of guidelines that provided a clear and robust framework for the annotation of our dataset.

3.2.4. Annotation Process

Our methodology for pinpointing the fundamental factors contributing to mental health challenges as depicted in social media content employed a multi-faceted approach, guaranteeing the reliability and uniformity of our annotated data. This encompassed a tripartite procedure: a semi-automated keyword analysis, a holistic examination of entire posts, and a rigorous manual annotation process, which is reflected in Figure 4.

Keyword-Based Analysis

The first step was the keyword-based analysis. Our methodology entailed a semi-automated keyword identification process using the search function. This involved systematically searching the dataset for a predefined set of keywords associated with each category, carefully selected and listed in Table 5, representing relevant traits or behaviors. For instance, ‘perfectionism’, ‘low self-esteem’, and ‘pessimistic’ were some of the keywords used for the category “personality factors”. The selection of these keywords was underpinned by the expert guidance of a consultant psychiatrist, ensuring a grounded and empirically-informed approach. The search function was designed to execute precise string-matching procedures, enabling the identification of relevant instances within the corpus of social media posts. This methodological approach allowed for the initial categorization of content, setting the stage for more nuanced analysis in subsequent stages.
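The string-matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the search function used in the study: the helper names `build_patterns` and `candidate_categories` are hypothetical, the example post is invented, and only the three "personality factors" keywords quoted in the text are included because the full lists in Table 5 are not reproduced here.

```python
import re

# Only the three keywords quoted in the text are used here; the complete
# per-category lists are in Table 5 and are NOT reproduced in this sketch.
CATEGORY_KEYWORDS = {
    "personality factors": ["perfectionism", "low self-esteem", "pessimistic"],
}


def build_patterns(category_keywords):
    """Compile one case-insensitive, whole-phrase pattern per category."""
    patterns = {}
    for category, keywords in category_keywords.items():
        alternatives = "|".join(re.escape(keyword) for keyword in keywords)
        # The lookarounds stop a keyword from matching inside a longer word.
        patterns[category] = re.compile(
            r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE
        )
    return patterns


def candidate_categories(post, patterns):
    """Return the sorted categories whose keywords occur in the post."""
    return sorted(cat for cat, pattern in patterns.items() if pattern.search(post))


patterns = build_patterns(CATEGORY_KEYWORDS)
# Invented example text, not a post from the dataset.
example_post = "My perfectionism makes every small mistake feel like a disaster."
print(candidate_categories(example_post, patterns))
```

A keyword hit only marks a post as a candidate for a category; whether the category actually applies is decided in the later stages of the annotation process.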

Whole Post Analysis

After the initial phase of keyword identification, we proceeded to whole post analysis. This stage encompassed an exhaustive review of the entirety of each post, ensuring that the identified root causes were congruent with the broader context and narrative presented. This step was instrumental in determining whether the root cause was indeed the focal point of the post, adhering to our predefined annotation guidelines. Such a comprehensive evaluation was imperative, not only for validating the accuracy of our preliminary keyword-based insights, but also for delving deeper into the nuanced experiences and expressions of the authors. This process allowed for a more holistic understanding of the content, beyond the surface-level keyword occurrences.

Manual Labelling

In the final step of the annotation process, each post was assigned a single category label from the defined root causes. Posts covering multiple categories were excluded, to maintain the integrity of the labelling process. This approach allowed for a more focused and interpretable dataset, where each post provides unambiguous information about a specific root cause.
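The single-label rule can be expressed compactly. The sketch below is illustrative only: `assign_single_label` is a hypothetical helper, the post identifiers and label lists are invented, and treating a post with no applicable category as unlabelled is an assumption that the text does not state.

```python
def assign_single_label(candidate_labels):
    """Return the label if exactly one root cause applies, otherwise None."""
    unique = set(candidate_labels)
    return next(iter(unique)) if len(unique) == 1 else None


# Invented review outcomes: the root causes judged to apply to each post.
reviewed = {
    "post_a": ["trauma and stress"],
    "post_b": ["personality factors", "early life environment"],  # multi-category
    "post_c": [],  # assumption: posts with no category are left unlabelled
}

labelled = {}
for post_id, labels in reviewed.items():
    label = assign_single_label(labels)
    if label is not None:
        labelled[post_id] = label

print(labelled)  # {'post_a': 'trauma and stress'}
```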

3.2.5. Validation and Quality Assurance

After the completion of the annotation process, our study progressed into the crucial phase of validation and quality assurance. The credibility and analytical utility of our annotated dataset depended significantly on a thorough validation process. Our methodology rests on a consensus-based approach, complemented by inter-annotator agreement (IAA) measures [83]. These measures were critical in ensuring both the reliability and objectivity of our annotations. They were instrumental not only in verifying the accuracy of our initial annotations but also in continuously improving our annotation guidelines. This further contributed to the overall consistency and dependability of our dataset. The subsequent sections outline in detail how we integrated IAA measures into our annotation workflow, thereby reinforcing the methodological integrity of our research.

Consensus-Based Approach

Our two annotators, equipped with comprehensive guidelines, independently classified each post into one of four mental health root causes: personality, trauma, drug and alcohol, and early life. However, independent classification was only the first step. The key aspect of our methodology was the requirement for a unanimous consensus between annotators on each post, which was particularly crucial in cases of initial disagreement. Such dialogue was instrumental in capturing the intricate details of both the subjective experiences conveyed by the authors and the objective circumstances of their mental health situations.

During the dialogue, both annotators were tasked with identifying whether the author of the post was expressing one or more of the root causes, and whether the overall situation described in the post fell under these categories, while following the annotation strategy. This dual focus allowed us to capture both the subjective experiences of the authors and the objective circumstances of their situations, providing a more comprehensive view of the root causes of mental health issues.
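As a rough illustration of how independent labels might be separated into agreed posts and posts that need a consensus discussion, consider the sketch below. The function name `split_by_agreement`, the post identifiers, and the labels are all hypothetical; the study's actual tooling is not described in the text.

```python
def split_by_agreement(labels_a, labels_b):
    """Split posts into agreed labels and post ids that need discussion."""
    if labels_a.keys() != labels_b.keys():
        raise ValueError("both annotators must label the same posts")
    agreed = {}
    disputed = []
    for post_id in sorted(labels_a):
        if labels_a[post_id] == labels_b[post_id]:
            agreed[post_id] = labels_a[post_id]
        else:
            disputed.append(post_id)
    return agreed, disputed


# Invented toy labels using the four category names from the text.
annotator_1 = {"p1": "trauma", "p2": "early life", "p3": "personality"}
annotator_2 = {"p1": "trauma", "p2": "trauma", "p3": "personality"}

agreed, disputed = split_by_agreement(annotator_1, annotator_2)
print(disputed)  # ['p2']
```

Posts in `disputed` would then go to the consensus dialogue and, if that fails, to the domain expert.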

Inter-Annotator Agreement (IAA) Measures

The cornerstone of our quality assurance process involved the implementation of inter-annotator agreement (IAA) measures. Following the first phase of independent annotation, we compiled an IAA contingency table (Table 6), which we used to quantitatively assess the level of agreement between the annotators. By employing Cohen’s Kappa, we moved beyond mere percentage agreement to a statistical measure that accounts for the probability of chance agreement. This step was pivotal, especially when addressing the initial annotator disagreement [84]. Our annotated dataset consisted of 800 posts; the annotators disagreed on 78 posts and agreed on the remaining 722.

The Cohen’s Kappa calculation was conducted as follows:

Total number of posts: 800 (722 agreements + 78 disagreements);
Observed agreement $P_{o}$: the proportion of posts on which both annotators agreed (722 of the 800 posts);
Expected agreement by chance $P_{e}$: calculated from the proportion of each category chosen by each annotator.

$$\kappa = \frac{P_{o} - P_{e}}{1 - P_{e}} \qquad (1)$$

Our analysis yielded a Cohen’s Kappa value of approximately 0.869, indicating a very high level of agreement beyond chance. This value supports the reliability of our annotations and the robustness of our annotation process.
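To make the calculation in Equation (1) concrete, the following sketch computes Cohen’s Kappa from a square contingency matrix. It is illustrative only: `cohen_kappa` is a hypothetical helper and `toy_matrix` is an invented 2 × 2 example, not the values of Table 6. The last lines use only the figures reported in the text (722 agreements out of 800 posts and κ ≈ 0.869) to back out the chance agreement $P_{e}$ that those figures imply.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square matrix (rows: annotator 1, columns: annotator 2)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("contingency matrix is empty")
    # Observed agreement: share of items on the diagonal.
    p_o = sum(matrix[i][i] for i in range(size)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    # Chance agreement: sum over categories of the product of marginal shares.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (p_o - p_e) / (1 - p_e)


# Invented toy matrix, NOT Table 6: P_o = 0.7 and P_e = 0.5.
toy_matrix = [[20, 5], [10, 15]]
print(round(cohen_kappa(toy_matrix), 3))  # 0.4

# Figures reported in the text.
p_o_reported = 722 / 800          # 0.9025
kappa_reported = 0.869
# Rearranging Equation (1) for P_e gives (P_o - kappa) / (1 - kappa).
implied_p_e = (p_o_reported - kappa_reported) / (1 - kappa_reported)
print(round(implied_p_e, 3))
```

The implied chance agreement is close to 0.25, which is what one would expect if both annotators spread their labels roughly evenly over four categories. The function does not guard against the degenerate case $P_{e} = 1$, where kappa is undefined.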

Domain Expert Consultation

Recognizing the complexity of mental health narratives, we further strengthened our annotation process with domain expert consultation. In instances where the content of the posts presented exceptional complexity, or when the annotators’ discussions reached a deadlock, we sought the insights of a domain expert. This step was invaluable, providing clarity and guidance that reinforced the annotation process and contributed to the dataset’s depth and accuracy.

The integration of IAA measures into our annotation workflow was fundamental to our research methodology, ensuring that our dataset was a dependable reflection of the mental health narratives from the selected subreddits. Through this rigorous process, we upheld the highest standards of data quality, paving the way for credible research outcomes.

4. Results

4.1. Temporal Trends in Mental Health Discourse on Reddit

Our comprehensive analysis of subreddit activity revealed significant patterns in public engagement with mental health discussions. A heatmap visualizing the average post count per phase allowed us to identify distinct user engagement trends across the pre-pandemic, mid-pandemic, and post-pandemic periods, as shown in Figure 5.
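The quantity behind such a heatmap, the average post count per subreddit and phase, could be computed along the following lines. This is a sketch under stated assumptions rather than the study's pipeline: the function names are hypothetical, the phase boundaries and posts below are placeholders chosen only for illustration (they are not the study's cut-off dates or data), and the average is taken over months that contain at least one post.

```python
from collections import defaultdict
from datetime import date


def phase_of(day, phases):
    """Return the name of the phase whose inclusive date range contains day."""
    for name, start, end in phases:
        if start <= day <= end:
            return name
    return None


def average_monthly_posts(posts, phases):
    """Map (subreddit, phase) to the mean number of posts per active month."""
    monthly_counts = defaultdict(int)
    for subreddit, day in posts:
        phase = phase_of(day, phases)
        if phase is not None:
            monthly_counts[(subreddit, phase, day.year, day.month)] += 1
    per_cell = defaultdict(list)
    for (subreddit, phase, _year, _month), count in monthly_counts.items():
        per_cell[(subreddit, phase)].append(count)
    return {cell: sum(counts) / len(counts) for cell, counts in per_cell.items()}


# Placeholder phases and posts, NOT the study's actual dates or data.
toy_phases = [
    ("phase A", date(2000, 1, 1), date(2000, 2, 29)),
    ("phase B", date(2000, 3, 1), date(2000, 4, 30)),
]
toy_posts = [
    ("r/example", date(2000, 1, 5)),
    ("r/example", date(2000, 1, 20)),
    ("r/example", date(2000, 2, 3)),
    ("r/example", date(2000, 3, 10)),
]

averages = average_monthly_posts(toy_posts, toy_phases)
print(averages[("r/example", "phase A")])  # 1.5
print(averages[("r/example", "phase B")])  # 1.0
```

Each (subreddit, phase) value corresponds to one cell of a heatmap such as Figure 5.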

4.1.1. Pre-Pandemic Baseline

In the pre-pandemic period, subreddit activity established a baseline, with the r/depression subreddit notably leading in monthly post averages. Activity in the r/anxiety and r/SuicideWatch subreddits was lower but still considerable, suggesting ongoing conversations about these topics.

4.1.2. Mid-Pandemic Surge

The mid-pandemic phase marked an escalation in posting frequency. The r/anxiety subreddit, in particular, exhibited a noticeable increase, signaling the intensifying public concern over mental well-being due to the pandemic’s stressors. This surge aligns with patterns observed in the broader literature on mental health and social media usage [85], indicating a correlation between increased social media engagement and globally reported anxiety symptoms. The r/SuicideWatch subreddit also saw a significant increase in activity, possibly mirroring the intensification of mental health crises during this time. These patterns highlight the critical role of online communities in offering support and fostering discussion amid increased anxiety and isolation due to the pandemic.

4.1.3. Post-Pandemic Engagement

In the post-pandemic period, there was a decrease in the frequency of posts; however, activity levels remained higher than the pre-pandemic figures. This sustained interaction, particularly within the r/anxiety and r/SuicideWatch subreddits, suggests a lasting psychological impact of the pandemic and hints at the emergence of new communication norms concerning mental health after this global crisis.

4.1.4. Heatmap Analysis of Engagement Patterns

The heatmap provides a clear illustration of these shifting trends in Figure 5. The transition from darker to lighter colors from the mid- to post-pandemic phases, without reverting to pre-pandemic levels, visually affirms the enduring effect of the pandemic on mental health discussions. The color intensities correspond to the volume of posts, with darker shades indicating higher activity. Particularly, the r/depression subreddit maintained a consistently high level of posts throughout all phases, suggesting that it remained a critical space for discourse throughout the pandemic. The r/anxiety subreddit showed a prominent peak during the mid-pandemic period, reflecting the spike in public concern and discussion related to anxiety. Interestingly, the data also showed that the discussion remained more active in the post-pandemic phase compared to the pre-pandemic phase, indicating a lasting shift in engagement patterns. The r/SuicideWatch subreddit also marked an increase in activity during the mid-pandemic period, which aligned with broader societal concerns about mental health crises during this challenging time. Again, post-pandemic activity in this subreddit did not return to pre-pandemic levels, suggesting a sustained need for crisis support and discussion spaces.

This observation could imply that the pandemic may have permanently changed how individuals discuss and approach mental health on public forums. It also sets the stage for further research into the effectiveness of online communities in providing support, the potential changes in the stigma surrounding mental health discussions, and the broader societal acknowledgment of mental health challenges.

4.2. Annotation Consistency and Discrepancies

Our data annotation process revealed significant trends in inter-annotator agreement. Our findings, which are summarized in the contingency table (Table 6), showed both concordance and discordance between the two annotators across four key mental health root cause categories. Notably, the “trauma and stress” category exhibited substantial concordance, with the annotators reaching a 92.5 percent agreement rate. This category comprises a diverse range of factors, including domestic violence, relationship issues, financial strain, work-related stress, and feelings of loneliness. For instance, posts P4–P6 in Table 4 show that the “trauma and stress” label can be attributed to expressions of financial hardship, social isolation, and relational dissolution.

Conversely, distinguishing between the “personality” and “trauma and stress” categories presented challenges, leading to noticeable discrepancies. Differentiating posts regarding “early life” factors also showed some inconsistencies, while the “drug and alcohol” category displayed the least disagreement, indicating more distinct criteria for classification in this area.

The post in Table 7 presents a case where the subjective nature of mental health narratives can lead to initial disagreements among annotators. Annotator 2 perceived the post as indicative of “trauma and stress”, due to the mention of the loss of loved ones, a classification typically associated with impactful events causing significant emotional distress. However, Annotator 1 identified the enduring nature of the poster’s grief, tracing its origin to adverse experiences in childhood, thus categorizing it under “early life” factors.

The distinction here lies in the recognition that the user has carried the stress from childhood into adulthood, an interpretation that aligns with the “early life” label, which encompasses formative experiences that have a lasting influence on an individual’s mental health. This is in contrast to “trauma and stress”, which often focuses on more recent events and their immediate impact.

Such disagreements were systematically resolved by engaging in a consensus-based discussion or, when necessary, consulting with a domain expert. Through these resolution mechanisms, the team was able to reconcile different interpretations and achieve a unified understanding, ensuring the reliability and consistency of the dataset’s categorization.

The obtained Cohen’s Kappa score of 0.869 demonstrated near-perfect agreement between the annotators, testifying to the effectiveness of our iterative training protocol and the thoroughness of our annotation guidelines. Despite this overall high agreement rate, the areas of disagreement provide crucial insights into the complex subtleties present in mental health discourse. These discrepancies are invaluable for refining our annotation framework and highlighting the necessity of ongoing improvements to capture the nuanced spectrum of mental health narratives accurately.

5. Discussion

5.1. Platform Selection and Its Implications for Mental Health Discourse

Our strategic selection of Reddit as the data source was pivotal in capturing mental health discourse. This decision was made with a clear recognition of Reddit’s distinctive environment, which naturally encourages genuine, open dialogues on topics related to mental health. Reddit’s anonymity encourages open conversations, providing insights less likely to be influenced by the biases of identity-driven platforms. This environment has allowed us to collect data that offer an open look at mental health issues, in contrast to platforms like Facebook. Anonymity on Reddit has mitigated the social desirability bias, a common issue in mental health research where stigma can influence self-reporting. The resultant data quality provides a more genuine representation of mental health states.

Moreover, the extensive textual submissions available on Reddit have paved the way for a potential longitudinal analysis. The evolution of mental health discussions from the pre-pandemic period to the anticipated post-pandemic phase presents an opportunity to chronicle the shifts in public sentiment and communication norms within mental health discourse. This long-term view is indispensable, offering a narrative compass to guide the development of future mental health strategies and support mechanisms within the digital space. It plays a crucial role in developing responsive and resilient mental health support systems that are aligned with the real-world experiences of individuals.

5.2. Insights from Time Frame Analysis

Our investigation has tracked posting volume trends, setting the stage for subsequent in-depth analysis. The temporal scrutiny is pivotal in establishing a foundational ‘normal’ against which the variances observed in later pandemic stages can be contextualized. The pre-pandemic baseline serves as an anchor point, providing a control for comparative insights into the transformative influences of the pandemic on mental health discourse.

The midst of the pandemic offers a lens through which to view the immediate psychological responses, adaptive coping strategies, and the dynamic evolution of mental health dialogue, as shaped by the pandemic’s unfolding events. This period stands as a vital reference point for future research to conduct a contextual dissection of discourse, uncovering layers of adaptive behaviors and support-seeking dynamics that emerged as direct responses to the pandemic’s societal disruptions.

The period following the pandemic’s peak, marked by the commencement of vaccine distribution, the relaxation of restrictions, and the gradual resumption of day-to-day life, signalled the potential for a return to pre-pandemic discursive patterns or, alternatively, the start of enduring conversational shifts. This is a critical juncture for future analyses to investigate the long-term narrative consequences on mental health dialogue, potentially illuminating how public discourse has been reshaped in the face of a retreating global crisis.

Selecting these distinct time frames strategically positions future research to probe deeper into the lasting reverberations of the pandemic on the narrative landscape of mental health. This long-term examination paves the way for a comprehensive understanding of how public discussions around mental health have shifted, adapting to and reflecting the collective sentiment and the crystallization of emergent communication conventions in the wake of a global crisis.

5.3. Implications of Temporal Trends on Mental Health Discourse

Aligning with our observations, Alambo et al. conducted a longitudinal analysis of postings on various subreddits related to mental health and substance use disorders. Their study found a high correlation between postings in the depression subreddit and the pandemic, indicating a distinct impact on mental health during this period [86].

We have documented a sustained increase in post volumes across various subreddits post-pandemic, suggesting that collective narratives around mental health are evolving. Bak et al. complement these findings by reporting a significant rise in depression-related discussions in loneliness subreddits during the pandemic. The nature of these discussions shifted from dating to online interaction and community support, reflecting a broader societal trend towards seeking digital social support during times of isolation [87]. Such changes carry profound implications for healthcare providers and policymakers, emphasizing the need for interventions and support systems that align with the contemporary landscape of public mental health discourse.

In our examination of community engagement in mental health conversations, the study spanned multiple subreddits and traced how these discussions have developed through the pandemic’s different stages. For instance, there was a consistent, notable focus on the r/loneliness subreddit, emphasising a continuous demand for community support mechanisms throughout the pandemic timeline.

Contributing to this field, our study presents a novel and comprehensive dataset encompassing discussions from five subreddits. This extensive dataset not only broadens our understanding of mental health discussions on social media platforms but also serves as a unique longitudinal lens to assess the impact of COVID-19.

5.4. Technical Implications for Data Annotation Quality

The process of annotating mental health discussions highlighted challenges that annotators face, particularly when discerning between categories such as “personality” and “trauma and stress”. These discrepancies underscore the necessity of enhancing our training protocols and targeting areas of ambiguity, which will refine data categorization methods and significantly influence the accuracy of machine learning models in interpreting mental health language.

A consensus-based approach is vital, as evidenced by our contingency table analysis, which shows the importance of reconciling differing annotator perspectives to achieve a uniform understanding of labeling criteria. Our findings advocate for the further calibration of annotators and the refinement of guidelines to improve the consistency of annotations. The need for high inter-annotator reliability is imperative for the development of accurate machine learning models. This study points to the importance of ongoing training, iterative guideline improvement, and continuous expert consultation to ensure the quality of data annotation, especially in the complex domain of mental health.

Our research also provides a comprehensive view of the shifting landscape of mental health dialogue within digital spaces, set against the backdrop of the COVID-19 pandemic. This in-depth examination into the complexities of virtual communication not only identifies the challenges in classifying mental health narratives but also highlights emerging patterns that are likely to influence future post-pandemic discussions.

5.5. Long-Term Impact of the Pandemic on Mental Health Narratives

Our longitudinal dataset provides invaluable insights into how mental health discourse has evolved during the pandemic. The persistent activity in subreddits such as r/anxiety and r/SuicideWatch signals the potential for enduring psychological effects, which highlights the necessity for long-term mental health strategies. These strategies include developing predictive models for proactive resource allocation and tailoring interventions to the emerging needs highlighted by the data. The impact on adolescents is particularly noteworthy: according to one study, adolescents experienced heightened rates of anxiety, depression, and stress during this period [88].

The heatmap analysis, which illustrates the COVID-19 pandemic’s impact on mental health dialogues across various subreddits, offers a deep insight into public engagement trends. Notably, it depicts sustained and even heightened conversation levels throughout the pandemic, suggesting a lasting change in the way mental health is discussed online. This pattern of enhanced engagement provides critical narrative context that is poised to inform the development of mental health strategies and the dynamics of community support in digital spaces. Ongoing research monitoring these social media conversations is pivotal for guiding policy development and for establishing support systems attuned to the shifting realities of a society adapting to a post-pandemic world.

Future investigations are vital to comprehensively understand the prolonged effects of the pandemic on mental health discourse. As communities integrate new norms of communication, discerning these shifts will be crucial for the creation of support systems that are both adaptive and resilient. Our dataset acts as a methodological foundation for such inquiries, offering a detailed account of the narrative shifts in mental health discussions through the various stages of the pandemic. These insights not only aid in current and future policy formulation but also in the strategic planning of mental health interventions designed to meet the nuanced needs of diverse populations.

5.6. Policy Recommendations and Future Directions

Building on the study’s implications, our findings point to several actionable policy changes that could be instrumental for healthcare providers and policymakers. The persistent engagement in mental health discussions on social media, especially those related to themes of anxiety and isolation, presents an opportunity for public health campaigns to utilize these platforms more effectively in their strategies. We recommend the development of targeted mental health interventions that capitalize on the reach and immediacy of social media to provide timely support. Moreover, the notable increase in subreddit activity during the pandemic signifies the necessity for policies that encourage collaboration between online platforms and mental health professionals. Such collaboration could involve moderator training in mental health first aid, establishing direct links to professional help within these platforms, and implementing AI-driven tools for early identification and support. The study also emphasizes the need for policies to ensure the sustainability of online support communities as integral components of the mental health infrastructure. To meet the demand reflected by increased online discussions, there should be an expansion in investments for digital mental health services, including applications and teletherapy.

6. Limitations

This investigation into the discourse of mental health on Reddit offers substantial insights but is accompanied by several limitations. Predominantly, the data originates from user-generated content on Reddit, a platform whose demographics may not encompass the full diversity of the global population. Such demographic constraints potentially introduce biases, suggesting that our findings, while revealing trends within Reddit, may not extend seamlessly to wider mental health dynamics across more varied populations.

The study’s quantitative focus on post volumes, analyzed using heatmap visualization, captures engagement patterns but falls short of interpreting the qualitative depth of user interactions. This approach does not encompass the nuanced discourse quality, tone, or context emerging within online communities, thereby presenting a limited view of mental health discussions’ complexity.

A notable challenge is the categorization of complex mental health narratives, especially within the broad “trauma and stress” category. The diverse range of issues encapsulated in this category resulted in ambiguities in distinguishing these narratives from those related to “personality”. Future studies should consider breaking down “trauma and stress” into finer categories, to enhance clarity and reduce annotation overlaps.

Moreover, the limited scope of our annotated subset, with only 800 posts from a larger corpus, restricts a thorough analysis of the multifaceted mental health discourse on Reddit. Expanding annotation efforts in future research would enhance the robustness and generalizability of findings, offering a more detailed understanding and aiding in the development of more sophisticated machine learning models for automated analysis.

Another consideration is that our dataset may predominantly reflect transient emotional states, particularly due to the pandemic’s impact. The expressions of novelty, loneliness, and isolation on social media during this time may not consistently represent long-term mental health conditions, but rather immediate reactions to an unprecedented global event. Thus, while this study provides a valuable analysis of mental health discourse, it is crucial to consider that some observed trends might be episodic responses to the pandemic rather than indicative of a lasting change in mental health status.

In light of these limitations, future research should adopt a multifaceted approach, combining both quantitative and qualitative methods and utilizing a broader range of data sources. Such an approach is essential for a more holistic understanding of mental health discussions in the digital age, especially in the context of global crises like the COVID-19 pandemic. This comprehensive perspective will be crucial in informing effective interventions and policy development in mental health care.

7. Conclusions and Future Work

In this paper, we made a substantial contribution to the field of mental health discourse analysis by creating a large-scale dataset. This dataset not only provides an in-depth view of mental health discussions on social media but also lays the groundwork for the identification of root causes of mental health issues.

By shifting the focus from merely detecting the presence of mental health disorders to identifying their root causes, our research offers a more comprehensive understanding of mental health discourse online. This approach has the potential to significantly enhance mental health interventions by providing insights into the subtle linguistic patterns indicative of mental distress and the contextual factors influencing mental health discourse.

Looking ahead, we see immense potential for further research in this area. The emergence of transformer-based models has opened up new possibilities for extracting more insights from our annotated dataset. These advanced models, renowned for their ability to understand context and semantics in text data, could potentially enhance our understanding of mental health discussions online.

Moreover, the potential of our carefully annotated dataset extends to the development of predictive models. These models can be instrumental in identifying early indicators of mental health issues from user-generated content. For instance, by employing machine learning algorithms, we could predict the likelihood of a user experiencing a mental health crisis based on their social media posts. This predictive capability could be pivotal in facilitating timely interventions, offering support to individuals at critical moments.

In conclusion, our large dataset, complemented by a rigorously annotated subset, is a valuable resource for future research, opening up exciting possibilities for leveraging advanced machine learning models and predictive analytics to enhance our understanding of and intervention in mental health discussions online. We believe that our work represents a significant step forward in the field of mental health research, and we look forward to seeing how it will be utilized and built upon in the future.

Author Contributions

Conceptualization, S.R. and K.A.; methodology, S.R., K.A. and S.S.; validation, S.R. and K.A.; writing, S.R.; writing review and editing, K.A.; supervision, K.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the National Health and Medical Research Council (NHMRC) National Statement on Ethical Conduct in Human Research (2007) Updated 2018, and was approved by the Victoria University Human Research Ethics Committee on 29 May 2023 (ID: HRE23-005).

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is publicly available and can be accessed on Kaggle at https://rb.gy/ewtjy, accessed on 17 December 2023.

Acknowledgments

The authors extend their profound gratitude to Manjula Datta O’Connor for her invaluable contributions to this study. O’Connor’s extensive experience as a consultant psychiatrist, particularly her work addressing mental health implications in diverse cultural contexts, has been pivotal in guiding the research methodology. Her expertise in the intersections of mental health with gender and race, coupled with her roles at the University of Melbourne, UNSW School of Social Sciences, and as the Chair of the Family Violence Psychiatry Network at the Royal Australian New Zealand College of Psychiatrists, significantly enhanced the data annotation process. Her commitment to address unique mental health challenges across different populations has greatly influenced the depth and breadth of this research.

Conflicts of Interest

The authors declare that they have no financial interests related to this research, as it did not receive any specific grants from funding agencies in the public, commercial, or not-for-profit sectors.

References

Steel, Z.; Marnane, C.; Iranpour, C.; Chey, T.; Jackson, J.W.; Patel, V.; Silove, D. The Global Prevalence of Common Mental Disorders: A Systematic Review and Meta-Analysis 1980–2013. Int. J. Epidemiol. 2014, 43, 476–493.
Izadinia, N.; Amiri, M.; Jahromi, R.G.; Hamidi, S. A Study of Relationship Between Suicidal Ideas, Depression, Anxiety, Resiliency, Daily Stresses and Mental Health Among Tehran University Students. Procedia Soc. Behav. Sci. 2010, 5, 1615–1619.
Bloom, D.E.; Cafiero, E.T.; Jané-Llopis, E.; Abrahams-Gessel, S.; Bloom, L.R.; Fathima, S.; Feigl, A.B.; Gaziano, T.; Mowafi, M.; Pandya, A.; et al. The Global Economic Burden of Noncommunicable Diseases; World Economic Forum: Geneva, Switzerland, 2011.
Department of Health and Aged Care Australia. Mental Health and Suicide Prevention. 2022. Available online: https://www.health.gov.au/health-topics/mental-health-and-suicide-prevention (accessed on 15 March 2022).
Australian Institute of Health and Welfare. Death by Suicide. 2022. Available online: https://www.aihw.gov.au/suicide-self-harm-monitoring/data/deaths-by-suicide-in-australia/suicide-deaths-over-time (accessed on 15 March 2022).
Gillies, D.; Chicop, D.; O’Halloran, P. Root Cause Analyses of Suicides of Mental Health Clients: Identifying Systematic Processes and Service-Level Prevention Strategies. Crisis 2015, 36, 316.
Radloff, L.S. The CES-D Scale: A Self-Report Depression Scale for Research in the General Population. Appl. Psychol. Meas. 1977, 1, 385–401.
Marcus, M.; Yasamy, M.T.; van Ommeren, M.; Chisholm, D.; Saxena, S. Depression: A Global Public Health Concern; WHO: Geneva, Switzerland, 2012.
Collins, P.Y.; Patel, V.; Joestl, S.S.; March, D.; Insel, T.R.; Daar, A.S.; Bordin, I.A.; Costello, E.J.; Durkin, M.; Fairburn, C.; et al. Grand Challenges in Global Mental Health. Nature 2011, 475, 27–30.
World Health Organization. Mental Disorder. 2022. Available online: https://www.who.int/news-room/fact-sheets/detail/mental-disorders (accessed on 17 March 2022).
Healthdirect. Mental Illness. 2022. Available online: https://www.healthdirect.gov.au/mental-illness (accessed on 15 March 2022).
Boettcher, N. Studies of Depression and Anxiety Using Reddit as a Data Source: Scoping Review. JMIR Ment. Health 2021, 8, e29487.
Baheti, R.; Kinariwala, S. Detection and Analysis of Stress Using Machine Learning Techniques. Int. J. Eng. Adv. Technol. 2019, 9, 335–342.
Breland, J.Y.; Quintiliani, L.M.; Schneider, K.L.; May, C.N.; Pagoto, S. Social Media as a Tool to Increase the Impact of Public Health Research. Am. J. Public Health 2017, 107, 1890.
Calvo, R.A.; Milne, D.N.; Hussain, M.S.; Christensen, H. Natural Language Processing in Mental Health Applications Using Non-Clinical Texts. Nat. Lang. Eng. 2017, 23, 649–685.
Tannenbaum, C.; Lexchin, J.; Tamblyn, R.; Romans, S. Indicators for Measuring Mental Health: Towards Better Surveillance. Healthc. Policy 2009, 5, e177.
Lisitsa, E.; Benjamin, K.S.; Chun, S.K.; Skalisky, J.; Hammond, L.E.; Mezulis, A.H. Loneliness Among Young Adults During COVID-19 Pandemic: The Mediational Roles of Social Media Use and Social Support Seeking. J. Soc. Clin. Psychol. 2020, 39, 708–726.
Petrosyan, A. Worldwide Digital Population 2023. 2023. Available online: https://www.statista.com/statistics/617136/digital-population-worldwide/ (accessed on 17 March 2022).
Kursuncu, U.; Gaur, M.; Lokala, U.; Thirunarayan, K.; Sheth, A.; Arpinar, I.B. Predictive Analysis on Twitter: Techniques and Applications. In Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining; Springer: Berlin/Heidelberg, Germany, 2019; pp. 67–104.
Naslund, J.A.; Bondre, A.; Torous, J.; Aschbrenner, K.A. Social Media and Mental Health: Benefits, Risks, and Opportunities for Research and Practice. J. Technol. Behav. Sci. 2020, 5, 245–257.
Birnbaum, M.L.; Rizvi, A.F.; Correll, C.U.; Kane, J.M.; Confino, J. Role of Social Media and the Internet in Pathways to Care for Adolescents and Young Adults with Psychotic Disorders and Non-Psychotic Mood Disorders. Early Interv. Psychiatry 2017, 11, 290–295.
Naslund, J.A.; Aschbrenner, K.A.; Bartels, S.J. How People with Serious Mental Illness Use Smartphones, Mobile Apps, and Social Media. Psychiatr. Rehabil. J. 2016, 39, 364.
Giacco, D.; Palumbo, C.; Strappelli, N.; Catapano, F.; Priebe, S. Social Contacts and Loneliness in People with Psychotic and Mood Disorders. Compr. Psychiatry 2016, 66, 59–66.
Gowen, K.; Deschaine, M.; Gruttadara, D.; Markey, D. Young Adults with Mental Health Conditions and Social Networking Websites: Seeking Tools to Build Community. Psychiatr. Rehabil. J. 2012, 35, 245.
Torous, J.; Keshavan, M. The Role of Social Media in Schizophrenia: Evaluating Risks, Benefits, and Potential. Curr. Opin. Psychiatry 2016, 29, 190–195.
Berger, M.; Wagner, T.H.; Baker, L.C. Internet Use and Stigmatized Illness. Soc. Sci. Med. 2005, 61, 1821–1827.
Badcock, J.C.; Shah, S.; Mackinnon, A.; Stain, H.J.; Galletly, C.; Jablensky, A.; Morgan, V.A. Loneliness in Psychotic Disorders and Its Association with Cognitive Function and Symptom Profile. Schizophr. Res. 2015, 169, 268–273.
Rideout, V.; Fox, S. Digital Health Practices, Social Media Use, and Mental Well-Being Among Teens and Young Adults in the US; Providence: Renton, WA, USA, 2018.
Miller, B.J.; Stewart, A.; Schrimsher, J.; Peeples, D.; Buckley, P.F. How Connected Are People with Schizophrenia? Cell Phone, Computer, Email, and Social Media Use. Psychiatry Res. 2015, 225, 458–463.
Haker, H.; Lauber, C.; Rössler, W. Internet Forums: A Self-Help Approach for Individuals with Schizophrenia? Acta Psychiatr. Scand. 2005, 112, 474–477.
Naslund, J.A.; Aschbrenner, K.A.; Marsch, L.A.; Bartels, S.J. The Future of Mental Health Care: Peer-to-Peer Support and Social Media. Epidemiol. Psychiatr. Sci. 2016, 25, 113–122.
Tadesse, M.M.; Lin, H.; Xu, B.; Yang, L. Detection of Depression-Related Posts in Reddit Social Media Forum. IEEE Access 2019, 7, 44883–44893.
De Choudhury, M.; Kiciman, E.; Dredze, M.; Coppersmith, G.; Kumar, M. Discovering Shifts to Suicidal Ideation from Mental Health Content in Social Media. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, San Jose, CA, USA, 7–12 May 2016; pp. 2098–2110.
Barbier, G.; Liu, H. Data Mining in Social Media. In Social Network Data Analytics; Springer: Berlin/Heidelberg, Germany, 2011; pp. 327–352.
Brunette, M.; Achtyes, E.; Pratt, S.; Stilwell, K.; Opperman, M.; Guarino, S.; Kay-Lambkin, F. Use of Smartphones, Computers and Social Media Among People with SMI: Opportunity for Intervention. Community Ment. Health J. 2019, 55, 973–978.
Losada, D.E.; Crestani, F.; Parapar, J. Overview of eRisk at CLEF 2020: Early Risk Prediction on the Internet (Extended Overview). In Proceedings of the CLEF (Working Notes), Thessaloniki, Greece, 22–25 September 2020.
Moreno, M.A.; Jelenchick, L.A.; Egan, K.G.; Cox, E.; Young, H.; Gannon, K.E.; Becker, T. Feeling Bad on Facebook: Depression Disclosures by College Students on a Social Networking Site. Depress. Anxiety 2011, 28, 447–455.
Eichstaedt, J.C.; Smith, R.J.; Merchant, R.M.; Schwartz, H.A. Facebook Language Predicts Depression in Medical Records. Proc. Natl. Acad. Sci. USA 2018, 115, 11203–11208.
Kim, J.; Lee, D.; Park, E. Machine Learning for Mental Health in Social Media: Bibliometric Study. J. Med. Internet Res. 2021, 23, e24870.
O’Dea, B.; Wan, S.; Batterham, P.J.; Calear, A.L.; Paris, C.; Christensen, H. Detecting Suicidality on Twitter. Internet Interv. 2015, 2, 183–188.
Wongkoblap, A.; Vadillo, M.A.; Curcin, V. A Multilevel Predictive Model for Detecting Social Network Users with Depression. In Proceedings of the 2018 IEEE International Conference on Healthcare Informatics (ICHI), New York, NY, USA, 4–7 June 2018; pp. 130–135.
Zhang, T.; Schoene, A.M.; Ji, S.; Ananiadou, S. Natural Language Processing Applied to Mental Illness Detection: A Narrative Review. npj Digit. Med. 2022, 5, 46.
Prakash, A.; Agarwal, K.; Shekhar, S.; Mutreja, T.; Chakraborty, P.S. An Ensemble Learning Approach for the Detection of Depression and Mental Illness Over Twitter Data. In Proceedings of the 2021 8th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 17–19 March 2021; pp. 565–570.
Fodeh, S.; Li, T.; Menczynski, K.; Burgette, T.; Harris, A.; Ilita, G.; Rao, S.; Gemmell, J.; Raicu, D. Using Machine Learning Algorithms to Detect Suicide Risk Factors on Twitter. In Proceedings of the 2019 International Conference on Data Mining Workshops (ICDMW), Beijing, China, 8–11 November 2019; pp. 941–948.
Tong, L.; Liu, Z.; Jiang, Z.; Zhou, F.; Chen, L.; Lyu, J.; Zhang, X.; Zhang, Q.; Sadka, A.; Wang, Y.; et al. Cost-Sensitive Boosting Pruning Trees for Depression Detection on Twitter. IEEE Trans. Affect. Comput. 2022, 14, 1898–1911.
Su, C.; Xu, Z.; Pathak, J.; Wang, F. Deep Learning in Mental Health Outcome Research: A Scoping Review. Transl. Psychiatry 2020, 10, 116.
Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
Ghosh, S.; Anwar, T. Depression Intensity Estimation via Social Media: A Deep Learning Approach. IEEE Trans. Comput. Soc. Syst. 2021, 8, 1465–1474.
Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
Shatte, A.B.R.; Hutchinson, D.M.; Teague, S.J. Machine Learning in Mental Health: A Scoping Review of Methods and Applications. Psychol. Med. 2019, 49, 1426–1448.
Stöckli, S.; Hofer, D. Susceptibility to Social Influence Predicts Behavior on Facebook. PLoS ONE 2020, 15, e0229337.
Olmstead, K. The Challenges of Using Facebook for Research. 2015. Available online: https://www.pewresearch.org/fact-tank/2015/03/26/the-challenges-of-using-facebook-for-research/ (accessed on 20 March 2022).
Wojcik, S. Sizing Up Twitter Users. 2019. Available online: https://www.pewresearch.org/internet/2019/04/24/sizing-up-twitter-users/ (accessed on 20 March 2022).
Proferes, N.; Jones, N.; Gilbert, S.; Fiesler, C.; Zimmer, M. Studying Reddit: A Systematic Overview of Disciplines, Approaches, Methods, and Ethics. Soc. Media Soc. 2021, 7, 20563051211019004.
Yeskuatov, E.; Chua, S.-L.; Foo, L.K. Leveraging Reddit for Suicidal Ideation Detection: A Review of Machine Learning and Natural Language Processing Techniques. Int. J. Environ. Res. Public Health 2022, 19, 10347.
Pirina, I.; Çöltekin, Ç. Identifying Depression on Reddit: The Effect of Training Data. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, Brussels, Belgium, 31 October 2018; pp. 9–12.
Consulting, E. Mental Health AMA. 2021. Available online: https://www.reddit.com/r/IAmA/comments/oqqb8z/mental_health_ama/ (accessed on 20 March 2022).
Kim, J.; Lee, J.; Park, E.; Han, J. A Deep Learning Model for Detecting Mental Illness from User Content on Social Media. Sci. Rep. 2020, 10, 11846.
Thorstad, R.; Wolff, P. Predicting Future Mental Illness from Social Media: A Big-Data Approach. Behav. Res. Methods 2019, 51, 1586–1600.
Chandrasekharan, E.; Samory, M.; Jhaver, S.; Charvat, H.; Bruckman, A.; Lampe, C.; Eisenstein, J.; Gilbert, C. The Internet’s Hidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso, and Macro Scales. Proc. ACM Hum.-Comput. Interact. 2018, 2, 1–25.
Del Rio-Chanona, R.M.; Hermida-Carrillo, A.; Sepahpour-Fard, M.; Sun, L.; Topinkova, R.; Nedelkoska, L. Mental Health Concerns Precede Quits: Shifts in the Work Discourse During the COVID-19 Pandemic and Great Resignation. EPJ Data Sci. 2023, 12, 49.
Bailey, E.; Boland, A.; Bell, I.; Nicholas, J.; La Sala, L.; Robinson, J. The Mental Health and Social Media Use of Young Australians During the COVID-19 Pandemic. Int. J. Environ. Res. Public Health 2022, 19, 1077.
Valdez, D.; ten Thij, M.; Bathina, K.; Rutter, L.; Bollen, J. Social Media Insights Into US Mental Health During the COVID-19 Pandemic: Longitudinal Analysis of Twitter Data. J. Med. Internet Res. 2020, 22, e21418.
Lee, Y.; Jeon, Y.J.; Kang, S.; Shin, J.I.; Jung, Y.-C.; Jung, S.J. Social Media Use and Mental Health During the COVID-19 Pandemic in Young Adults: A Meta-Analysis of 14 Cross-Sectional Studies. BMC Public Health 2022, 22, 995.
Baumgartner, J.; Zannettou, S.; Keegan, B.; Squire, M.; Blackburn, J. The Pushshift Reddit Dataset. Proc. Int. AAAI Conf. Web Soc. Media 2019, 14, 830–839.
Poudel, A.K.; Weninger, T. Navigating the Post-API Dilemma: Search Engine Results Pages Present a Biased View of Social Media Data. arXiv 2024, arXiv:2401.15479.
Nikhileswar, K.; Vishal, D.; Sphoorthi, L.; Fathimabi, S. Suicide Ideation Detection in Social Media Forums. In Proceedings of the 2021 2nd International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 7–9 October 2021; pp. 1741–1747.
Garg, M.; Saxena, C.; Krishnan, V.; Joshi, R.; Saha, S.; Mago, V.; Dorr, B.J. CAMS: An Annotated Corpus for Causal Analysis of Mental Health Issues in Social Media Posts. arXiv 2022.
Beauvais, T. Hybrid Representative Sampling of Social Media. Bull. Sociol. Methodol. Methodol. Sociol. 2023, 160, 57–70.
Benton, A.; Mitchell, M.; Hovy, D. Multitask Learning for Mental Health Conditions with Limited Social Media Data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Volume 1. Available online: http://www.aclweb.org/anthology/E17-1015 (accessed on 20 March 2022).
Birnbaum, M.L.; Ernala, S.K.; Rizvi, A.F.; De Choudhury, M.; Kane, J.M. A Collaborative Approach to Identifying Social Media Markers of Schizophrenia by Employing Machine Learning and Clinical Appraisals. J. Med. Internet Res. 2017, 19, e289.
Zhou, Y.; Zhan, J.; Luo, J. Predicting Multiple Risky Behaviors via Multimedia Content. In Proceedings of the International Conference on Social Informatics, Oxford, UK, 13–15 September 2017; Springer International: Cham, Switzerland, 2017.
Huang, X.; Li, X.; Liu, T.; Chiu, D.; Zhu, T.; Zhang, L. Topic Model for Identifying Suicidal Ideation in Chinese Microblog. In Proceedings of the Pacific Asia Conference on Language, Information and Computation, Shanghai, China, 30 October–1 November 2015; pp. 553–562. Available online: http://www.aclweb.org/anthology/Y15-1064 (accessed on 11 November 2023).
Homan, C.M. Toward Macro-Insights for Suicide Prevention: Analyzing Fine-Grained Distress at Scale. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology, Baltimore, MD, USA, 27 June 2014; p. 107.
health.gov.au. Mental Health and Suicide Prevention. 2022. Available online: https://www.health.gov.au/health-topics/mental-health-and-suicide-prevention/what-were-doing-about-mental-health (accessed on 15 March 2022).
Skeem, J.L.; Miller, J.D.; Mulvey, E.; Tiemann, J.; Monahan, J. Using a Five-Factor Lens to Explore the Relation Between Personality Traits and Violence in Psychiatric Patients. J. Consult. Clin. Psychol. 2005, 73, 454.
Krueger, R.F. Personality Traits in Late Adolescence Predict Mental Disorders in Early Adulthood: A Prospective-Epidemiological Study. J. Pers. 1999, 67, 39–65.
Preoţiuc-Pietro, D.; Eichstaedt, J.; Park, G.; Sap, M.; Smith, L.; Tobolsky, V.; Schwartz, H.A.; Ungar, L. The Role of Personality, Age, and Gender in Tweeting About Mental Illness. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA, 5 June 2015; pp. 21–30.
Drake, R.E.; Brunette, M.F. Complications of Severe Mental Illness Related to Alcohol and Drug Use Disorders. In Recent Developments in Alcoholism: The Consequences of Alcoholism Medical, Neuropsychiatric, Economic, Cross-Cultural; Springer: New York, NY, USA, 1998; pp. 285–299.
Skogen, J.C.; Sivertsen, B.; Lundervold, A.J.; Stormark, K.M.; Jakobsen, R.; Hysing, M. Alcohol and Drug Use Among Adolescents: And the Co-Occurrence of Mental Health Problems. Ung@hordaland, a Population-Based Study. BMJ Open 2014, 4, e005357.
Lilley, C.; Ball, R.; Vernon, H. The Experiences of 11–16 Year Olds on Social Networking Sites; National Society for the Prevention of Cruelty to Children (NSPCC): London, UK, 2014.
Swanson, J.D.; Wadhwa, P.M. Developmental Origins of Child Mental Health Disorders. J. Child Psychol. Psychiatry 2008, 49, 1009.
Teruel, M.; Cardellino, C.; Cardellino, F.; Alemany, L.A.; Villata, S. Increasing Argument Annotation Reproducibility by Using Inter-Annotator Agreement to Improve Guidelines. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018.
Seibold, C.; Jaus, A.; Fink, M.A.; Kim, M.; Reiß, S.; Herrmann, K.; Kleesiek, J.; Stiefelhagen, R. Accurate Fine-Grained Segmentation of Human Anatomy in Radiographs via Volumetric Pseudo-Labeling. arXiv 2023.
Zhu, J.; Yalamanchi, N.; Jin, R.; Kenne, D.; Phan, N. Investigating COVID-19’s Impact on Mental Health: Trend and Thematic Analysis of Reddit Users’ Discourse. J. Med. Internet Res. 2023, 25, e46867.
Alambo, A.; Padhee, S.; Banerjee, T.; Thirunarayan, K. COVID-19 and Mental Health/Substance Use Disorders on Reddit: A Longitudinal Study. In Pattern Recognition. ICPR International Workshops and Challenges; Del Bimbo, A., Cucchiara, R., Sclaroff, S., Farinella, G.M., Mei, T., Bertini, M., Escalante, H.J., Vezzani, R., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12662.
Bak, M.; Chiu, C.; Chin, J. Mental Health Pandemic During the COVID-19 Outbreak: Social Media As a Window to Public Mental Health. Cyberpsychol. Behav. Soc. Netw. 2023, 26, 346–356.
Jones, E.A.K.; Mitra, A.K.; Bhuiyan, A.R. Impact of COVID-19 on Mental Health in Adolescents: A Systematic Review. Int. J. Environ. Res. Public Health 2021, 18, 2470.

Figure 1. Number of sample posts per subreddit. Source: authors’ compilation, based on data collected from Reddit. Data reflects samples gathered in January 2020, January 2021, and January 2022.

Figure 2. Total number of extracted posts per subreddit. Source: authors’ compilation, based on data extracted from selected subreddits on Reddit from January 2019 to August 2022.

Figure 3. Mental health root cause categories by Healthdirect Australia. Source: categories as defined by Healthdirect Australia [11]; visualization compiled by the authors.

Figure 4. Data annotation process. Source: authors’ depiction of the data annotation process, see Data Annotation section of this paper.

Figure 5. Heat map of average posts per subreddit. Source: authors’ creation, visualizing the average post count per subreddit in order to identify user engagement trends between January 2019 and August 2022.

Table 1. Mental health subreddits with the largest memberships.

Subreddits	Members
r/depression	998,608
r/anxiety	647,120
r/SuicideWatch	452,111
r/mentalhealth	428,742
r/lonely	361,023

Source: derived from Reddit, accessed November 2023. Membership numbers are subject to change.

Table 2. Number of sample posts.

Subreddits	Jan. 2020	Jan. 2021	Jan. 2022
r/depression	17,516	18,706	14,953
r/anxiety	5750	7931	7605
r/SuicideWatch	8161	15,135	14,309
r/mentalhealth	4700	9098	9644
r/lonely	3061	4382	5215

Source: authors’ compilation, based on data collected from Reddit, spanning January 2020, January 2021, and January 2022.

Table 3. Average post length example.

Title	Post Text
lonely as hell	“I feel like as I am a pathetic loser. I don’t feel like if any of my friend care about me. I’m always the one sitting alone, and no one invite me to sit with them or take part in anything, and I know they all prefer each other over me. I don’t know what wrong am I doing. I want to live my life without worries. There must be something wrong with me I feel like a freak. It’s making my depression worse. I am always on the verge of tears. I don’t know what to do with myself I have no one.”

Source: Example post selected by authors to represent average length from data collected on Reddit, encompassing the time period from January 2019 to August 2022.

Table 4. Root cause labels with rationales.

	Root Cause Label	Post	Rationale
P1	Personality	Essence of human life is love is not it?unfortunately i am unable to feel it. I feel like a stranger. I do not belong anywhere. Everything is same i want diversity. I do not have any moral values or guilt. I strongly desire self actualization [Some text omitted for brevity]	Low self-esteem
P2	Drug and alcohol	I use cannabis every day. I look forward to it because I feel so much better/content when I use. However some say it’s actually making my depression and anxiety worse so it’s like a vicious cycle [Some text omitted for brevity]	Drug consumption
P3	Early life	When i was a kid i used to be really skinny and too short for my age. I got bullied a lot because of that. I’m 20 y.o now, and i think i have distorted view of my own appearance. I keep forgetting that i’m the size of an adult nowadays, and i feel like i’m skinnier and weaker than i actually am. I still feel like a little boy. I keep feeling like i’m still not big enough. [Some text omitted for brevity]	Bullying, poor body image since early life
P4	Trauma and stress	No one LITERALLY …. NO ONE has talked or even invited me to hang out for New Years. I just looked at my mates story and it’s my 3 “closest friends” hanging out and having fun. Why does everyone seem to hate me? I’m not “too nice” or a cunt to people, I don’t make a fuss and I just chill. All my life I have seem to ignored for some reason and it’s driving me fucking insane! F… everything right now. [Some text omitted for brevity]	Loneliness, social rejection
P5	Trauma and stress	I’m miserable. Things went really bad when I was 16 and my parents split up and my cousin killed himself within a month. my parents splitting up was and is still extremely bitter, their hate for each other consumes them and they both want each other dead. When they say it out loud it’s normal for me now but it feels so wrong. My dad says, he’s gonna make her homeless. So I’m going to be paying £600 a month when I will only be earning about £700 or less. I had a weekend job which I left because it was too stressful working. it is all so wrong [Some text omitted for brevity]	Relationship issues, financial disorganised attachment, bitterness, sadness
P6	Trauma and stress	I don’t understand what is wrong with me. My family makes me feel wrong. According to them everything i do is selfish. I don’t need them though. They don’t understand me, My family hates me. I’m their least favourite. Always have been. I wish there was a way for them to share how my head feels so they could understand why i am how i am. I hate living. There is nothing i want to do more than die. Someone who wants to care. I need someone please [Some text omitted for brevity]	Relationship issues, anger, sadness

Source: authors’ analysis of selected Reddit posts with root cause labels and rationales based on annotation guidelines developed by a domain expert contributor (see the Acknowledgments section) for this research. Text excerpts are underlined to illustrate the reasoning behind label assignments.

Table 5. Potential keywords for each label.

Label	Potential Keywords
Personality	Perfectionism, low self-esteem, self-critical, negative, worry, pessimistic, impatient, impulsive, indecisive, disrespectful, aggressive, arrogant, emptiness
Drug and alcohol	Alcohol, drugs, substance abuse, addiction, dependence, overdose, intoxication, withdrawal, rehab, detox, relapse, sobriety
Early life	Childhood, upbringing, parenting, neglect, abuse, trauma, environment, family, poverty, divorce, bullying, school
Trauma and stress	Trauma, stress, PTSD, anxiety, depression, violence, abuse, accident, disaster, loss, grief, crisis

Source: potential keywords, as determined by a domain expert contributor (see the Acknowledgments section) to this research. These keywords were utilized to assist in the annotation process.
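
The way the keywords in Table 5 could assist annotation can be illustrated with a small script. The following is a minimal sketch, not the authors' implementation: it assumes simple case-insensitive whole-word matching, and the function name `suggest_labels` is hypothetical. Only the keyword lists are taken from Table 5.

```python
import re
from collections import Counter

# Keyword lists copied from Table 5. The matching strategy below is an
# assumption made for illustration, not the procedure used in the study.
KEYWORDS = {
    "Personality": [
        "perfectionism", "low self-esteem", "self-critical", "negative",
        "worry", "pessimistic", "impatient", "impulsive", "indecisive",
        "disrespectful", "aggressive", "arrogant", "emptiness",
    ],
    "Drug and alcohol": [
        "alcohol", "drugs", "substance abuse", "addiction", "dependence",
        "overdose", "intoxication", "withdrawal", "rehab", "detox",
        "relapse", "sobriety",
    ],
    "Early life": [
        "childhood", "upbringing", "parenting", "neglect", "abuse",
        "trauma", "environment", "family", "poverty", "divorce",
        "bullying", "school",
    ],
    "Trauma and stress": [
        "trauma", "stress", "ptsd", "anxiety", "depression", "violence",
        "abuse", "accident", "disaster", "loss", "grief", "crisis",
    ],
}


def suggest_labels(post: str) -> list:
    """Rank labels by the number of distinct Table 5 keywords found in the post."""
    text = post.lower()
    hits = Counter()
    for label, words in KEYWORDS.items():
        for word in words:
            # Whole-word (or whole-phrase) match on the lower-cased text.
            if re.search(r"\b" + re.escape(word) + r"\b", text):
                hits[label] += 1
    # Labels with no matching keyword are omitted from the result.
    return hits.most_common()


if __name__ == "__main__":
    print(suggest_labels("The abuse started in childhood"))
```

Because some keywords (for example "abuse" and "trauma") appear under more than one label, a script like this can only produce candidate labels for a human annotator to confirm; it cannot assign a final label.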

Table 6. Inter-annotator agreement (IAA) contingency table.

		Annotator 1
	Label	Personality	Trauma and Stress	Early Life	Drug and Alcohol	Total
Annotator 2	Personality	157	23	16	4	200
	Trauma and stress	15	185	0	0	200
	Early life	0	15	185	0	200
	Drug and alcohol	0	5	0	195	200
	Total	172	228	201	199	800

Source: authors’ creation, detailing concordance and discordance between the two annotators across four key mental health root cause categories. This table serves to quantify inter-annotator agreement, as part of the data validation process.
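
The counts in Table 6 can be condensed into a single agreement statistic. The sketch below computes Cohen's kappa, a common chance-corrected measure for two annotators; the choice of kappa and the function name `cohen_kappa` are assumptions made for illustration, since this section does not state which statistic was used. Only the counts are taken from Table 6.

```python
# Rows: Annotator 2, columns: Annotator 1, in the order
# Personality, Trauma and stress, Early life, Drug and alcohol (Table 6).
CONFUSION = [
    [157, 23, 16, 4],
    [15, 185, 0, 0],
    [0, 15, 185, 0],
    [0, 5, 0, 195],
]


def cohen_kappa(matrix):
    """Cohen's kappa for a square contingency table of two annotators."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: product of the marginal proportions, summed over labels.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(f"kappa = {cohen_kappa(CONFUSION):.2f}")  # prints "kappa = 0.87"
```

With these counts, the observed agreement is 722/800 = 0.9025 and the chance agreement is 0.25, so kappa = (0.9025 − 0.25)/(1 − 0.25) = 0.87.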

Table 7. Annotation Disagreement.

Annotator-1	Annotator-2	Post Text
Early life	Trauma and stress	“It’s been over a decade since my sister passed away and the next year prior to my sister’s passing, my mom died. Two of them because of cancer. I was just 11 years old at that time. and now why am I still grieving? I am an adult now. Every time I watch shows or even read something about hospitals, diagnosis, griefs, I start cryinggg. Everything that reminds me of how they suffer, triggers this feeling. I just need to get out of this already. I have been grieving for so long, i am so tired. [Some text omitted for brevity]”

Source: authors’ compilation, presenting a Reddit post with differing labels assigned by two annotators to illustrate an instance of annotation disagreement. This table is part of the analysis to understand and address the variability in annotator perspectives.


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Rani, S.; Ahmed, K.; Subramani, S. From Posts to Knowledge: Annotating a Pandemic-Era Reddit Dataset to Navigate Mental Health Narratives. Appl. Sci. 2024, 14, 1547. https://doi.org/10.3390/app14041547
