1. Introduction
The biopsychosocial approach recognizes that health and illness are a result of complex interactions between psychological, biological, and social factors. This viewpoint (biopsychosocial approach) considers health to be a condition of mental, physical, and social well-being, rather than simply the absence of illness [
1]. The biopsychosocial approach acknowledges that social, environmental, cultural, and socioeconomic factors, as well as biological aspects such as genetics and physiology, and psychological aspects such as emotions, attitudes, and behaviors, all have an impact on physical health. In contrast to the common misconception that mental health is merely the absence of mental illness, the biopsychosocial approach to mental health and illness views mental health as a state of general psychological well-being. Mental illness is a complicated and varied phenomenon that is influenced by a range of psychological, biological, and social factors. It is frequently characterized by changes in thoughts, emotions, behaviors, and social functioning, all of which significantly impair daily functioning and cause severe discomfort [
2].
Mental illnesses refer to “a wide range of mental health conditions that affect mood, thinking, behavior, and relationships with others”, and pose devastating threats to personal well-being [
3]. Mental illness can range from a variety of conditions, including depression and personality disorders such as borderline personality disorder (BPD), bipolar disorder, schizophrenia, anxiety, and drug or alcohol use disorders [
4]. In the United Kingdom (UK), mental diseases cost around GBP 105 billion annually. Mental diseases are the biggest cause of sickness absence in the UK, accounting for 70 million sick days lost each year. A total of 44% of applicants for employment and support assistance have a mental condition as their primary diagnosis [
5,
6]. According to predictions, mental illness could have a global economic impact of more than USD 5 trillion by 2030 [
7]. It is estimated that more than 792 million people of all ages worldwide suffer from mental health problems, according to the World Health Organization’s (WHO) Fact Sheet published in 2017 [
8].
Szasz’s view was that mental illnesses are not medical conditions, but difficulties that arise from social, cultural, and psychological influences. Szasz contended that classifying individuals as mentally ill and treating them, as well as using other medical measures, are methods of social regulation that restrict personal independence and self-determination. In lieu of this, Szasz supported a humanistic and libertarian strategy for mental health, emphasizing individual accountability, free will, and self-governance. He believed that people should have the freedom to live their lives as they wish, without being identified as mentally ill or being forced into psychiatric treatment against their will [
9,
10].
Communication is an essential part of the community, and textual communication is presently one of the most common ways to express ourselves. People utilize social networks to explain their sentiments, mental states, goals, and desires, as well as to document their activities or routines [
11]. According to the latest survey reports, more than half of the population (59%) uses social media, and time spent on social media accounts for two-thirds of total internet usage [
12]. People often share health information online to gain experience-based knowledge, find emotional support, and work together towards their health objectives. This includes receiving information on specific therapies or behaviors and collaborating on related decisions. Individuals with mental illnesses are more prone to expressing themselves online, whether through blogging, social networking, or public forums [
13]. As people write more digitally, an enormous amount of data can be analyzed automatically to infer meaningful information about one’s well-being, such as mental health conditions. Using social data provides an additional advantage in reducing the stigma associated with mental health screening, as such techniques can create new chances for early diagnosis and intervention, as well as fresh insights into the study of the causes and processes of mental health [
12,
13].
Partially due to the stigma surrounding mental illness and the frequent difficulty in receiving appropriate care through the healthcare system, people are increasingly turning to social media to express their problems and find emotional support [
14,
15]. As the number of people suffering from mental illness is growing, coupled with the pressure imposed on health and social care systems, social media platforms have evolved into a source of “in-the-moment” daily conversation, with subjects such as mental health and well-being being discussed. This presents an interesting research opportunity to better understand and classify various types of mental health through analyses of social media data (Twitter and Reddit) [
16].
Traditionally, monitoring mental health was carried out by asking carefully crafted questions to a random sample of the population in order to conduct mental health surveys. However, high-quality survey data require a significant investment of effort, time, and money for survey designers, interviewers who gather data, and participants who voluntarily provide answers [
17]. Relying on self-reported statistics (survey data) is also problematic since mental illness is prone to bias.
Social media data provide the following benefits over surveys: (i) Social media data are very inexpensive to gather in comparison to the price of conventional sample surveying, especially when comparing the prices of telephone surveys (phone calls plus interviewee expenses), in-person surveys (interviewee expenses plus probable travel fees), and internet interviews (postage and printing) [
18,
19]. (ii) Traditional surveys only have a limited ability to observe the respondent’s actual behaviors and can only ask the participant about their behaviors; the relationship between responses and real behaviors is often weak. On the other hand, social media offers a plethora of data regarding user behavior because social media postings are made outside of the context of the survey; in other words, social media data offer a record of a user’s actual behavior [
18,
19]. (iii) Social media data provide potentially large samples compared to survey data (on Twitter, more than 500 million tweets are posted every day, and around 1.12 million posts are posted on Reddit) [
18], making it possible to access queries about more specific “subgroups” (for example, issues affecting a certain geographical area) [
19].
Furthermore, although previous research has demonstrated the feasibility of examining and predicting mental health from social media activities, many such studies are platform-specific. Single-platform analysis may restrict findings and actions to a subset of the target population [
20]. Although useful information on a certain behavior or themed community may be available on one platform, such knowledge may not be available on another [
21,
22,
23]. In addition, there is evidence that shows that individuals behave differently on various social media platforms due to the platform’s distinctive social etiquette and design features of the platform [
24].
The general objective of a study states what is expected to be achieved in general terms. Specific objectives are:
To examine the linguistic characteristics and patterns of different social media activities associated with different mental health groups;
To investigate whether a machine learning model can be developed to categorize a user’s social media activity patterns into different mental illness groups;
To understand if a machine learning model trained on a specific social medium can generalize to other social media platforms.
2. Related Work
An increasing number of individuals are using social media platforms such as Twitter, Facebook, Reddit, and Instagram to express themselves and communicate with others in real-time. As a result, vast amounts of social data are created, containing important information about people’s interests, emotions, and behaviors [
7]. Social media is transforming how people self-identify as having a mental health condition and how they interact with others who have had similar experiences, often asking about treatment, and side effects, and reducing feelings of stigma and loneliness. The study of prominent social media platforms such as Reddit and Twitter may provide insight into what patients are most concerned about (more so than their physicians) [
24]. Furthermore, this form of large-scale user-generated content (social media) provides a unique opportunity to study mechanisms underlying mental health disorders. For example, research on children and adolescents has indicated that frequent daily use of social networking platforms is independently associated with poor self-rating of mental health, higher levels of psychological distress, and suicidal thoughts [
25,
26].
Beyond simple features such as frequency of usage, researchers have now employed more sophisticated methods to extract in-depth usage pattern features such as linguistic style, affective content of the posts, and the interaction pattern as characterized via a social graph to predict mental health conditions associated with specific social media posts [
26]. The language used in Reddit forums dedicated to mental health has been studied to discover linguistic traits that might be useful in creating future applications to detect individuals who require immediate assistance [
17]. Users’ self-disclosure in Reddit mental illness forums has been studied to create language models that explain social support, which has been found to contain informational, emotional, instrumental, and prescriptive information. Even though Redditors are not paid for their work, the feedback expressed in the comments is of remarkably high quality; it can be both emotional and useful, as well as informative. This is a crucial difference compared to social media platforms such as Twitter, where sharing health information is frequently broadcast or an emotional outburst and not always about seeking accurate or detailed information about diagnosis and treatment [
26,
27].
Social media data have been identified as a resource for gaining knowledge about mental illnesses. For example, Twitter data have been used to develop classifiers that can identify individuals who are depressed [
27]. Coppersmith et al. (2015) used Twitter data to identify linguistic characteristics that may be used to classify Twitter users into those suffering from mental illness and those who do not [
25]. Dinu et al. (2021) used Reddit data to classify various mental illness groups based on users’ posts rather than individual users or groups of users. Supervised machine learning, which is used for categorization or prediction modeling, offers the advantage of accounting for complicated interactions between variables that were previously unknown [
28]. As datasets become larger and variables become more complex, machine learning techniques may become a useful tool in psychiatry to correctly detangle variables linked with patient outcomes [
29,
30].
Goffman et al. (2009) categorized three types of stigmas present in our society: physical, moral, and tribal. The first type, physical stigma, is based on visible or physical differences, including skin color, disability, and disfigurement. These types of stigmas are easily identified and often lead to social exclusion and discrimination. The second type, moral stigma, is rooted in perceived moral failures or character flaws, such as addiction, criminal behavior, or mental illness. These prejudices are associated with negative stereotypes and often cause social rejection and marginalization. Lastly, tribal stigma is derived from being a member of a specific group or community, such as racial or ethnic minorities, religious groups, or non-heterosexual orientations. Tribal stigmas are a result of social norms and cultural values and can lead to discrimination and exclusion from mainstream society [
31].
Stigma is a social construct that denotes a negative perception or attitude towards individuals or groups who deviate from social norms due to factors such as race, gender, sexual orientation, and physical or mental health [
32]. Throughout history, stigma has been associated with various social issues. For instance, in ancient times, people with physical disabilities or deformities were often believed to be cursed or possessed by evil spirits. Similarly, individuals with mental health issues were thought to be demon-possessed during the Middle Ages and were subjected to cruel treatments such as exorcism [
33]. Initially, researchers viewed stigma as an individual problem, where stigmatized individuals were seen as having a personal deficiency. However, later research adopted a social model, recognizing that stigma is often a result of broader societal attitudes and structural inequalities. This perspective emphasizes the importance of addressing social and structural barriers to reduce stigma [
32].
Several researchers have used machine learning on social media data and healthcare for classifying various mental illness G groups. For example, Gkotsis et al. (2016) [
16] collected data from 11 different mental health subreddits and developed a multiclass classification model. If a user suffers from several mental health issues, such as anxiety and depression, the user can submit posts in multiple subreddits. If the model is trained on posts from users with multiple symptoms, the multiclass classification model may suffer from being noisy [
16]. Kim et al. (2020) collected data across six mental and health-related subreddits and developed six binary classification models for each mental illness (anxiety, autism, bipolar disorder, BPD, depression, and schizophrenia) and utilized pre-trained word vectors rather than random initialization, which yields superior results; however, the classification model suffers from noisy data if the model is trained with the posts of users with multiple symptoms [
34].
Numerous studies have shown that language usage, social expressiveness, and interaction are important indicators of mental health. The Linguistic Inquiry Word Count (LIWC), a validated technique for the psychometric evaluation of language data [
35], has been used extensively to analyze linguistic features associated with various mental illnesses. For example, De Choudhury, M. et al. (2015) [
26] gathered Twitter posts from individuals who had been diagnosed with depression and used the Linguistic Inquiry and Word Count (LIWC) to examine the linguistic and emotional characteristics of the tweets [
26]. Coppersmith et al. (2015) [
25] also emphasized that associated language patterns, such as the use of first-person pronouns, negative emotions, and angry words, had a significant relationship with mental problems. Several studies examine the relationship between language usage and mental health [
25]. According to Aaron Beck et al.’s (1967) cognitive theory of depression, depressed people tend to view themselves and their surroundings negatively. They frequently use negative terms and first-person pronouns while expressing themselves (I, or me) [
25]. Rude et al. (2004) [
36] examined linguistic patterns of essays written by college students who were depressed, had been depressed in the past, and had never been depressed. His findings showed that depressed students used fewer positive emotion words and more negative valence words [
36].
Despite growing interest in the detection of mental illness, present efforts have mostly been limited to research on a single platform, with less emphasis given to generalizability across multiple social media platforms. The reasons why it is important to check if the model can be generalized across different platforms can be summarized as follows:
(1) First, even the most widely used social media platforms are not used by everyone, and most platforms only reach small segments of the population. As an illustration, 25% of US adults claim to use Twitter [
37], while 18% of US adults claim to use Reddit. Social media users do not confine themselves to a single social media platform; instead, users efficiently navigate across multiple platforms to express themselves by exploiting variations among these platforms [
38].
Furthermore, social media data are skewed in terms of demographics. Twitter, for example, has a 55% overall adoption rate in the United States. However, approximately 38.5% of those aged 25 to 34 use Twitter, with the great majority using it multiple times per day. Furthermore, 57% of respondents claimed their primary motivation for accessing Twitter is to increase their understanding of current events. Reddit has a 39% overall adoption rate in the United States. However, roughly 64% of those aged 18 to 29 use Reddit, with the great majority using it multiple times per day. A total of 72% of respondents claimed their primary motivation for accessing Reddit is for entertainment [
39]. There are socio-demographic biases associated with social media and these must be thoroughly investigated before drawing broad generalizations about the broader population. For example, although Reddit has a large user base with a wide range of socio-demographics, with an estimated 6% of internet users active on Reddit, there is a gender bias (8% of male internet users compared to 4% of female). With an estimated 18.7% of internet users active on Twitter, there is a bias towards male users (12.3% of male internet users compared to 6.4% of females) and a bias toward younger users, with a higher percentage of users aged 18–49 than those over 50, on these platforms [
40].
(2) Different social media platforms may feature different usage patterns. For instance, Twitter may provide more frequent updates on an event, whereas Reddit may provide more critical analysis regarding the same events. Furthermore, Twitter may discuss political news and current events more rigorously than Reddit, but Reddit may be a better choice for news updates and entertainment discussions. Social media is mostly driven by normal users; therefore, a platform’s suitability depends on how the corresponding users utilize it. For example, if a considerable number of people discuss an event, then the event is important. In an emergency, receiving frequent updates is critical; therefore, a platform with active users is better suited for this type of event. Thus, diverse characteristics of the content published, user posting behavior, and post-spreading patterns across these platforms can prove to be useful for meeting certain requirements such as exploring important events, live updates, or analyzing news stories [
41].
Therefore, in this study, we investigate two popular social media platforms, Reddit, and Twitter. Specifically, we are interested in examining how different/similar linguistic characteristics and patterns of activities are associated with different mental health groups in the two platforms. Following this, we study how machine learning models trained on one platform are generalizable to another.
5. Discussion
- 1.
To examine the linguistic characteristics and patterns of different social media activities associated with different mental health groups.
The mental illness lexicon on social media platforms such as Twitter and Reddit was analyzed using LIWC. The similarities are as follows: (i) both Twitter and Reddit groups largely communicated using negative sentiments, which is expected given the nature of the discussion topics [
54,
55,
56]; (ii) the discussions (encouragement words) (positive sentiment) on both social media platforms (Twitter and Reddit) focused on how people obtain support from family and friends and discussed problems regarding mental illness, which is a common topic of discussion among individuals suffering from mental illness [
54]. However, there are some differences. Computing the mean and standard deviation across LIWC indicators shows that the differences between Reddit and Twitter are statistically significant when comparing a chosen set of LIWC categories (clout, analytical thinking, authenticity, tone, pronoun, death, emotion). This could be due to the restriction on words, or the depth of topics discussed on Twitter as compared to Reddit. According to the literature, length influences writing style, exhibiting specific linguistic aspects; for example, length limits disproportionately retain negative emotions, adverbs, and articles, and conjunctions have the highest probability of being omitted [
57].
- 2.
To investigate whether a machine learning model can be developed to categorize a user’s social media activity patterns into different mental illness groups.
In this study, the machine learning model (CNN + word2vec) has been successfully used to classify various mental illnesses using social media platforms (Twitter and Reddit). We are comparing the proposed model, which was trained and tested on Reddit, with Kim et al.’s (2020) model [
34]. Our proposed approach has a greater classification effect, as well as a higher average recall rate, F1 score, and accuracy rate. Kim, J., et al. (2020) reported that the model achieved remarkable accuracy. However, our suggested model is simpler and has a lower level of complexity analysis, achieving a greater level of accuracy (2.11%). A critical step was taken to optimize the model by tuning the hyperparameter. The Reddit model has the lowest F1-score on autism, bipolar disorder, and schizophrenia, which is due to the class imbalance problem [
34].
The reasons why the performance of Reddit is better compared to that of Twitter are as follows: (i) Reddit communities are monitored by individuals who volunteer to be moderators. Moderating privileges include the ability to delete posts and comments from the community. A moderated Reddit post might become a safe space to discuss topics related to mental health. Reddit users may find it more comforting that a moderator may remove harsh or harmful messages or individuals from the subreddit [
58,
59]. (ii) Because there are more posts with promotional content on Twitter, tweets about mental illness symptoms are sometimes diluted by other topics such as fitness blogs or meditation seminars. [
58,
59].
- 3.
To understand if a machine learning model trained on a specific social media platform can be generalized to other social media platforms.
We investigated machine learning classifiers (CNN and word2vec) to provide a generalized method for classifying various mental illnesses using social media data (Twitter and Reddit). We trained and tested machine learning models using labeled Twitter datasets, and then compared the performance of our trained models to other social media sources using non-Twitter datasets (Reddit). Despite the differences in linguistic characteristics between Reddit and Twitter, our machine learning models could generalize the model between social media platforms (Twitter and Reddit). We compared the results of testing the Twitter model on Reddit and testing the Reddit model on Twitter to better understand how well each model generalizes beyond the platform. The Reddit model seemed to have a better performance than the Twitter model, indicating that the Reddit model could generalize better compared to another social media platform (Twitter). The Reddit model tested on Twitter had F1 scores ranging from the twenties to sixties percent across various mental health conditions and had the lowest accuracy for depression and anxiety. The Twitter model tested on Reddit had F1 scores ranging from thirties to sixties percent across various mental illnesses and had the lowest accuracy for depression, which is due to the class imbalance problem [
34].
The reasons why Reddit is better compared to Twitter are as follows: (i) We believe that the reason the Reddit model is better at generalization is due to the interaction structure of Reddit making it ideal for seeking expert opinions. Twitter may provide more frequent updates on an event, whereas Reddit may provide more critical analysis regarding the same events. Furthermore, Twitter users tend to discuss political news and current events more so than Reddit users; however, Reddit may be a better choice for news updates and entertainment discussions [
41]. Twitter is suitable for obtaining frequent updates during an emergency or any live event. Reddit’s unrestricted post length plays a vital function in providing us with additional background information. The reasons why Twitter may be better compared to Reddit are as follows: (ii) In contrast to Reddit, imposing a post length constraint on Twitter helps to reduce biased and extreme viewpoints. Furthermore, an event on Twitter can be tracked for a longer period, which might be valuable for analyzing its evolution [
41]. (iii) Twitter’s high negative associativity suggests that information dispersion is heavily influenced by the user who created the tweet. When a user has many followers, their post is more likely to spread quickly and widely. Reddit users do not have close communities, as seen by a low clustering coefficient, and a small number of related components; as a result, information spreads slowly on Reddit on average [
41]. Thus, there are significant differences between these two platforms (Twitter and Reddit) in terms of user behavior as well as their conversation and posting patterns. However, given the available abundance of these platforms, each with its uniqueness in presentation, spreading patterns, and user interests, a comparative analysis of their efficacy would be beneficial [
41].
Our long-term objective is not only to provide aid to clinical researchers and policymakers in addressing communications on social media but also to aid individuals suffering from mental illness. To allocate resources more effectively and provide help where it is most needed, the authorities (policymakers, healthcare professionals, etc.) must keep track of the population’s mental health over time and across different geographic regions. Our findings might be beneficial for policymakers, academics, and healthcare professionals interested in understanding the occurrence of different mental health conditions and concerns over time and in different locations, and hence in formulating better policies, recommendations, and health promotion activities in response to address the issue [
16].