4.3.3. Discussion on Results Obtained in Statistical Analysis
Analysis of Normalized Correlations among LIWC Categories
In this section, we present the insights gained by analyzing the dataset according to the methodology of normalized correlations among LIWC categories described in Section 4.3.2. We analysed each risk group separately to either confirm existing knowledge or find new, or even unexpected, knowledge in texts containing suicidal declarations. Each potential finding was checked against a number of specific examples from the dataset to either confirm or reject the hypothesis.
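The kind of comparison discussed throughout this section can be sketched as follows. This is only a minimal illustration, assuming that per-message LIWC scores for two categories are available as plain lists for the TP and FP groups; the variable names and numbers are hypothetical, and the normalization step of Section 4.3.2 is not reproduced here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-message scores for two LIWC categories (not real data).
tp_posemo = [1.0, 2.0, 3.0, 4.0, 5.0]
tp_reward = [0.5, 1.5, 2.0, 3.5, 4.0]
fp_posemo = [1.0, 2.0, 3.0, 4.0, 5.0]
fp_reward = [2.0, 0.5, 3.0, 1.0, 2.5]

# Difference between the groups, the quantity inspected for each pair.
pearson_gap = pearson(tp_posemo, tp_reward) - pearson(fp_posemo, fp_reward)
spearman_gap = spearman(tp_posemo, tp_reward) - spearman(fp_posemo, fp_reward)
print(f"Pearson TP-FP gap:  {pearson_gap:.3f}")
print(f"Spearman TP-FP gap: {spearman_gap:.3f}")
```

A large gap for one coefficient but not the other is the situation that, in the discussion below, prompts the remark that a result may be data-dependent.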
Suicidal
One of the category pairs for which the difference was the highest, both for Pearson and Spearman correlation in the group of highly suicidal subreddits (Table 8), was Positive emotions (love, nice, sweet) and Reward-related expressions (take, prize, benefit). Although at first glance this might seem contradictory, a greater frequency of positive emotions is in fact a well-known characteristic of farewell letters. These often express gratitude towards family members and friends, who have supported the individual for a long time. Moreover, texts written by individuals who have made the final decision often contain phrases related to achieving a goal, which is often expressed with reward-related phrases. The difference is visible in both Pearson and Spearman, yet weaker in the latter.
Some of the examples include: “Finally got an opportunity to do it. Gonna throw myself off a building on [DATE]. Thank you guys. Your’e the only reason I made it this long”, “I put my shotgun to my chin and pulled the trigger (not bullet in the chamber) and it felt good”, or “i’m going to kill myself before i turn 20. it’s the best option for me. end it early while i still can”.
On the other hand, there were also cases where the correlation appeared by mistake. For example, in the sentence “Corona take me away. I no longer want to be alive”, the word ‘corona’ (crown) was considered by LIWC as a reward. In addition, in questions such as “Is this enough to successfully kill myself?”, the word ‘successfully’ triggered both positive emotions and rewards, which, given the context, is not an appropriate interpretation.
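The mechanism behind such spurious matches can be illustrated with a toy, context-free dictionary lookup. The sketch below is not the LIWC software or its dictionary: the word lists are hypothetical and contain only the words mentioned in the text, and scoring a category as the percentage of tokens that match it is an assumption made for illustration.

```python
import re

# Hypothetical mini-dictionary; the real LIWC dictionary is far larger.
TOY_DICT = {
    "posemo": {"love", "nice", "sweet", "successfully"},
    "reward": {"take", "prize", "benefit", "corona", "successfully"},
}

def category_scores(text):
    """Percentage of tokens matching each category, ignoring all context."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        category: 100.0 * sum(token in words for token in tokens) / len(tokens)
        for category, words in TOY_DICT.items()
    }

first = category_scores("Corona take me away. I no longer want to be alive")
second = category_scores("Is this enough to successfully kill myself?")
print(first)
print(second)
```

Because each token is looked up in isolation, ‘corona’ and ‘take’ raise the reward score of the first sentence, and ‘successfully’ raises both scores of the second, regardless of what the sentences actually mean.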
The use of emoji (graphical emoticons) together with Netspeak (internet slang) suggests the user’s lighthearted attitude towards the written content, thus it can be considered as a valuable negative predictor (
Table 9).
Similarly to emojis with internet slang, the use of emoji together with fillers (aaa, uhmm, etc.) suggests a lack of decisiveness and in general a lighthearted attitude (
Table 10). Thus, despite no correlation in TP, it had a sufficiently strong correlation in FP to consider it as a valuable negative predictor.
A good example of the fact that one category alone does not have a predicting capability is the correlation between Internet slang and motion-related expressions (arrive, car, go) (
Table 11). Despite the occurrence of the popularly accepted Netspeak, in suicidal texts, the slang is usually of much lesser intensity, yet with the use of motion-related expressions, it shows a sufficiently strong correlation in TP. Here, typical examples of Netspeak include such contractions as ‘going to’ → ‘gonna’, or ‘I dont know’ → ‘idk’, etc., while movement and motion-related expressions include ‘go [kill oneself]’, or ‘stop [living]’, as in the following sentences: “I want to die. I hate my life. And I think imma set a date and go through with it”, “im gonna hurt myself. i dont want to but i cant stop myself”, “[…] idk what im going to do next. If I end it this is my goodbye and I’m sorry”, or “I get off of work in three hours. When I get home I think I’m gonna dip. […] idk how to start saying goodbye to people without drawing suspicion”.
In suicidal messages colons (:) often represent time, either reported, or planned for suicide (“It’s 3:24 in the morning here in Portugal, and im having suicidal thoughts!”), or when used for further explanations (“[…] in a few words: I’m afraid of food. I just want to die”) (
Table 12). They are also often used in emoticons (“:(”). Together with exclamations (!) they usually express frustration, often because of someone’s bad advice (“They always say the same thing: life is wonderful, […] good sides make it worthwhile, etc… The thing is- I already know that! […] I’m just done with living”, or “here’s the bad thing: […] I have times where I just…just get pushed off the edge!”). On the other hand, suicidal Reddit posts can sometimes be very long (several thousand words), with whole life stories explained leading to the suicidal decision, and thus sometimes the two categories simply appear together without any specific direct relation. This shows another disadvantage of the LIWC, namely, the lack of structured analysis incorporating the inter-relations between categories.
Words representing Social processes (mate, talk, etc.) correlated negatively with Authenticity for both Pearson and Spearman, more strongly in FP than in TP (
Table 13).
Authenticity is defined in LIWC in the following way: “When people reveal themselves in an authentic or honest way, they are more personal, humble, and vulnerable. The algorithm for Authenticity was derived from a series of studies where people were induced to be honest or deceptive [
105] as well as a summary of deception studies published in the years afterwards [
106]”.
Since the correlation was negative, it can be hypothesized that the more the users write about their social life, the less authentic they seem, and vice versa. Although this assumption seems reasonable, the tendency of the correlation was the same in both FP and TP. It is also a reasonable result when we consider that boasting about one’s personal life in a non-suicidal context has a different quality than when a suicidal individual explains the life story leading them to the tragic decision.
However, although the tendency of the correlation for this pair was consistent, and examples are confirming the hypothesis, LIWC scores for those categories for specific Reddit posts were low compared to other category pairs. This might be due to the difference in calculating each of the two categories—Social processes are simply calculated on the basis of the number of social words in the sentence, while Authenticity is calculated using a specific algorithm.
Exclamations correlated positively with Negative emotions (
Table 14), and more precisely, with Anger-related expressions, for FP rather than TP. Firstly, it is reasonable, if not obvious, that anger is often expressed on the internet with the use of such linguistic tools as exclamation marks; therefore, the fact that there is some correlation is not surprising. It is also predictable that such a correlation will be more pronounced in FP than in TP, since truly suicidal individuals are less likely to scream their negative emotions out (and thus find relief), but rather would bottle them up. In this sense, the correlation is in line with present studies on writings of suicidal and depressive individuals. However, since the correlation is not strong even in FP, it cannot be considered a strong negative predictor, but rather a supporting one.
Death-related expressions and 1st person singular (‘I’) is the most obvious correlation to expect in suicidal and suicidal-looking texts (“I want to kill myself”, “I want to die”, etc.) (
Table 15). LIWC also properly shows that this feature is more pronounced in truly suicidal messages. However, due to the prevalence of this characteristic, it should be taken with caution, as it also occurs in plentiful numbers in pseudo-suicidal texts, such as on gaming forums (“I just died”).
High
One of the most conspicuous differences in the High-risk group was between the categories Death (words related to death: bury, coffin, kill, etc.) and Focusfuture (words related to the future: may, will, soon) (Table 16). Since the correlation in TP for this pair was sufficiently and significantly strong (0.42722 ****), while for FP there was no correlation (0.03902, not statistically significant), we can consider this correlation pair as meaningful. Often in suicidal posts, users write phrases like “I will soon kill myself”, “soon my life will end”, or “I may kill myself tomorrow”, etc.
For Spearman, the difference was not observable, which means that the strength of the correlation is not due to the order (ranking) but the actual distribution.
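How a difference can show up in Pearson but not in Spearman is illustrated below with synthetic numbers (not taken from the dataset). A single message with very high scores in both categories drives the Pearson coefficient close to 1, while the rank-based Spearman coefficient stays low because the ordering of the remaining messages is essentially unrelated. The simple ranking helper assumes there are no tied values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def simple_ranks(values):
    """1-based ranks; valid only when all values are distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(simple_ranks(x), simple_ranks(y))

# Synthetic scores: seven unrelated messages plus one extreme message.
death = [1, 2, 3, 4, 5, 6, 7, 50]
focusfuture = [5, 3, 7, 1, 6, 2, 4, 60]

r_pearson = pearson(death, focusfuture)
r_spearman = spearman(death, focusfuture)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```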
Another strong candidate was the pair of Colon (:) and Dash (-, –, ―) in the High-risk category (Table 17). The correlation was significantly strong in TP (0.50178 ****), and significantly weak in FP (0.11621 ***), which suggested a meaningful correlation in suicidal texts. However, as this correlation was between two sets of punctuation marks, we did not have high expectations for this pair.
In practice, users were, e.g., using sad emoticons (“ :-( ”, “ :( ”), which contain colons and often dashes as well, to express their sad emotions. Moreover, there was an unrelated, yet correlated, usage of colons and dashes, used in explanations of users’ mental states, or specifying time periods in personal stories (“For 3–4 days a month, I’m a mess”.)
However, Spearman’s correlation was similar for both TP and FP, which could suggest that the differences in data distribution and differences among other LIWC features caused the large difference between Pearson’s correlations. Therefore, although this category pair seems meaningful, further experiments are needed to confirm whether this result was not purely data-dependent.
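One way to probe whether the Colon and Dash counts are dominated by emoticons and time expressions is to separate these uses before counting. The sketch below is a hypothetical helper, not part of LIWC; the two regular expressions are rough assumptions that cover only simple ASCII emoticons and clock times, and the emoticon pattern would, for example, also fire on the ":/" in a URL.

```python
import re

# Rough patterns (assumptions): ASCII emoticons such as :( :-( :/ ;-) and clock times.
EMOTICON = re.compile(r"[:;]-?[()/\\|DPp]")
CLOCK = re.compile(r"\b\d{1,2}:\d{2}\b")

def punctuation_usage(text):
    """Split colon and dash counts by the role the characters play."""
    emoticons = [m.group() for m in EMOTICON.finditer(text)]
    emoticon_colons = sum(e.count(":") for e in emoticons)
    emoticon_dashes = sum(e.count("-") for e in emoticons)
    clock_colons = len(CLOCK.findall(text))
    return {
        "emoticon_colons": emoticon_colons,
        "clock_colons": clock_colons,
        "other_colons": text.count(":") - emoticon_colons - clock_colons,
        "emoticon_dashes": emoticon_dashes,
        "other_dashes": text.count("-") - emoticon_dashes,
    }

usage_a = punctuation_usage("It's 3:24 in the morning :-( in a few words: I'm afraid")
usage_b = punctuation_usage("For 3-4 days a month, I'm a mess")
print(usage_a)
print(usage_b)
```

Recomputing the correlation on the "other" counts alone would show how much of it survives once emoticons and times are set aside.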
The next potentially viable pair of categories was Apostrophes (“, ”) and Death (Table 18). We noticed that the texts in the dataset often contain words related to death, and thus the LIWC category of Death often appears in correlations. This means that although this category could provide plentiful new knowledge, it could be equally misleading.
However, this time the stronger correlation was in FP (0.35013 ****) rather than in TP (−0.00992), which would rather suggest a meaningful non-suicidal correlation. This could be useful in filtering out messages that despite showing high occurrence of death-related topics are in fact not suicidal.
A stronger correlation for words related to death and apostrophes together in FP could suggest that people sometimes put those words in quotation marks, thus indicating that they are not being serious while using them. Apostrophes could also be used to quote famous death-related passages from literature or poems. Some people might use apostrophes when citing specific ways to die.
However, the most frequently seen explanation of this correlation was in the use of death-related words with contextual use of personal pronouns and grammatical contractions (I’m, he’s, he’d, don’t, etc.), such as in the following examples: “I’m dead”, “I really wish he’d die from covid-19 or anything else”, “It’s quite the hell”, “I hope I don’t die from stress”.
However, since there was no sensible correlation found for this category pair for Spearman, it should be checked in the future whether this correlation was not data-dependent.
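The two competing explanations (quotation versus contraction) can be told apart mechanically. The following minimal sketch assumes that an apostrophe between two letters belongs to a contraction and that any other apostrophe is used as a quotation mark; both the function and this heuristic are hypothetical and not part of LIWC.

```python
import re

# An apostrophe (straight or typographic) surrounded by letters, as in "I'm" or "don't".
CONTRACTION = re.compile(r"[A-Za-z]['’][A-Za-z]")
APOSTROPHE = re.compile(r"['’]")

def apostrophe_usage(text):
    """Return (apostrophes in contractions, all other apostrophes)."""
    total = len(APOSTROPHE.findall(text))
    in_contractions = len(CONTRACTION.findall(text))
    return in_contractions, total - in_contractions

print(apostrophe_usage("I hope I don't die from stress"))
print(apostrophe_usage("he said 'die' twice"))
```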
A weaker, but meaningful correlation was noticed between the categories Periods (“.”, “…”, etc.) and Death (bury, coffin, kill, etc.) (
Table 19). The correlation was moderate in TP for Pearson (0.27252 ****) and weaker for Spearman (0.11526 **), while there was no correlation for FP.
The higher correlation was found in TP. Specific examples showed that when people write about death they use an ellipsis (…), such as “And then everything will come to an end …”, or “I’m suicidal and this is my last bit of purpose and hope …”. However, in large part, death-related words in TP were simply used in carefully written full sentences. Although a far-fetched explanation could suggest that users who are suicidal tend to use the full stop (period) character to symbolize their suicidal decision, the weak correlation for Spearman suggests that the correlation might rather be data-dependent, and thus an analysis on a different set of data would be desirable.
Another valuable correlation was between the categories Ingest (Biological processes → Ingestion, containing ingestion-related words, like: hungry, hungrier, hungriest, dish, eat, pizza) and Leisure (Personal Concerns → Leisure, containing words related to leisure activities, cook, chat, movie) (
Table 20).
Ingest and Leisure correlated significantly strongly in TP, suggesting that people write more about eating and leisure activities in suicidal messages.
Eating disorders have been correlated with depression and suicide. In fact, there were a number of examples supporting this, such as: “I’ve been dieting to help myself feel better about myself, but my mom is catching on, so she’ll probably make me eat and gain weight”, or “I have to be at work in an hour and I cant even bring myself to get ready or at least eat something”.
However, most examples suggest a misunderstanding caused by words such as “talk” (Leisure) and “drink” (Ingestion), e.g., “I dont have anyine to talk to, so I will drink myself to death/swallow a bunch of pills and die.” Although in principle this confirms the correlation, considering “lack of having someone to talk to” as Leisure and “drinking”, in the sense of being drunk with alcohol, as Ingestion, seems more likely to be caused by a lack of contextual processing in the LIWC software.
This is confirmed in Spearman, where for TP and FP the two categories were similarly correlated (TP = 0.35925 ****, FP = 0.33136 ****). Spearman also shows moderately strong correlations, which suggests that the distribution (many highly scoring samples) influences the results.
A strong candidate for a valuable correlation was between Death (words related to death, such as: bury, coffin, kill, etc.), and Informal language (
Table 21). Additionally, the Informal language category is a super category for several sub-categories such as Swear words (fuck, damn, shit, etc.), Netspeak (btw, lol, thx), Assent (agree, OK, yes), Nonfluencies (er, hm, umm), and Fillers (I mean, you know). Therefore, we checked additionally whether one of these categories is more influential than others.
In general, the correlation was significantly strong for TP (0.44829 ****) and significantly weak for FP (0.11313 ***), which suggests that more informal language is used with death-related words in actual suicidal texts. This is a good candidate for a strong predictor.
However, a closer look at the Informal language subcategories revealed that there was no visible pattern of correlation for any of the separate sub-categories. Only when all of them were combined was a clear correlation visible. Thus, the super category as a whole is a stronger predictor than any of the sub-categories. This is an interesting discovery in the sense that a too fine-grained analysis might sometimes blur the image, while a more coarse-grained look at the data might reveal some interesting findings. Unfortunately, there was no meaningful correlation for Spearman, thus there is no confirmation for this correlation in rank-based correlation. Typical examples for this correlation were, e.g., “I just want to fucking kill myself” (Death with Swear words), or “I wanna die now” (Death with Netspeak).
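That a super-category can correlate clearly while each of its sub-categories correlates only weakly can be reproduced with a small synthetic example. The numbers below are invented for illustration and do not come from the dataset: in each message the informal words happen to fall into a different sub-category, so no single sub-category tracks the Death score well, while their sum does.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic per-message scores (hypothetical values).
death = [1, 2, 3, 4, 5, 6]
subcategories = {
    "swear": [1, 0, 3, 0, 0, 6],
    "netspeak": [0, 2, 0, 0, 5, 0],
    "fillers": [0, 0, 0, 4, 0, 0],
}

# Super-category score as the sum of its sub-category scores.
informal = [sum(scores) for scores in zip(*subcategories.values())]

sub_r = {name: pearson(death, scores) for name, scores in subcategories.items()}
total_r = pearson(death, informal)

for name, r in sub_r.items():
    print(f"Death vs {name}: {r:.3f}")
print(f"Death vs informal (sum): {total_r:.3f}")
```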
An interesting correlation was observed between Death and Discrepancy (should, would, etc.). There was a strong positive correlation in TP (0.51509 ****), and weak in FP (0.22868 ****) (
Table 22). This was clearly visible in suicidal messages (TP), where users often used phrases like “I should kill myself”, or on the other hand, ask questions like “Should I kill myself?” Thus, the correlation could be stronger there. In FP users also use such phrases (thus the correlation exists), but in a metaphorical sense, and the co-occurrences are not that frequent (thus the correlation is lower in FP). Spearman’s correlation did not show a strong correlation, but the tendency was similar (TP > FP).
Death also strongly correlated negatively with Word count for Spearman (
Table 23). This would suggest that when sending actual suicidal messages and talking about death, users tend to write shorter sentences. Pearson revealed a similar yet weaker correlation; however, since the correlation is stable, we can say that this could be a strong predictor. Shorter sentences could be more indicative of a suicide attempt.
On the other hand, Colons (:) strongly correlated with Exclamation marks (!) in FP for Spearman (
Table 24). Since the correlation is stronger for non-suicidal messages, this information should be treated rather as a disambiguator than a predictor.
Since it is easy to recognize each category in the text (colon, excl. mark), the information could help detect FPs. Unfortunately, the strength of the correlation is not high enough to simply use it as a rule for detecting FP. However, it could be useful to check if the presence/absence of those two features together changes the results in machine learning approaches to suicidal text detection. Pearson showed a similar, although weaker tendency in correlation.
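A presence/absence check of this kind could be encoded as a few binary features and passed to a classifier alongside other inputs. The sketch below only shows the feature extraction; the feature names are hypothetical, and whether such features actually help is exactly the open question raised above.

```python
def punctuation_features(text):
    """Binary features for colon, exclamation mark, and their co-occurrence."""
    has_colon = ":" in text
    has_exclamation = "!" in text
    return {
        "has_colon": has_colon,
        "has_exclamation": has_exclamation,
        "colon_and_exclamation": has_colon and has_exclamation,
    }

print(punctuation_features("here's the thing: I already know that!"))
print(punctuation_features("I just want to stop living"))
```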
A moderately strong negative correlation was observed between the categories Death and 1st person plural, or “we” in short, which contains such words as “we”, “us”, “our”, “ours”, and is a subcategory of Personal pronouns (Function Words → Total pronouns → Personal pronouns → 1st person plural) (Table 25).
Both correlations are negative, but the only considerably meaningful one is for TPs. This means that the fewer 1st person plural pronouns and the more death-related words there are in a message, the more probable it is that the message is actually suicidal.
From the psychological point of view, “we” is a word that builds community, and thus also strength. People who want to commit suicide usually feel alienated, alone, and lonely, and thus would use such words less often. This tendency is clearly visible in the examples, although it must be noted that 1st person plural pronouns do appear in suicidal messages from time to time.
Similarly to the correlation between Death and We, there was a weak, yet potentially meaningful correlation between Death and 3rd person singular pronouns, or in short “Shehe”, which is also a subcategory of Pronouns (Function Words → Total pronouns → Personal pronouns → 3rd person singular), and contains such words as “she”, “he”, “their”, etc. (
Table 26).
TPs are moderately characterized by a higher frequency of death words and, at the same time, a lower frequency of she/he pronouns. An explanation for this could be that users who want to commit suicide would usually talk about themselves, and thus would use she/he pronouns less frequently. However, even in our data, there were many people who wrote that they want to commit suicide because of someone else (family member, harassers, etc.). These people use she/he pronouns more often.
Medium
High positive correlation for FP suggests that users express their feelings in a variety of ways, including rhetorical use of suicidal phrases like “I want to die” in a non-suicidal context, often connected to health-related complaints (
Table 27).
Messages typically included phrases like “I hate my life”, with “Bio” mainly referring to the ‘life’ keyword, which was not particularly representative. Other examples included, e.g.: “I really need help, I feel so alone so sad. Don’t wanna be alive no more. Need drugs to live because nobody can love me. This hurts”, or “hope that i get coronavirus and die from it because of how shitty my life is going lately”.
Anger, which is related to affect, showed a strong positive correlation with Biological processes in non-suicidal messages (FP) (
Table 28). In rhetorical FPs, users expressed their anger (also related to health and/or their looks) by saying they want to die, e.g., “The pain is so big that I feel I want to die”, which should not be treated literally. In TPs, they were expressing anger while adding that they wanted to die in a literal sense.
The majority of the examples included the ‘life’ keyword as representative of the Bio category. Some others included ‘face’, which is more representative, or ‘sick’ in the sense of ‘being sick of something’, also ‘to face something’ as a verb, etc., or ‘guts’ as in “I don’t have guts to do it”. Crucially, ‘bio’ often did not refer to a specific biological state (‘life/death’ are too general), but was annotated on words used in metaphorical meaning.
Some examples from FP include: “I fucking hate my life”, “Every girl that sees my face has no interest. Why shouldnt I kill myself?”, “I have no idea why I’m anxious, it’s a week now. I’ve not been to uni[versity] either. I have no idea what the hell is wrong with my brain. I couldn’t sleep much, now I can’t get up from bed. I fucking hate it”.
While actual suicidal messages (TP) included: “Time and time again each day I am reminded of how much of an ugly loser I am. I have no real life friends. I do not deserve to be happy or have a good life. I deserve to kill myself and suffer”, “I’ve been watching Unbelievable on Netflix and I just have so many traumatic memories of my assault/rape flooding back. I just want to die. I’m feeling impulsive today and I’m afraid I might do something rash. […]”
Similarly to the correlation Anger with Bio, Anger also correlated with the Health category in FP (
Table 29).
This is due to the fact that in FP, users express their anger (often related to health issues) by stating rhetorically that they want to die, e.g., “The pain is so big that I feel I want to die” etc., which should not be treated literally. An additional example for this would be: “I don’t want to kill myself but sometimes (every night) the pain gets so bad that I cry myself to sleep again. I don’t want to die but I am going to end up killing myself because the pain is too much”. Additionally, in some contexts users expressed their wishes to die of some sickness (often coronavirus), such as in: “I honestly hope that i get coronavirus and die from it because of how shitty my life is”.
High positive correlation between Anger related expressions and 1st person pronouns in FPs suggests that users either express their own anger towards some external object (“I hate …”.) or express anger towards themselves (
Table 30). Typical examples from FP confirm that, as in: “I hate my fucking life”, or “I fucking hate dating I’m going to die alone”.
As this correlation is more represented in FPs, this pair of categories could be a useful disambiguator for messages in Medium risk group.
Biological processes (eat, blood, pain) and Negative emotions (hurt, ugly, nasty) correlate positively in FP more than in TP (
Table 31). In FPs, users often express their negative emotions related to health issues by saying metaphorically that they want to die, e.g., “The pain is so big that I feel I want to die” etc. Like in many cases for correlations with the Bio category, the meaning of biological processes is broad and includes ‘life’ as well as metaphorical references to body parts. All FP cases of “I want to die” represent this category as well.
Examples from TP include: “I hate my fucking life”, “Even with medication i’m still convinced that everyone hates me […] I hate my life why this disease had to choose me?”. While examples from FP, where the correlation is much stronger, include, e.g.: “I want to die I can still pass the course but this is so fucking stupid oh my god im so sad :/”, or “My life is just shit. I am terrible at school, my parents always yell at me, I am sad all the goddamn time for no reason at all. All I want to to is bake bread and die”.
The correlation for Death and Discrepancy (should, would—function words related to planning) is sufficiently strong for TP for both Pearson and Spearman (
Table 32).
Although the Pearson correlation for FP is also strong, it is negligible in Spearman; thus, we can say that the Pearson correlation for TP is the more meaningful one and can be considered a strong predictor. Examples from TP include: “I should just die”, “I just want to die”, or “I’m planning on killing myself tonight. I keep fucking things up for everyone. And I’m a freakshow. It would be better if I was dead”. FPs include either rhetorical uses of ‘I wanna die’, or speculations that one could die unintentionally, such as in: “Plus I’m stuck at work with comprised health. I could be dead soon”.
1st person pronoun (I, me, mine) moderately correlates with Sexual words (horny, love, incest) for FP (
Table 33). In general, sexuality is more common in FPs, which would suggest that users talk about their own sexuality and sex-related problems and desires, which may coincide with rhetorical figures of wanting to die. However, few examples are actually sexual; usually the word ‘fuck’ is simply used as an expletive.
A similar situation to the correlation between Sexual words and 1st person pronoun was observed between Sexual words and Personal pronouns (I, them, her). Personal pronouns is a super-category of 1st person pronoun, which influenced the result in this case, and thus this pair need not be considered separately (
Table 34).
Low
Compared to the previous risk group categories, messages in Low are mostly non-suicidal, and those which appear in this group can be expected to have different characteristics than the messages which appear on Reddit channels typically used for suicidal posts. Therefore, we expected vividly different LIWC categories to stand out.
A category pair for which a vivid correlation was found was the pair of Health (containing words such as clinic, flu, pill), which is a subcategory of Biological processes concerning health and illness, and Money (containing words such as audit, cash, owe), which is a subcategory of Personal Concerns dealing with financial matters and the worries that come with them (
Table 35).
There was a strong positive correlation in TPs with a large difference between TPs and FPs, and no correlation in FPs. It would seem that in the TPs from the Low category people tend to discuss matters related to financial or health issues or both (large costs of treatment leading to a difficult financial situation). These descriptions might function as explanations for their suicidal state.
Some of the examples contained people mentioning their struggles with health or financial matters. There were also samples which combined the two issues: “I can’t afford to go to the doctor” or “My parents obligated me to go this september we do not even have the money and I am failing most of my exams and I just want to stop living”. Mentioning these difficulties serves to contextualize the authors’ suicidal tendencies and provides an explanation for their mental state. On the other hand, this pair also includes certain messages that do not correspond to discussions of money or health-related issues such as “Life isnt worth living”, in which the Money category is realized in a metaphorical meaning.
A correlation similar to the above pair of Health and Money was found for the pair of Money and Bio (Table 36). Bio represents Biological processes, a larger category concerning health, body, sexual and eating matters, which includes words such as eat, blood, or pain.
The Pearson correlation for this pair in TP was significantly moderately strong and the difference was large, with no correlation in FP. This pair and its correlations are rather similar to the previous one, but rather than just health issues, the discussions here are somewhat broader and revolve around sexual issues or eating disorders.
Due to the similarity between this pair and the one preceding it, there is a lot of overlap between the two. The content of the messages correlated with this category pair is somewhat broader compared to the previous one, with the first example listing the ways in which the author prepared their body as part of their suicide plan. The same issue with the metaphorical use of money-related phrases as in the previous category pair was also noticeable here, for instance in phrases such as “Life isnt worth living”, which also correlate with it.
Another pair that stood out was Informal language (a general super-category spanning different subcategories of informal language such as netspeak, nonfluencies, etc.) and Fillers (I mean, you know), which is one of the subcategories of Informal language (
Table 37).
A strong positive Pearson correlation in TPs, a weak positive correlation in FPs, and a large difference between them (with a smaller difference yet similar tendency for Spearman) point simply to a correlation between a super-category and its subcategory. The pair correlates more strongly in TPs, which could suggest that people are more informal in TPs, perhaps as a coping strategy. Alternatively, since they are writing on subreddits in the Low risk group, there may be a need to tone down the serious content of their posts, as in “anyways i wanna die”.
Many of the messages in this correlation pair are quite short and indeed use informal language or fillers, but their content does not seem to support the above hypothesis. They are rather straightforward suicidal messages that are simply written in a fairly informal way; nothing about them suggests that the authors are trying to tone down the serious content of these messages. At the same time, many of the messages correlated with this pair are also quite long and do not read as particularly informal, exhibiting proper use of punctuation, standard vocabulary and full sentences. They use a fair number of contractions, but even there the amount is not unusually high.
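The remark that a super-category and one of its subcategories correlate by construction can be checked with a short simulation. The sketch assumes, purely for illustration, that the Fillers score and the rest of the Informal language score are independent with equal variance; in that case the expected part-whole correlation is 1/√2 ≈ 0.71 even though the two parts are unrelated. The values are random draws (which may be negative, unlike real LIWC scores), not data from the study.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the run is repeatable
fillers = [rng.gauss(0, 1) for _ in range(2000)]
other_informal = [rng.gauss(0, 1) for _ in range(2000)]
informal = [f + o for f, o in zip(fillers, other_informal)]

r_parts = pearson(fillers, other_informal)   # close to 0: the parts are unrelated
r_part_whole = pearson(fillers, informal)    # close to 1/sqrt(2): induced by the sum
print(f"Fillers vs other informal: {r_parts:.3f}")
print(f"Fillers vs Informal total: {r_part_whole:.3f}")
```

A TP-FP difference for such a pair is therefore more informative than the raw strength of the correlation in either group.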
The second pair related to Fillers and Informal language in general was the pair of Fillers and a more specific category from Informal language, namely Nonfluencies (er, hm, umm), which is also one of the subcategories of Informal language (Table 38). There was a strong positive Pearson correlation in TPs, and no correlation in FPs, with a large difference.
The hypothesis in this pair is nearly identical to the previous one and so are the conclusions with the caveat that in general the messages in this pair appear to be longer on average than in the previous one. At the same time, apart from the use of certain nonfluencies or informal language in general, they do not stand out in any particular way in terms of the overall content of the messages.
Another pair for which there was a larger difference in correlations between TP and FP was Body (cheek, hands, spit, etc.) which is a subcategory of Biological processes concerning specifically the body, and body parts, and Power (superior, bully, etc.), which is a subcategory of Drives, containing words dealing with power dynamics (
Table 39).
A positive, yet not overly strong correlation in TPs and no correlation in FPs suggests the appearance of discussions of one’s body and issues related to it, as well as potential imbalances of power between interlocutors or characters in a story described in a Reddit message.
Although many messages adhere to the above explanation, a closer look at many other messages revealed that the power-related words in these messages often have little to do with actual power imbalances between people. Rather, the authors use them to discuss their own powerlessness, as in “I’m ugly unlovable and apparently unloving and unbearably annoying no one would want me and I must die soon cuz I can’t be alone anymore I just can’t someone help me kms”. One other way in which body-related words appear in these messages is the discussion of the specifics of suicide plans, as in: “So I guess I’ll just jump, and then snap my neck”.
Common Adjectives (free, happy, long) also correlate with Swearing (fuck, damn, shit—subcategory of informal language) more for TP than FP (
Table 40).
Specifically, there was a moderately strong positive correlation in TPs, and a weak positive correlation in FPs. This suggests that the posts in TPs might be more descriptive with a higher frequency of swear words, which could signify frustration and anger.
This pair shows a straightforward correlation with the adjectives present in these messages, describing how the authors feel about themselves, their mental state or their outlook on the world and life. The swear words on the other hand also function as adjectives (as in: “The world is so fucked up”) or serve to strengthen the overall message and express anger, frustration or helplessness, as in: “I want fucking die”, “i wanna kill myself how’d i let myself be sick goddamit”, or “I’m fucking tired of living …”.
Adjectives correlate with Sexual words (horny, love, incest; a subcategory of Biological processes covering terms concerning sex and sex-related matters) similarly as with Swearing (Table 41). The strong positive correlation in TPs and the weak positive correlation in FPs suggest that the posts in TPs might be more descriptive with a higher frequency of sex-related words, which could mean that long and specific descriptions of sex-related issues are more common in TPs.
Descriptions of sex-related issues can also be found among the messages representative for this pair. Usually, they serve as an explanation for the author’s suicidal tendencies, as in: “I don’t think I’m a good person. I was raped when I was six years old and raped again on my 21st birthday”. Sexual words also function as a form of verbal self-harm, as in: “I’m not even strong enough to commit suicide I’m a fucking failed abortion But tonight Im actually doing goodbye world”.
Unfortunately, the messages representative for this pair also reveal a weakness of the LIWC tool—there is a major overlap between the messages in this pair and the adj_swear pair, in cases where “fuck” or “fucking”, used simply as swear words, are also interpreted as sexual words. This can be seen quite clearly in the following example: “fucking fuck why am i sad at home this fucking early shit fucking fuck. i dont know why im sad. i just … am. nothing bad’s happened. nothing big happened, so why the FUCK am I sad. FUCK I HATE LIFE”.
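The overlap described above can be illustrated with a minimal sketch. The lexicon below is a made-up toy and not the actual LIWC dictionary; the category names `swear` and `sexual`, the word lists, and the helper `category_counts` are assumptions introduced only for illustration. The sketch shows how a word listed under two categories raises both counts at once, so the two categories appear to co-occur even when only one sense is intended.

```python
import re

# Toy lexicon (hypothetical, NOT the real LIWC dictionary).
# "fuck" and "fucking" are deliberately listed under both categories.
TOY_LEXICON = {
    "swear": {"fuck", "fucking", "damn", "shit"},
    "sexual": {"horny", "incest", "fuck", "fucking"},
}


def category_counts(text, lexicon=TOY_LEXICON):
    """Count, per category, how many tokens of `text` are in that category's word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {cat: sum(tok in words for tok in tokens) for cat, words in lexicon.items()}


# Fragment of the example message quoted in the text.
counts = category_counts("why the FUCK am I sad. FUCK I HATE LIFE")
print(counts)  # {'swear': 2, 'sexual': 2}
```

In this toy setting every swear-only use of the word is also counted as sexual, which is the kind of double counting that makes the adj_swear and adj_sexual pairs overlap.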
Interestingly, Anger-related words (hate, kill, annoyed), a subcategory of Psychological processes, correlate strongly and positively in the Low risk group with Biological processes (eat, blood, pain, etc.), a super-category concerning health, body, sexual and eating matters (Table 42).
A strong positive correlation in FPs and a weak positive correlation in TPs suggest a combination of words that can describe both a negative physical state and violent video games, e.g., in the community of gamers.
This category pair is more common in FP posts; however, it does not correspond only to gaming-related messages. In a major part of the FP messages for this pair, the authors simply vent their frustrations and anger about the world, particularly in sentences starting with “I hate …”. The TP messages for this pair are somewhat similar to those found in the adj_swear and adj_sexual pairs, with swear words signifying anger. A typical example is: “I’m either killing myself or someones else. I don’t fucking care about anything anymore, not a single fucking thing and I’m tired of this shit I’m fucking ending it”. On the other hand, FP messages (similar in wording to suicidal, but not actually suicidal) include: “I hate my fuckin life lol”, “fuckin hate living in this house”, or “I’m fucking dying lmao” (humorous or sarcastic).
The last pair of categories for which there was a sufficiently strong correlation in TP and a non-negligible difference between TP and FP was Home (kitchen, landlord), a subcategory of Personal concerns covering matters related to home and maintaining it, and Leisure (cook, chat, movie), another subcategory of Personal concerns containing words related to relaxation (Table 43).
A strong positive Spearman correlation and a weak positive Pearson correlation in TPs, with a large difference from FPs, could suggest that these terms are brought up in TPs more often either due to issues with the authors’ home situation or because of an inability to relax. It could also signify that the authors have certain issues in these areas or find that the situation at home does not support them or help them with their mental issues.
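A gap between a strong Spearman and a weak Pearson correlation, as reported for this pair, can arise when two variables rise together but not linearly, for example when a few messages have extreme values. The following sketch illustrates this with made-up numbers; the lists `home` and `leisure` are hypothetical per-message frequencies, not values from the dataset, and the helper functions are a plain-Python illustration rather than the implementation used in the study.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical frequencies: the ordering is perfectly preserved,
# but each variable has one extreme value at a different end.
home = [0, 50, 51, 52, 53, 54, 55, 56]
leisure = [0, 1, 2, 3, 4, 5, 6, 100]

print(round(spearman(home, leisure), 3))  # close to 1: identical ordering
print(round(pearson(home, leisure), 3))   # much lower: the relation is not linear
```

Because Spearman only looks at ranks, it is insensitive to the two extreme values, whereas Pearson is pulled down by them. Whether this mechanism explains the Home and Leisure result would need to be checked against the actual data.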
The particular examples show that this category pair does correspond primarily to home-related topics, usually in the form of concern or issues with the authors’ family. Users write about their desire for normal family life or friendship as well as their desire not to cause the family any problems or concern, such as in the following example: “I don’t want to live anymore and if that takes me then my family may see me as less of s failure”. If leisure is clearly mentioned, it is typically in the context of it being unattainable, due to circumstances having to do with home, family or other people, as in the following example: “But my misophonia sucks. I don’t have my room, live in small house and everyone has to always shitting eat or drink something or somwthing that triggers my misophonia and everyone except my brother is a trigger… […]”
All
We also analyzed the correlations for all texts, without dividing them into risk categories. This was done with two goals in mind. Primarily, we wanted to see which correlations are novel and most general overall, not specific to risk categories. Secondly, we wanted to confirm the strongest and potentially most informative correlations from the category-based analysis. Specifically, the correlations that reappear in this overall analysis would be those to which suicide counselors and practitioners should pay the most attention when assessing a suicidal message.
The division into four risk categories (Suicidal, High, Medium, Low) was also an important step from the point of view of performing information triage [107] for future use by suicide experts and counselors. Information triage is a recently developed approach to the processing of information on social media with the goal of efficiently providing useful information to experts for more detailed analysis. For example, Ptaszynski et al. [107] performed information triage on Twitter during natural disasters (earthquakes, floods, etc.) to extract only tweets that were sent by actual victims of the disasters during the event, with the goal of providing this relevant information directly to rescue teams.
In the context of suicidal messages, the division into risk categories provides additional information to suicide experts and counselors, which can help disambiguate pseudo-suicidal messages initially classified as potentially true suicidal messages. However, analyzing the correlations from the overall perspective is also important. Risk category-based analysis divides the whole dataset into smaller risk group portions, which means that the correlations in risk groups are calculated on smaller samples. The overall analysis gives a more holistic view of the language used by suicidal users and can reveal new correlations that are lost in the risk group-based analysis. Additionally, those reappearing in the overall analysis can be considered the most valuable in general. We acknowledge, however, that the reappearance of a correlation for a category pair in the overall analysis could be caused by differences in the numbers of samples across risk groups; namely, correlations from more populated groups are more likely to reappear in the overall analysis. Moreover, the overall analysis also confirms the validity of the risk group-based analysis, as most of the correlations that appeared in the overall analysis involved only one category, namely Word Count, which is not sufficiently informative compared to other, more vocabulary-specific categories.
Correlations Confirmed from Risk Group-Based Analysis
There were two category pair correlations which reappeared in the overall analysis, namely “Death and Discrepancy” and “Death and Focusfuture”. This confirms that death-related vocabulary is pervasive in suicidal messages. However, in practice, both of those correlations are most often represented by two simple sentence examples (and their various modifications), which notoriously appear in suicidal messages, namely, “I should kill myself” (Death [kill] and Discrepancy [should]) and “I will kill myself” (Death [kill] and Focusfuture [will]).
The correlation between Death and Discrepancy was the strongest of all category pairs in the overall analysis for True Positives (actual suicidal messages), which suggests that it can be considered a strong predictor (Table 44). One reason for its strength is that it also appears in the High and Medium risk categories, which makes it a highly populated category pair. Importantly, however, the correlation is even stronger in All than in High and Medium, which suggests that it also appears to some extent in other risk groups but is lost there among other correlations due to the high cut-off threshold (see Section 4.3.2).
The second correlation was between Death and Focusfuture, which also appears in High. Here, however, the correlation is weaker, which suggests that there were not many additional cases of this correlation in other risk groups that would strengthen the overall correlation (Table 45). Still, alongside Discrepancy, Focusfuture forms one of the most pervasive category pairs with Death, and as such can be considered a valuable predictor.
New Correlations Not Appearing in Risk Group-Based Analysis
The only newly appearing Pearson correlations were between Word count and various subcategories of Psychological Processes (Anxiety, Sadness, Feel, Risk, Ingestion, Body) and Personal Concerns (Health, Home). All of them were negative predictors, or disambiguators, meaning that the correlations were stronger for False Positives (messages that look suicidal but are not) (Table 46). All correlations for FP were also positive, which would suggest that if someone is talking about their concerns and problems related to psychological processes, then the longer the message is, the less likely they are to commit suicide. This is consistent with situations where the user writes the message to relieve stress or seeks help with solving such problems. However, practice shows that long messages containing such expressions of concern are often accounts of whole life stories, which can represent a farewell letter of an actual suicidal user. This finding should therefore be treated with caution, especially since Spearman correlations do not confirm the tendencies in all cases.
Rejected Correlations
Pronouns and Rewards correlate more strongly in FPs in suicidal subreddits; however, we were not able to find any meaningful, coherent, and true relationships in the use of those two categories (Table 47).
Exclamations and Affect seem to correlate more strongly in FPs from the suicidal risk group (Table 48). However, Affective processes (happy, cried, etc.) typically occur in sentences where the author is expressing sincere emotions, which, when expressed with high intensity, often coincide with exclamation marks. Therefore, as this correlation is too general and could occur in any circumstances, with no particular sentence pattern, we decided not to consider it as a predictor.
One of the potentially meaningful correlations for which there was a strong difference between TP correlations and FP correlations was the category pair of Apostrophes (“, ”, ‘, ’) and Personal concerns (topics that people could consider personal concerns, such as Work, Leisure, Home, Money, Religion, Death) in the High risk group (Table 49).
Since this is a general category, one could expect a similar difference between Apostro and one of the more specific subcategories within Personal concerns, such as Death. And indeed this was the case (see in paragraph High above).
The difference between the correlations for Apostro_persconc was large (0.41277), so if at least one of the correlations had also been strong (≥0.4 or ≤−0.4), we could have considered searching for deeper meaning in this result. However, both correlations, although statistically significant, were weak: slightly negative for TP (−0.16251) and slightly positive for FP (0.25026). We cannot expect anything meaningful from such correlations; i.e., no specific characteristic can be expected for either TP or FP in this case. Since the difference of correlations for Spearman was negligible, this potential correlation was rejected.
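The reasoning applied to this pair can be written down as a simple screening rule. The sketch below is only an illustration of that rule: the function name `is_candidate_pair` and the minimum-difference threshold `min_diff` are assumptions made here (the text states only that the difference was "large"), while the strength threshold of 0.4 and the two correlation values are taken from the paragraph above.

```python
def is_candidate_pair(r_tp, r_fp, strong=0.4, min_diff=0.3):
    """Screen a category pair by its TP and FP correlations.

    A pair passes only if the TP/FP gap is large AND at least one of the
    two correlations is strong in absolute value.
    `min_diff` is a hypothetical threshold chosen for this sketch.
    """
    diff = abs(r_tp - r_fp)
    has_strong = abs(r_tp) >= strong or abs(r_fp) >= strong
    return diff >= min_diff and has_strong


# Pearson values reported in the text for Apostro_persconc (High risk group).
r_tp, r_fp = -0.16251, 0.25026
print(round(abs(r_tp - r_fp), 5))     # 0.41277
print(is_candidate_pair(r_tp, r_fp))  # False
```

The pair fails the screen because neither correlation reaches 0.4 in absolute value, even though the gap between them is large. The rule covers only this first screening step; pairs that pass it were still checked against concrete examples and the Spearman results before being accepted.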
Two categories that were potentially strong candidates for predictors were Focusfuture and Time-oriented language (TimeOrient) in High (Table 50). There is a realistic probability that these two categories correlate simply because TimeOrient is a super-category of Focusfuture (may, will, soon), along with two other categories: Past focus/focuspast (ago, did, talked) and Present focus/focuspresent (today, is, now). However, a strong positive correlation for those categories in TP (0.54688 ****) with a weak one in FP (0.21270 ****) suggests that users tend to speak more about the future in actual suicidal texts.
This tendency was confirmed for Spearman (TP = 0.32289 ****, FP = 0.20906 ****). This further confirms the relationship between the two categories, but since one is a subcategory of the other, the relationship is too obvious and thus not very informative.
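Why a subcategory tends to correlate with its own super-category can be shown with simulated data. The sketch below assumes, purely for illustration, that the super-category score is the sum of its three subcategory scores and that the subcategories are independent random values; neither assumption is taken from the LIWC documentation or from the dataset, and all variable names are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000                # number of simulated messages

# Three independent, simulated subcategory scores per message.
focus_past = [rng.random() for _ in range(n)]
focus_present = [rng.random() for _ in range(n)]
focus_future = [rng.random() for _ in range(n)]

# Assumed super-category score: the sum of its subcategories.
time_orient = [a + b + c for a, b, c in zip(focus_past, focus_present, focus_future)]

r_parts = pearson(focus_future, focus_past)   # part vs. unrelated part
r_whole = pearson(focus_future, time_orient)  # part vs. the whole containing it
print(round(r_parts, 2), round(r_whole, 2))
```

Even though the simulated subcategories are unrelated to each other, the subcategory correlates clearly with the total that contains it. This is why a correlation between Focusfuture and TimeOrient carries little information on its own.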
There was a noticeable, significant moderate correlation in TP between Money (audit, cash, owe) and Negations (no, not, never) (Table 51). However, in practice, it typically involved phrases like “isn’t worth” (as in “Life isn’t worth living”). Although this is a proper suicidal phrase, the frequent metaphorical use of money-related terms did not allow us to consider this category pair a good predictor for suicidal texts.
Although there is a sufficiently strong correlation between Exclamation markers and Question markers, for both Pearson and Spearman in TP, it is doubtful whether correlations based on punctuation alone should be considered for content as meaning-heavy as suicidal texts (Table 52).
Although there is a significant moderate correlation for Emoji and emoticons with Colons for FP, we can consider it a trivial correlation as there are many emoticons containing colons (“:-)”); thus, this correlation is rejected (Table 53).
Similarly to Emoji, Colons also correlate with URLs, which is obvious as colons typically appear in URLs (“http://www…”); thus, this correlation can be safely rejected (Table 54 and Table 55).