A Method for Identifying the Mood States of Social Network Users Based on Cyber Psychometrics

Abstract: Analyzing people's opinions, attitudes, sentiments, and emotions based on user-generated content (UGC) is feasible for identifying the psychological characteristics of social network users. However, most studies focus on identifying the sentiments carried in micro-blogging text, and there is no ideal calculation method for users' real emotional states. In this study, the Profile of Mood State (POMS) is used to characterize users' real mood states, and a regression model is built based on cyber psychometrics and a multitask method. Features of users' online behavior are selected from structured statistics and unstructured text. Results of the correlation analysis of different features demonstrate that users' real mood states are not only characterized by the messages expressed through texts, but also correlate with statistical features of online behavior. The sentiment-related features in different timespans show different correlations with the real mood state. The comparison among various regression algorithms suggests that the multitask learning method outperforms other algorithms in root-mean-square error and error ratio. Therefore, this cyber psychometrics method based on multitask learning, which integrates structural features and temporal emotional information, could effectively obtain users' real mood states and could be applied in further psychological measurements and predictions.


Introduction
In the psychological field of emotion science, emotions play a crucial role in people's decision-making [1]. Emotions constitute potent, pervasive, predictable, harmful, and beneficial drivers of judgment and decisions through multiple mechanisms [2]. Because mood states comprise a particularly important set of sentimental factors [3], they may have significant effects on cognitive control [4], consumer behavior [3], and investment decisions [5]. Unlike immediate emotions, a mood state is a long-term, continuous emotional state that is reflected in the process of cognitive behavior, including individual perception, learning, and decision-making [3,6].
With the prevalence of Web 2.0, user-generated content (UGC) has exploded on social networks, especially on microblogging platforms such as Sina Weibo, Facebook, and Twitter, which provide data sources for social studies. As the technology of sentiment analysis and natural language processing (NLP) has gradually developed, many researchers have investigated sentiment analysis, applying it to opinion mining [7,8] and prediction of stock performance [5]. Research attempts to acquire people's emotion or mood information by analyzing text data on social networks and extracting emotional words [9]. However, due to the anonymity of the internet, the self-disclosure of social network users is different from that in real life [10], which means the emotion expressed on the internet may not be in agreement with the users' real sentiments. Therefore, this study considered a stable and continuous emotional variable, mood state, trying to build a model to identify the mood state of social network users automatically. The mood state was defined as an indirect evaluation state that allows one to explain things and produce corresponding behaviors for the time being, according to their emotional content [11]. The Profile of Mood State (POMS) is a widely-used scale for the measurement of mood state and the abbreviated POMS used in this study consists of seven mood scales, measuring different aspects of psychological health of individuals [12]. The POMS has been used to study the influence of different mood states on the survival rate of long-term cancer patients [13]. It has been proven that there is a correlation between online behavior and psychological traits through empirical study [14]. In order to automatically identify the mood state of network users, this paper collected users' online behavior, extracted related features and adopted a multitask regression model to build the relationship between mood state and users' online behavior. 
The online information was used to calculate the psychological variables and this cyber psychometrics method focuses on making linkages from online behavior to psychological traits.
Based on cyber psychometrics, this study proposed a multitask model to automatically calculate the long-term, continuous mood state, and a natural language processing technology was used to build the mood state lexicon in order to extract emotional information. Identifying the mood state of social network users may provide some information for advertisers to improve the recommendation performance. For example, a combination of users' mood with a mood-based music recommendation could possibly make a better effect [15]. Merchants may develop different strategies since people in different moods behave differently with respect to risky decision-making [16,17]. Sentiment analysis techniques have been widely applied to the prediction and observation of political events [18], such as monitoring day-to-day electoral campaigns [19]. Thus, the automated information extraction method proposed by this study could be used to explore or track the preferences of social network users and provide reference information for political decisions as a growing number of citizens choose to express opinions and sentiments online.
On the other hand, this automatic calculation method may be widely applied in predicting other psychological variables or building pictures of users considering the rich information generated by social network users [20]. The remainder of this paper is structured as follows. The related work of sentiment analysis and cyber psychometrics method is discussed in Section 2. The experimental material and method is presented in Section 3. The feature extraction and regression result is described and discussed in Section 4. Finally, the conclusion is presented in Section 5.

Sentiment Analysis and Emotion Recognition in Social Networks
With advances in machine learning and the emergence of big data sets, the computational detection of opinions, emotions, and subjectivities in unstructured UGC (from free-formatted texts, reviews, and blogs) has been applied to research over the past decade [21]. The linguistic and machine learning approaches are two of the main methods used to detect sentiments or emotions. The linguistic method uses dictionaries or lexicons that contain pre-determined affective words to calculate the frequency of a word and determine the emotional attributes of text [22]. The machine learning method uses computer algorithms to automatically learn text sentiment, given the trained data set [23]. This study adopted a lexicon-based method to construct the sentiment-related features and a lexicon of mood state was constructed.
In contrast to emotion, which reflects short-term affects and is connected with a specific experience, the mood state reflects medium-term affects [24]. In NLP research, affects, feelings, emotions, and sentiments are often treated as similar and used interchangeably [25]. In general, moods differ from emotions, feelings, or sentiments in that they are less specific, less intense, and less likely to be triggered by a particular stimulus or event [24,26]. They differ in many ways, such as duration and time mode, and the comparison is shown in Table 1. Thus, in general, the mood state is a relatively stable variable that may influence one's behavior over a long time and could be reflected in online social behavior.

Cyber Psychometrics Method
Relying on big data, the cyber psychometrics method can help predict users' psychological characteristics through online behavior and is extensively used by researchers. It is acknowledged that microblogging information can be used to build a picture of users, and Wald et al. predicted psychopathy using Twitter content [20]. Considering the effect of personality on real life [27], Bai et al. predicted the Big-Five personality of Chinese Weibo users based on user behaviors at social network sites using the decision tree method [28]. Zhang et al. used linguistic features to predict the suicide probability of Weibo users through linear regression [29]. Golbeck et al. collected information from Twitter and predicted personality using the Gaussian process [30]. Researchers have thus made use of UGC data to predict psychological variables, but the mood state is rarely studied. In addition, when dealing with multiple related tasks, studies typically model every prediction task separately [31]. However, psychological variables are often interrelated, and a joint prediction method such as multitask learning could improve generalization performance by training tasks together and capturing their intrinsic correlation [32]. Therefore, a multitask regression method was adopted in this study to predict the mood state of Weibo users. This paper obtained the mood states of social network users through the POMS, then used the cyber psychometrics method to build a model relating online behavior to mood state, and finally calculated the psychological variables automatically based on both the structured information and unstructured text. A lexicon of mood state was constructed in this paper and used to detect the sentiment in unstructured text. The multitask regression model was adopted to calculate users' mood state automatically.

Data
For the assessment of mood states, the POMS has been widely used. It was compiled by McNair et al. in 1971 and was initially used to measure the psychological health of individuals. The POMS consists of 65 items and six mood scales (Tension-Anxiety, Depression-Dejection, Anger-Hostility, Vigor-Activity, Fatigue-Inertia, and Confusion-Bewilderment). R. Grove et al. later developed an abbreviated POMS, which contains 40 items and seven mood scales, including an esteem-related affect, and verified its reliability and validity [12]. Participants rated each item on a five-point answer scale. The score of each mood dimension equals the sum of scores of several particular questions. The total score of abbreviated POMS, called total mood disturbance, is calculated by subtracting the scores of two positive dimensions from the five negative dimensions and adding 100. Therefore, the higher the total score is, the more negative mood one may have.
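The scoring rule described above can be illustrated with a minimal sketch. It assumes that vigor and esteem-related affect are the two positive dimensions and that the per-dimension sums have already been computed; the dictionary keys and the example scores are hypothetical and are not taken from the study's data.

```python
# Subscale names are illustrative labels for the seven abbreviated-POMS dimensions.
NEGATIVE_SCALES = ("tension", "depression", "anger", "fatigue", "confusion")
POSITIVE_SCALES = ("vigor", "esteem")  # assumed to be the two positive dimensions


def total_mood_disturbance(scale_scores):
    """Return negative-scale sum minus positive-scale sum, plus 100."""
    negative = sum(scale_scores[name] for name in NEGATIVE_SCALES)
    positive = sum(scale_scores[name] for name in POSITIVE_SCALES)
    return negative - positive + 100


# Hypothetical subscale sums for one participant.
example_scores = {
    "tension": 6, "depression": 4, "anger": 3, "fatigue": 8,
    "confusion": 5, "vigor": 12, "esteem": 14,
}
tmd = total_mood_disturbance(example_scores)  # 26 - 26 + 100
```

A higher returned value corresponds to a more negative overall mood, in line with the interpretation given in the text.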
With the permission of participants, researchers gained access to their Weibo home page addresses and then obtained detailed information and blog messages using a Web crawler. The experiment was conducted between June 1 and 3, 2016 and comprised 224 users, including 110 males and 114 females. The experiment was conducted on a campus and the participants were college students. The subjects ranged in age from 17 to 28: 27 individuals were younger than 20, 101 were between 20 and 22, 92 were between 23 and 25, and 4 were older than 25. The participants consisted of 72 individuals with a bachelor's degree, 126 with a master's degree, five with a PhD degree, and 21 with education below the undergraduate level. An inactive user was defined as someone who had no more than 50 friends and no blog updates within the three months prior to the experiment. The inactive users were considered invalid and excluded, leaving 132 active users. The questionnaire results showed that these individuals were in a normal psychological state, considering that the extreme values of the POMS scores did not reach the bounds.
The design of this study is shown in Figure 1. The mood states of social network users were obtained through the Profile of Mood State. The structured data such as registration date and number of followers could be captured from the Weibo home page directly and comprised the structural features. The text information of microblogging was processed via a lexicon-based method to generate the sentiment-related features. After the feature extraction, a multitask regression model was adopted and trained.

Detecting Emotional Features of Weibo Text
There are many methods for handling semantic orientation recognition in natural language processing, including manual marking and the lexicon-based approach. Manual tagging has the highest accuracy but costs a great amount of time and resources. The lexicon-based method is widely used in sentiment analysis and many researchers have adopted approaches like word embedding to bootstrap lexicons for one particular domain or language [33,34]. When constructing a lexicon, a seed word set is created using tools like WordNet, HowNet, and a synonymy thesaurus [35]. The list of words is expanded based on seed words according to synonymous or antonymous relations. In this paper, a mood state lexicon was constructed on the basis of abbreviated POMS. The 40 adjectives in abbreviated POMS are regarded as seed words and expanded via Tongyici Cilin (a Chinese synonym thesaurus) and Word2Vec.
The Tongyici Cilin contains both synonymous and related words. The Harbin Institute of Technology Information Retrieval Laboratory produced the extended version of Tongyici Cilin, composed of 77,343 words [36]. In this thesaurus, words are organized into a five-layer tree structure. In the fifth layer, every category contains several primitive words, which are words that have only one meaning. Every group of primitive words has a five-layer code, which is represented by 11 letters and numbers. Tian et al. have provided a method to calculate the similarity between primitive words in Tongyici Cilin according to their codes [37].

The 40 adjectives in abbreviated POMS are expanded in Tongyici Cilin to obtain seed words. Considering that this study was executed based on a Chinese thesaurus, a similar study was repeated using an English thesaurus to demonstrate the process. The result is shown in Table 2.
Owing to the lack of updates for Tongyici Cilin and the diversity and variability of the network language, the vocabulary used by Weibo users is usually different from that used in written language. Thus, relying only on the general thesaurus cannot cover enough words under the specific network environment.

Table 2. POMS subscales, items, and seed words.
The word embedding method maps words into high-dimensional numeric vectors and has been applied in computing similarity, lexical analogy, and machine learning. Word2Vec is an open source natural language processing tool developed by Google in 2013, which translates texts into word vectors. Using a shallow neural network, Word2Vec reduces the processing of textual content to numeric operations in a high-dimensional vector space, where the similarity of word vectors represents the semantic similarity of the words. Based on the semantic relevance of words, this paper utilized Word2Vec to expand the word set from a Weibo corpus of 5 million texts [38]. Before training the Word2Vec model, a Chinese segmenting tool, Jieba, was used to segment the text into words, and the stop words were then removed. The Word2Vec model was trained with the following settings: the skip-gram model was adopted and trained with the hierarchical softmax method, the window size was set to 6, and the dimensionality of the feature vectors was 100.
Given two word vectors $a = (a_1, a_2, \dots, a_m)$ and $b = (b_1, b_2, \dots, b_m)$, where $m$ is the dimension of the word vector, the cosine similarity could be calculated as:
$$\mathrm{sim}(a, b) = \frac{\sum_{i=1}^{m} a_i b_i}{\sqrt{\sum_{i=1}^{m} a_i^2}\,\sqrt{\sum_{i=1}^{m} b_i^2}}$$
The numerator represents the dot product of the two word vectors and the denominator represents the product of their norms. This research set 0.8 as the threshold of the similarity value, and traversed the corpus to obtain the top-10 most similar words for each seed word. After filtering through the threshold, the extended words were added into the candidate word sets. Based on the new candidate word sets, the same procedure was carried out repeatedly until no new words were extracted. Finally, manual verification was adopted to remove words with large deviation or no meaning. As a result, approximately 3000 candidate words were selected and a mood state lexicon containing seven dimensions of mood state words (tension, anger, fatigue, depression, vigor, confusion, and esteem) was constructed.
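The iterative expansion described above can be sketched as follows. This is a minimal illustration, assuming the word vectors are already available as a plain dictionary; the function names, the toy two-dimensional vectors, and the use of "greater than or equal to" for the threshold are assumptions, and the manual verification step is not modeled.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_lexicon(seed_words, vectors, threshold=0.8, top_k=10):
    """Grow a word set from seeds until an iteration adds no new word."""
    lexicon = {w for w in seed_words if w in vectors}
    frontier = set(lexicon)
    while frontier:
        new_words = set()
        for word in frontier:
            # Rank every other vocabulary word by similarity to this word.
            scored = sorted(
                ((cosine_similarity(vectors[word], vectors[other]), other)
                 for other in vectors if other != word),
                reverse=True,
            )
            # Keep the top-k neighbours that also pass the threshold.
            for sim, other in scored[:top_k]:
                if sim >= threshold and other not in lexicon:
                    new_words.add(other)
        lexicon |= new_words
        frontier = new_words  # only newly added words are expanded next
    return lexicon


# Toy vectors (hypothetical); real vectors would have 100 dimensions.
toy_vectors = {
    "anxious": [1.0, 0.0],
    "restless": [0.9, 0.1],
    "uneasy": [0.8, 0.3],
    "jittery": [0.6, 0.6],
    "happy": [0.0, 1.0],
}
expanded = expand_lexicon(["anxious"], toy_vectors)
```

In this toy example, "jittery" is not similar enough to the seed itself but is reached in a second iteration through "uneasy", which is the effect of repeating the procedure on the new candidate words.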
The semantic distance between the words in tweets and the seed words in the lexicon was computed, so that the score of each word in each mood state dimension could be obtained. For every user, the word set of each tweet was represented as $Logwords = \{word_1, word_2, \dots, word_n\}$, and the seed word sets of the mood states were $L_{SeedWordsets} = \{L_{s1}, L_{s2}, \dots, L_{s7}\}$, where $L_{sj}$ contains the set of seed words in one dimension. By computing the similarity between each word in a tweet and the seed words and finding the maximum similarity in each dimension, a seven-dimension attribute vector for every word was obtained, denoted as $W_i = \{S_{i1}, S_{i2}, \dots, S_{i7}\}$, where $S_{ij} = \max_{s \in L_{sj}} \mathrm{sim}(word_i, s)$. For example, for the word "restless" in a tweet, the similarity between this word and each word in $L_{s1}$ is calculated. If the word "anxious" in $L_{s1}$ has the maximum similarity with "restless", then this similarity is taken as the value of the first dimension in the attribute vector, denoted as $S_{i1}$. The same procedure is repeated for the other dimensions.
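The per-word attribute vector described above can be sketched in a few lines. This is an illustration under stated assumptions: the helper names and toy vectors are hypothetical, only two mood dimensions are used instead of seven to keep the example short, and returning zeros for out-of-vocabulary words is a choice made here, not one stated in the study.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_attribute_vector(word, seed_sets, vectors):
    """Return [S_i1, ..., S_iJ]: max similarity to the seeds of each dimension."""
    if word not in vectors:
        return [0.0] * len(seed_sets)  # assumption: unknown words score zero
    scores = []
    for seeds in seed_sets:
        sims = [cosine_similarity(vectors[word], vectors[s])
                for s in seeds if s in vectors]
        scores.append(max(sims) if sims else 0.0)
    return scores


# Toy data (hypothetical): dimension 0 stands for tension, dimension 1 for anger.
toy_vectors = {
    "anxious": [1.0, 0.0],
    "nervous": [0.7, 0.7],
    "angry": [0.0, 1.0],
    "restless": [0.9, 0.1],
}
toy_seed_sets = [["anxious", "nervous"], ["angry"]]
restless_vector = word_attribute_vector("restless", toy_seed_sets, toy_vectors)
```

Here the first component of `restless_vector` comes from "anxious", the closest seed in the first dimension, mirroring the "restless"/"anxious" example in the text.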
For each tweet, this paper defined a seven-dimensional variable, $V_{text} \in \mathbb{R}^7$, as its sentiment-related feature, with components representing tension, anger, fatigue, depression, confusion, vigor, and esteem, respectively. A value function combined all the word attribute vectors contained in the text to obtain the final mood state score of the tweet.

The Construction and Evaluation of the Feature Set
This research collected 52 features of the Weibo users and divided them into four categories, which are listed in Table 3. The first category, denoted as $D_i$, covers basic information about the users, including the length of Weibo ID, gender, registration date, number of labels, and length of personal profile. The second category, denoted as $S_j$, contains the number of followers, number of follows, and total tweets, reflecting the social characteristics of users. The third category, denoted as $T_m$, describes properties of users' tweets, including the average length of tweets, the average number of uses of the @ symbol, the monthly average number of tweets, and other related features. The last category, denoted as $M_n$, contains the sequential sentiment features of Weibo users' blog text, namely the sentiment-related features of tweets in five specific periods: the last week, month, 3 months, 6 months, and 1 year. Therefore, the users' feature set can be represented as $MoodStates = \{D_i, S_j, T_m, M_n\}$.
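The sequential sentiment features $M_n$ can be illustrated with a short sketch that aggregates per-tweet mood vectors over the five periods. Several details are assumptions made for illustration only: the window lengths in days, the use of a simple mean as the aggregate, the zero vector for empty windows, and all function and variable names.

```python
from datetime import datetime, timedelta

# Assumed window lengths in days for the five periods named in the text.
WINDOWS_DAYS = {
    "1_week": 7, "1_month": 30, "3_months": 90, "6_months": 180, "1_year": 365,
}


def windowed_mood_features(tweets, reference_time, n_dims=7):
    """Average the mood vectors of tweets that fall inside each window.

    tweets: list of (timestamp, mood_vector) pairs for one user.
    Returns a dict mapping window name to a mean vector of length n_dims.
    """
    features = {}
    for name, days in WINDOWS_DAYS.items():
        start = reference_time - timedelta(days=days)
        selected = [vec for ts, vec in tweets if start <= ts <= reference_time]
        if selected:
            features[name] = [sum(col) / len(selected) for col in zip(*selected)]
        else:
            features[name] = [0.0] * n_dims  # assumption: empty window scores zero
    return features


# Hypothetical user with one recent tweet and one older tweet.
reference = datetime(2016, 6, 1)
toy_tweets = [
    (reference - timedelta(days=3), [1.0] * 7),
    (reference - timedelta(days=100), [0.0] * 7),
]
toy_features = windowed_mood_features(toy_tweets, reference)
```

With five windows and seven mood dimensions, this layout yields 35 sentiment-related values per user.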
The Pearson correlation coefficient was used in the process of feature extraction and was calculated as follows:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
where $\rho$ represents the degree of correlation between a feature $X$ of the Weibo users and their sentimental scores $Y$. If $\rho > 0$, the higher the value of the feature, the higher the user's sentimental score tends to be. The larger the value of $|\rho|$, the stronger the linear relationship between the feature and the sentimental score.
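For reference, the sample Pearson correlation coefficient can be computed as in the sketch below. The function name and the toy inputs are illustrative; the study's own feature values are not reproduced here.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    # The 1/n factors of covariance and variances cancel in the ratio.
    return cov / math.sqrt(var_x * var_y)


# Toy example: a feature that rises with the score, and one that falls.
rho_up = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
rho_down = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A feature would be retained when its coefficient with at least one mood dimension is statistically significant; the significance test itself is outside this sketch.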

Prediction of Mood States of Weibo Users Using Multitask Regression
Conventional network psychometric research usually adopts regression, neural networks, and other methods to model every prediction task separately, fitting the specific information of each regression task. Recent studies have shown that psychological variables are interrelated and that these interrelations play an important role in the joint prediction of different psychological variables [39]. Thus, this study adopted the multitask learning method to improve performance. The multitask learning method not only preserves the task-specific information, but also integrates the information of multiple tasks to establish a more effective prediction model for the calculation of the different dimensions of mood states.
In the process of predicting the Weibo users' mood states, the multitask regression method was introduced to build the calculation model. To predict mood state variables along eight sentimental dimensions (seven sub-dimensions and one total score), this paper set eight learning tasks. The number of regression tasks is denoted as $T$. For each task $t$, there is an independent training set $\{(x_{tn}, y_{tn})\}$, where $t = 1, 2, \dots, T$, $n = 1, 2, \dots, N$, and $N$ is the number of instances, assuming that all tasks have the same number of instances. The learning function is $f_t: \mathbb{R}^d \to \mathbb{R}$, and the training set satisfies $x_{tn} \in \mathbb{R}^d$ and $y_{tn} \in \mathbb{R}$. The linear function is $f_t(X) = w_t X$, where $w_t$ represents the model coefficient in task $t$, and $X$ is the input column vector. The loss function of task $t$ is $L_t(w_t, X, y_t)$, and the loss over all tasks is
$$L(W) = \sum_{t=1}^{T} L_t(w_t, X, y_t),$$
where $W \in \mathbb{R}^{T \times d}$ is a coefficient matrix with $w_t$ as the row vector and $w^k$ as the column vector, so that $w^k$ is the coefficient related to feature $k$.
To filter the global features, $w^k$ is constrained through the following regularization term:
$$\Omega(W) = \sum_{k=1}^{d} \lVert w^k \rVert_2 \tag{5}$$
The linear least squares fit (LLSF) is used for the fitting. The loss function is denoted as:
$$L_t(w_t, X, y_t) = \sum_{n=1}^{N} \left( y_{tn} - w_t x_{tn} \right)^2$$
Therefore, the objective function is:
$$\min_{W} \; \sum_{t=1}^{T} \sum_{n=1}^{N} \left( y_{tn} - w_t x_{tn} \right)^2 + \lambda \sum_{k=1}^{d} \lVert w^k \rVert_2 \tag{7}$$
Finally, it can be concluded that:
$$W^{*} = \arg\min_{W} \; \sum_{t=1}^{T} \sum_{n=1}^{N} \left( y_{tn} - w_t x_{tn} \right)^2 + \lambda \sum_{k=1}^{d} \lVert w^k \rVert_2$$
The parameter $\lambda > 0$ is the regularization parameter, controlling the trade-off between the regression loss and the size of the weight vector, as measured by the $\ell_2$ norm in Equations (5) and (7) [40,41]. The objective function is convex and can be minimized by standard methods such as coordinate descent [41].
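A regularized multitask least-squares problem of this kind can be sketched with NumPy as below. This is not the study's implementation: it assumes the penalty is the sum of the $\ell_2$ norms of the per-feature coefficient columns, that all tasks share the same input matrix, and it uses proximal gradient descent rather than the coordinate descent mentioned in the text. The function name, the synthetic data, and the value of `lam` are hypothetical.

```python
import numpy as np


def fit_multitask_l21(X, Y, lam, n_iter=1000):
    """Minimize ||Y - X W^T||_F^2 + lam * sum_k ||w^k||_2 by proximal gradient.

    X: (N, d) feature matrix; Y: (N, T) targets, one column per task.
    Returns W with shape (T, d): row t is w_t, column k is w^k.
    """
    n_tasks, n_features = Y.shape[1], X.shape[1]
    W = np.zeros((n_tasks, n_features))
    # Step size 1/L, where L = 2 * sigma_max(X)^2 bounds the gradient's slope.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * (X @ W.T - Y).T @ X      # gradient of the squared loss
        V = W - step * grad                   # plain gradient step
        col_norms = np.linalg.norm(V, axis=0)
        # Group soft-thresholding: shrink each feature column as a whole,
        # so a feature is kept or dropped jointly for all tasks.
        scale = np.maximum(0.0, 1.0 - step * lam / np.maximum(col_norms, 1e-12))
        W = V * scale
    return W


# Synthetic data (hypothetical): 3 tasks, 6 features, only features 0 and 1 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
W_true = np.zeros((3, 6))
W_true[:, 0] = [1.0, -2.0, 0.5]
W_true[:, 1] = [-1.5, 1.0, 2.0]
Y = X @ W_true.T + 0.01 * rng.normal(size=(200, 3))

W_hat = fit_multitask_l21(X, Y, lam=5.0)
selected = np.flatnonzero(np.linalg.norm(W_hat, axis=0) > 0).tolist()
```

Because the penalty acts on whole columns of `W_hat`, irrelevant features are driven to exactly zero for every task at once, which is the sense in which global features are filtered. In the study's setting, `Y` would hold the seven POMS sub-dimensions and the total score as eight columns.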

Results and Discussion
This paper used correlation analysis to select features that are closely related to mood states in order to establish the feature set. Through the feature extraction, 25 significant features were obtained, including 5 features from the first 3 categories and 20 sentiment-related features from the fourth category. Further, the non-sentimental features contained registration time, the number of users that one follows, the average number of comments, the number of tweets that are reposted by Weibo users, and the average comments of blogs that are reposted by Weibo users. The mood states feature comprises different textual sentiment-related features within 1 week, 1 month, 3 months, 6 months, and 1 year. This indicates that user's mood state is related to both the text expression and stable statistical characteristics. This endorses the psychological conclusion that the user's mood state is a kind of psychological characteristic that has a correlation between the personality and the affect [42,43], considering that studies have shown that the personality could be automatically acquired through users' social network behavior [28]. The result also indicates that up to 80% of the features are emotional features, demonstrating that text expression plays an important role in calculating users' mood state. Table 4 displays the correlation coefficients between the extracted features and the different dimension of mood states, including the total score of POMS. In addition, 1 week, 1 month, 3 months, 6 months, and 1 year represent the average correlation coefficient of the sentimental score of the text during these periods. These results show that the average repost number of blogs reposted by Weibo users has significant positive correlation with several mood state dimensions, including tension, anger, depression, confusion and total score, indicating an important non-sentimental feature. 
The dimensions of fatigue and depression are both negatively correlated with several sentimental scores of tweets in different time spans, while other dimensions do not show the same level of correlation. Figure 2 shows the correlation trend diagram regarding the mood states of tweets within each time span. The correlation coefficient trend of sentimental features within 1 year is clearly different from that of the other time spans, perhaps because 1 year is too long for users' mood state to remain stable. The sentimental features of tweets within 1 year have a much higher correlation with the dimensions of tension and anger, whereas for the dimensions of vigor and esteem, the sentimental features of tweets have a higher correlation within a much shorter time span.
This study employed 10-fold cross-validation to train and validate the model, and the root-mean-square error (RMSE), also called the root-mean-square deviation (RMSD), was used as the evaluation index. The RMSE is one of the most common evaluation indexes for regression models in machine learning. It is defined as follows:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \tag{9}$$
where $y_i$ represents the true value of sample $i$, $\hat{y}_i$ is the predicted value of sample $i$, and $n$ is the number of samples. Figure 3 shows the RMSE of the different regression models on the experimental data set.
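The evaluation procedure can be sketched as follows: an RMSE function and a simple generator of 10-fold train/test index splits. The function names are illustrative, and the contiguous, unshuffled folds are an assumption made for brevity; in practice the samples would usually be shuffled before splitting.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Example: splits for 132 users, and the RMSE of three toy predictions.
folds = list(k_fold_indices(132, k=10))
toy_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Each model would be fitted on the training indices of a fold and scored with `rmse` on the held-out indices, with the scores then averaged over the ten folds.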
It can be seen that the regression of the total score always reaches the largest RMSE in each regression model, since the total score is a comprehensive index and may accumulate more error. The Gaussian process has the largest prediction error on each task, followed by linear regression and Lasso regression. The prediction error of multitask regression is slightly less than that of the Lasso regression method on each task. For the average RMSE over multiple tasks, the values of the Gaussian process, linear regression, Lasso regression, and multitask regression are 23.919, 12.206, 8.336, and 6.273, respectively, with the multitask regression model being the smallest.
Table 5 shows the prediction error rate of different algorithms on different tasks. In each task, that is, for each dimension of mood state, the error rate of multitask regression is the smallest. Therefore, the accuracy of the multitask regression method is superior to that of the other algorithms, i.e., the cyber psychometrics method produces good validity and accuracy. This demonstrates that multitask regression could learn the correlation between different dimensions of mood state and outperform the single-task regression methods [44].
This study has demonstrated the usefulness of this automated identification method, which could thus be applied to identifying other psychological variables or preferences of social network users using the rich UGC data in the social network. Businesses and policy-makers could also possibly make use of the predictions to support decision-making.

Conclusions
This study proposed a cyber psychometrics method to calculate the mood states of social network users automatically, adopting natural language processing and machine learning methods. Some conclusions can be drawn through analysis and discussion.
Through the correlation analysis of users' POMS scores and their sentiment expressed using Weibo text, this study found that the user's mood state is not only related to the text expression, but also to stable statistical characteristics. Users' emotional expressions on the micro-blog text played an important part in the process of obtaining their true mood states. The correlation analysis results also showed that sentiment-related feature sets of different time span have different correlation characteristics with users' mood states, and the emotional features for 1 year evidently differ from those of other timespans.
Compared with other classic single-task machine learning methods, the cyber psychometrics method adopting multi-task regression performed better. The experiment compared the multi-task learning method and the classic single-task regression algorithms, and it was found that the multi-task machine learning method can enhance the performance in RMSE and error ratio by learning the correlation among different dimensions of mood states.
This method could be applied in psychological measurement and the prediction of the behavior of social network users. Such an automatic calculation method may be widely applied in predicting psychological variables and other characteristics of users considering the rapid development of internet, especially the social network. However, there were still some limitations in that the experimental subjects were college students of similar ages, and the characteristics of mood states concerning those with different identities and ages were not considered. In addition, this paper constructed a mood state lexicon (about 3000 words) to extract the users' sequential sentiment-related features. The scale is relatively small and the feature extraction process for the mood states of users under large-scale data requires further research.