Identification and Prediction of Human Behavior through Mining of Unstructured Textual Data

Abstract: The identification of human behavior can provide useful information across multiple job spectra. Recent advances in applying data-based approaches to social sciences have increased the feasibility of modeling human behavior. In particular, studying human behavior by analyzing unstructured textual data has recently received considerable attention because of the abundance of textual data. The main objective of the present study was to discuss the primary methods for identifying and predicting human behavior through the mining of unstructured textual data. Of the 823 articles analyzed, 87 met the predefined inclusion criteria and were included in the literature review. Our results show that the included articles could be symmetrically classified into two groups. The first group of articles attempted to identify the leading indicators of human behavior in unstructured textual data. In this group, the data-based approaches had three main components: (1) collecting self-reported survey data, (2) collecting data from social media and extracting data features, and (3) applying correlation analysis to evaluate the relationship between two sets of data. In contrast, the second group focused on the accuracy of data-based approaches for predicting human behavior. In this group, the data-based approaches could be categorized into (1) approaches based on labeled unstructured textual data and (2) approaches based on unlabeled unstructured textual data. The review provides a comprehensive insight into unstructured textual data mining to identify and predict human behavior and personality traits.


Introduction
Human behavior is a complex phenomenon [1][2][3][4][5]. However, one of the basic assumptions in human behavior is that each person can be described by a set of characteristics that is stable and does not change over time [6]. This set of stable elements has been conceptualized primarily as a personality, and people's differences in social behaviors have been conceptualized as personality traits [5]. However, the concept of personality has been reported to not fully explain human behavior, and people tend to behave unstably in different situations [5].
Sticha et al. [7] have indicated that two main elements of human behavior are (1) personal characteristics and qualities, and (2) the demands of a situation. However, the relationships among tendencies. In terms of job performance, the conscientiousness dimension is the best indicator of performance in every job type [20].
• Neuroticism: The fourth trait is neuroticism, or being unstable, worried, tense, touchy, anxious, and self-pitying. Neuroticism involves danger sensitivity and psychological distress tendencies, and is associated with more anxiety than the other traits [21]. People high in this trait show low emotional stability and lower stress tolerance, and tend to experience negative emotions [22].
• Openness to experience: The fifth trait is openness to experience, or being original, widely interested, imaginative, insightful, artistic, and curious. It involves a willingness to think about other options and alternatives, and a tendency toward curiosity [23]. In terms of job performance, the openness to experience dimension is a good indicator of training proficiency [20].
The main objective of the present study was to review and discuss the primary methods for identifying and predicting human behavior through the mining of unstructured textual data. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) were selected as structured guidelines to ensure reliable and meaningful study results [26]. This review is symmetrically structured as follows. The methodology section discusses the inclusion and exclusion criteria, and the risk of bias. The results section provides the outputs of the literature search. The discussion section describes data-based approaches among selected records.

Methodology
The PRISMA guidelines were followed for this systematic literature review [26]. Three main steps of the systematic review are developing the research question, determining the search strategy, and addressing the risk of bias [26]. The research question was formulated as follows: RQ. What are the main approaches for identifying and predicting human behavior through the mining of unstructured textual data?
To answer the above research question, we developed a search strategy to identify and review all relevant scientific articles. The search strategy included (1) defining keywords and identifying all relevant materials, (2) removing duplicates, (3) filtering the remaining articles, as performed by three authors through reading titles, abstracts, and in some cases full text, and (4) resolving conflicts through meetings with other authors [26,27].
Symmetry 2020, 12, 1902
The first step in this review was developing keywords. According to the stated research question, the keywords were divided into three groups. The first group comprised human behavior and personality traits. The second group comprised data-based approaches, multivariate analysis, big data methods, artificial intelligence, and machine learning. The third group comprised textual data, textual features, and textual indicators. A combination of three groups was used as keywords in searching the articles, as represented in Table 1. Web of Science, IEEE Xplore, and Science Direct were used as database search tools for this review.
Table 1. The keywords used in the present review.

Row | Keywords
Test set 1 | "human behavior" OR "personality traits"
Test set 2 | "data-based approach" OR "multivariate analysis" OR "big data methods" OR "artificial intelligence" OR "machine learning"
Test set 3 | "textual data" OR "textual feature" OR "textual indicator"
Search 1 | #1 AND #2 AND #3
Searching with these keywords identified 823 articles with relevant content, which were added to the main database. After developing the main database and identifying relevant articles, we applied a formal screening process to the database on the basis of the exclusion and inclusion criteria. The inclusion criteria were articles associated with the objective and research question, articles written in English, and articles published between 2000 and 2019. The exclusion criteria were articles written in other languages, book chapters or articles from secondary sources that were not free or open access, letters, newspaper articles, viewpoints, presentations, anecdotes, duplicated studies, and posters. Screening the titles, abstracts, conclusions, and keywords of the records that remained after removing duplicates (n = 591) resulted in the exclusion of 504 articles. Among the excluded papers, 226 were not associated with the objective and research question, 93 focused on opinion and sentiment analysis, 62 were published before 2000, 56 related to mobility behavior and learning behavior, 48 were book chapters, letters, newspaper articles, viewpoints, presentations, anecdotes, duplicated studies, or posters, and 19 were in other languages. The remaining articles (n = 87) were read in full against the eligibility principle, and all were included. Among the included articles, 17 were identified from IEEE Xplore, 21 from Web of Science, and 49 from Science Direct.
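The search logic in Table 1 and the screening counts reported above can be checked with a short sketch. The variable and function names are illustrative, and the generated string is a generic Boolean query; the exact query syntax accepted by each database is not given in the text and would need to be adapted.

```python
# Keyword groups copied from Table 1. Names below are illustrative only.
TEST_SETS = {
    1: ["human behavior", "personality traits"],
    2: ["data-based approach", "multivariate analysis", "big data methods",
        "artificial intelligence", "machine learning"],
    3: ["textual data", "textual feature", "textual indicator"],
}

def build_query(test_sets):
    """OR the terms inside each test set, then AND the sets (#1 AND #2 AND #3)."""
    groups = []
    for key in sorted(test_sets):
        terms = " OR ".join(f'"{term}"' for term in test_sets[key])
        groups.append(f"({terms})")
    return " AND ".join(groups)

# Screening counts as reported in the text.
RECORDS_AFTER_DEDUP = 591
EXCLUDED = {
    "not associated with objective and research question": 226,
    "opinion and sentiment analysis": 93,
    "published before 2000": 62,
    "mobility behavior and learning behavior": 56,
    "excluded publication types": 48,
    "other languages": 19,
}
INCLUDED_BY_DATABASE = {"IEEE Xplore": 17, "Web of Science": 21, "Science Direct": 49}

query = build_query(TEST_SETS)
excluded_total = sum(EXCLUDED.values())
included = RECORDS_AFTER_DEDUP - excluded_total

print(query)
print(excluded_total, included)
```

The exclusion reasons sum to the reported number of excluded records, and the remainder matches the per-database totals of included articles.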
The articles selected over time and the PRISMA guidelines are shown in Figures 1 and 2, respectively.
The risk of bias has been divided into external and internal biases. External bias can occur through (1) applying inclusion/exclusion criteria and (2) identifying aspects of human behavior, types of data-based approaches, and textual features. Internal bias relates to assessing the quality of the research among the selected articles. To address the first type of bias, three researchers separately reviewed the title, abstract, and conclusion to select the appropriate articles for full-text review. They compared the selected articles to produce a unified list. After analyzing the selected articles, the authors determined whether the article was appropriate for inclusion. The authors agreed on each article's inclusion before its addition to the main database. Disagreements among the three authors as to an article's inclusion or exclusion were resolved in sessions with other authors. In the next step, three authors separately summarized the data-based approaches and textual features in the selected articles, then compared the results and resolved disagreements by consulting with other authors.
For article quality assessment, the National Heart, Lung, and Blood Institute (NHLBI) Quality Assessment Tool for Observational Cohort and Cross-Sectional Studies was used [28]. This methodology was validated by Frost and Rickwood [29], who reported good agreement among independent evaluators regarding classification into good/fair/poor categories of quality. In the present study, following the protocol reported in [30], three researchers independently assessed the methodological quality of the reviewed articles, compared the results, and resolved disagreements by consulting with other authors. Of the 87 articles selected for this study, 63 were observational and cross-sectional studies. These articles were classified as intermediate quality (63%), poor quality (28%), or high quality (9%). A lack of exposure assessment, inadequate blinding of outcome assessors, and small samples that overrepresented young students were the main limitations of the articles classified as "poor" quality.

Results
All identified articles were categorized and stored in the main database according to year, source of publication, data-based approach, source of input data, and textual features. On the basis of these elements, selected articles were categorized as (1) articles that attempted to identify the main indicators of human behavior in unstructured textual data, and articles that focused on developing more accurate data-based approaches to better predict human behavior; (2) articles with manual feature selection and articles that used different computerized techniques for feature extraction and selection; and (3) articles that designed and developed new models for detecting human behavior, and articles that used well-known data-based approaches. The list of included articles with data-based approaches is shown in Table 2. The reviewed articles were published in 39 journals (Figure 3).

[Table 2. Methods for predicting human behavior. Recovered rows: Fatima et al. [9]: decision trees, random forest-based method, and support vector machine classifier; Gao: numeric prediction models including linear regression, Reptree, and decision tables.]
To obtain a better picture, we analyzed the titles and abstracts of the included articles. Figure 4 shows a co-occurrence map based on the analysis of abstracts in the form of a bubble chart. To develop this figure, the TextRank algorithm (implemented in Gensim 3.8.3 [https://pypi.org/project/gensim/]) was used to extract the main keywords from the abstracts of the included papers. The keywords were fed into VOSviewer software to visualize the result. In Figure 4, the nodes correspond to specific textual terms, and their sizes represent the frequency of occurrence. The co-occurrence of the textual terms in different publications is represented by a link between two nodes. Frequently co-occurring textual terms create clusters and appear closer to each other with the same color. Figure 4 reflects the main cluster (purple color) with terms such as extraversion, agreeableness, openness, conscientiousness, and neuroticism. The next cluster (green color) contains several items such as user personality, social network, Facebook user, and questionnaire.
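The two quantities that the co-occurrence map encodes, node size (term frequency) and links (co-occurrence of two terms in the same publication), can be illustrated with a minimal sketch. It assumes that keywords have already been extracted for each abstract; the keyword lists below are hypothetical, and the sketch does not reproduce TextRank or the VOSviewer layout and clustering.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(keyword_lists):
    """Count, per document, how often each term and each pair of terms occurs.

    Returns (term_freq, pair_freq): term_freq gives the node size,
    pair_freq gives the link strength between two nodes.
    """
    term_freq = Counter()
    pair_freq = Counter()
    for keywords in keyword_lists:
        # Count each term once per document and order pairs consistently.
        unique = sorted({keyword.lower() for keyword in keywords})
        term_freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return term_freq, pair_freq

# Hypothetical keyword lists, one per abstract.
abstract_keywords = [
    ["extraversion", "neuroticism", "Facebook user"],
    ["extraversion", "openness", "questionnaire"],
    ["Facebook user", "questionnaire", "extraversion"],
]

term_freq, pair_freq = cooccurrence(abstract_keywords)
print(term_freq.most_common(3))
print(pair_freq.most_common(3))
```

In this toy input, "extraversion" appears in all three lists and is linked twice to both "facebook user" and "questionnaire", which is the kind of structure a clustering tool would group together.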

Discussion
Various approaches have been reported for identifying and predicting human behavior through the mining of unstructured textual data. A primary concept is that textual features are significantly correlated with individuals' behaviors and qualities [8]. These methods can be categorized according to elements, including data sources, feature sets, and techniques.
In social networks, individuals reveal substantial information regarding different topics and items in the form of status updates, self-descriptions, and interests [21]. The raw data in social media include religion, educational history, user name, birthday, gender, relationship status, hometown, personal written information, and lists of favorite things. The main source for social network data is Facebook, which provides useful information for identifying human behavior. On Facebook, individuals reveal their identification and opinions by expressing a variety of aspects of themselves [46]. On the basis of data from Facebook, life outcomes, socio-economic status, disorder behaviors, mobility behaviors, and cultural preferences can be predicted [46]. Other data sources for human behavior studies are Twitter, LinkedIn, Myspace, Foursquare, and mobile phone data.
Feature extraction and selection are important steps, which involve removing unnecessary words and information from documents and building derived values to facilitate subsequent interpretation and learning [108]. Features in the selected articles can be divided into two groups: (1) pre-defined, manually selected features, and (2) features obtained with different methods for feature extraction and selection. The main types of features among the included articles are as follows:
• Facebook's pre-defined features include personal information, work information, contact information, education, time spent on Facebook, frequency of use, number of statuses, number of friends, number of groups, number of likes, number of photos, and number of tags.
• In other social media, pre-defined features include personal information and time spent on Instagram, Sina Weibo, or LinkedIn.
• Latent Dirichlet Allocation (LDA) features are based on assigning topics to documents; LDA generates topic distributions over words given a collection of texts [110].
• Linguistic features are based on pre-processing (removing stop words, stemming, and word segmentation tools) and semantic analysis.

Identifying Human Behavior
Multiple articles have attempted to identify the main indicators of human behavior in unstructured textual data. The data-based approaches in these articles have three main parts: (1) collecting self-reported survey data, (2) collecting social media data and extracting features, and (3) analyzing the relationship between the two sets of data, as shown in Figure 5.
Text messages in social media can be good indicators of users' personality. Adali and Golbeck [31] have extracted network bandwidth and message content features from Facebook and Twitter data, and analyzed the correlation between these features and data from the Big Five Inventory [31]. The study concluded that linguistic features are useful in identifying personality. In another study, Annisette and Lafreniere [35] have tested the shallowing hypothesis, in which constantly using social networking sites can lead to a significant decrease in daily reflective thought. Participants were asked to complete five measures: a measure of texting and social media use; the 44 items of the Big Five Inventory to assess the levels of the five personality dimensions; the 58 items of the life goals inventory to assess life goals; the 12-item reflection scale from the Rumination-Reflection Questionnaire to assess tendencies to engage in self-reflective states; and a demographic questionnaire to assess participants' background information. The study used correlation analysis and concluded that participants who constantly use social networking sites place less value on life goals [35]. The relationship between text messages and human behavior has been investigated by other researchers. Holtgraves [56] has studied the relationship between language variances in text messaging and personality traits. Maria Balmaceda et al. [71] have investigated users' personality by evaluating text messages in social networks, then verified the stability of the identified personality. Panicheva et al. [80] have investigated the link between the dark triad personality traits and Russian linguistic features in social networking texts.
Smartphone data can be indicators of personality. Chittaranjan et al. [47,48] have collected data from smartphone users in three categories, namely call, text, and application, to automatically extract different features regarding application and communication usage on the phone. The study applied correlation analysis between these features and data collected with an online ten-item personality inventory questionnaire.
Multiple articles have indicated that Facebook data can be significant indicators of user personality. Amichai-Hamburger and Vinitzky [23] have used analysis of covariance (ANCOVA) to investigate the link between personality traits and Facebook features [23]. The study was conducted in three sequential research phases: (1) evaluating personality by asking participants to complete the Revised NEO Personality Inventory as a self-reported measure; (2) collecting users' information from Facebook and dividing it into four dimensions of basic information, contact information and education, personal information, and work information; and (3) applying ANCOVA. The study developed several hypotheses about the relationship between traits of the Big Five model and behavior on Facebook, and highlighted a strong connection between Facebook behavior and the personality of users [23]. In this regard, Bachrach et al. [38] have shown a significant relationship between user personality and the information on their Facebook profiles, such as the number of uploaded photos, size of their friendship network, number of events attended, density of their friendship network, and number of group memberships. The study used a dataset containing the Facebook profiles and personality profiles of 180,000 users and applied correlation analysis to evaluate the link between personality and Facebook content [38]. On this topic, Schwartz et al. [91] have surveyed 75,000 volunteers, extracted 700 million words and phrases from the participants' Facebook messages, and used correlation analysis to identify the main indicators of personality traits in Facebook messages.
Beyond Facebook content, Facebook status updates [101] and Facebook usage [93,94,97,99], as measured by the Facebook intensity scale and the Facebook use scale, have been used to detect personality. On this topic, Jenkins-Guarnieri et al. [59] have collected data from 463 participants and investigated the relationship between Big Five personality traits and Facebook usage [59]. Moore and McElroy [72] have surveyed Facebook users and used Facebook data to determine why some users are more active than others. The authors have indicated that personality can explain differences among Facebook users [72].
Several articles have used other methods for collecting personality data rather than conducting questionnaire-based surveys. Kern et al. [64] have collected millions of posts from Facebook users and used the MyPersonality application to evaluate personality scores. The study used correlational analysis to examine the relationship between the extracted features from posts and personality scores and accordingly distinguished words and phrases representing each Big Five personality trait.
Data from other social media can be good indicators of human behavior. Quercia et al. [87,88] have used Pearson product-moment correlation to investigate the relationship between personality traits and different characteristics of Twitter users, such as the numbers of users followed and of followers. In another study, Krämer and Winter [67] have investigated the relationship between personality traits and the manner of self-presentation and self-esteem in social media. Participants were asked to complete two measures: the Revised NEO Personality Inventory to measure extraversion, and Mielke's questionnaire to measure self-presentation. The study used multivariate analysis of variance to analyze the relationship between extraversion and self-presentation.
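The correlation step shared by many of the studies above can be illustrated with a minimal sketch of the Pearson product-moment correlation between one extracted feature and one self-reported trait score. The data values and variable names are hypothetical and are not taken from any of the reviewed studies.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: one social media feature and one questionnaire score per user.
followers_count = [5, 12, 20, 33, 41]
extraversion_score = [2.1, 2.8, 3.0, 3.9, 4.4]

r = pearson_r(followers_count, extraversion_score)
print(round(r, 3))
```

In practice, a study would compute one coefficient per feature and trait pair and report its significance, which this sketch omits.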
For identifying human behavior on the basis of unstructured textual data, it is important to remember that (1) there are no significant differences between predicting all personality traits of users at once and identifying each personality trait separately; (2) selecting only correlated features does not necessarily improve the performance of the predictive model; and (3) discovering the smallest feature set that does not decrease the performance of the human behavior predictive model is a main goal for future research [8].

Predicting Human Behavior
Most articles have focused on developing more accurate models to predict human behavior through the mining of unstructured textual data. The included articles used different data-based approaches to accurately predict human behavior, as shown in Figure 6.
Several studies have developed language-independent methods for predicting human behavior from unstructured textual data. Alsadhan and Skillicorn [34] have developed an approach based on word counts to predict both the Big Five and the Myers-Briggs personality traits from small amounts of text [34]. The proposed method is language independent, does not require particular lexicons, and has been successfully applied to different languages [34]. To develop this method, the 1000 most frequent words labeled with Big Five personality traits and Myers-Briggs personality types were selected (without removal of stop words or performing stemming) to build a model for each personality trait. The developed models were compared with posts and tweets on Facebook and Twitter to predict user personality [34]. On this topic, Pramodh and Vijayalata [83] have predicted the Big Five personality traits of authors through their writings and essays. First, two datasets containing positive and negative terms corresponding to each Big Five personality trait were created. For predicting personality after collecting data, pre-processing including tokenizing the input textual data, removing stopwords, stemming, scaling, and scoring was performed [83]. In the final step, the stemmed data were compared with the datasets, and the matched percentages of the data were calculated [83].
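The lexicon-matching procedure described for Pramodh and Vijayalata [83] (tokenizing, removing stopwords, comparing the remaining terms with positive and negative term lists, and computing matched percentages) can be sketched as follows. The stopword list, the term lists, and the sample sentence are hypothetical, stemming and scaling are omitted for brevity, and the scoring rule is an assumption rather than the authors' exact formula.

```python
import re

# Hypothetical stopword list and trait term lists (not from the cited study).
STOPWORDS = {"the", "a", "an", "and", "i", "to", "of", "is", "am", "my", "with"}
TRAIT_TERMS = {
    "extraversion": {
        "positive": {"party", "friends", "talk", "excited"},
        "negative": {"alone", "quiet", "shy"},
    },
}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]

def match_percentages(text, trait_terms):
    """Percentage of tokens that match the positive and negative terms of each trait."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid division by zero for empty input
    result = {}
    for trait, lists in trait_terms.items():
        positive = sum(token in lists["positive"] for token in tokens)
        negative = sum(token in lists["negative"] for token in tokens)
        result[trait] = {
            "positive": 100 * positive / total,
            "negative": 100 * negative / total,
        }
    return result

SAMPLE = "I am excited to talk with my friends, and the party is not quiet."
scores = match_percentages(SAMPLE, TRAIT_TERMS)
print(scores)
```

A plain token match such as this ignores negation ("not quiet" still counts toward the negative list), which is one reason the cited pipeline includes further scaling and scoring steps.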
Several studies have used language-independent features in developing their data-based approaches. Celli et al. [43][44][45] have developed an unsupervised personality recognition system using language-independent features to predict Big Five personality traits from unstructured textual data. For developing language-independent features, the studies have used a list of pre-defined linguistic cues from published research [43,45]. A processing pipeline has been developed, including preprocessing, processing, and evaluation modules [43,45]. In the preprocessing module, the average occurrence of each feature is determined by randomly selecting samples of posts [43,45]. By matching features and using correlations, the system creates one personality hypothesis per post [43,45]. For a single user, the system evaluates all hypotheses created for all posts and generates one personality model per user with a confidence level [43,45]. The proposed system has been tested on English and Italian Twitter posts and shown to have an accuracy of approximately 65.0.
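The last step of this pipeline, turning per-post hypotheses into one model per user with a confidence level, can be sketched as follows. The text does not specify the aggregation rule, so the majority vote, the confidence defined as the share of agreeing posts, and the +1/0/-1 encoding of a hypothesis are all assumptions made for illustration.

```python
from collections import Counter

TRAITS = ("extraversion", "agreeableness", "conscientiousness",
          "neuroticism", "openness")

def aggregate_user_model(post_hypotheses):
    """Combine per-post hypotheses into one (label, confidence) pair per trait.

    Each hypothesis maps a trait to +1 (high), -1 (low), or 0 (no evidence);
    missing traits count as 0. The label is the majority vote over posts, and
    the confidence is the fraction of posts that agree with it (assumed rule).
    """
    if not post_hypotheses:
        raise ValueError("at least one post hypothesis is required")
    model = {}
    for trait in TRAITS:
        votes = Counter(hypothesis.get(trait, 0) for hypothesis in post_hypotheses)
        label, count = votes.most_common(1)[0]
        model[trait] = (label, count / len(post_hypotheses))
    return model

# Hypothetical hypotheses for three posts by one user.
posts = [
    {"extraversion": 1, "neuroticism": -1},
    {"extraversion": 1, "neuroticism": 0},
    {"extraversion": -1, "neuroticism": -1},
]

user_model = aggregate_user_model(posts)
print(user_model)
```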

Predicting Human Behavior
Most articles have focused on developing more accurate models to predict human behavior through the mining of unstructured textual data. The included articles used different data-based approaches to accurately predict human behavior, as shown in Figure 6. Several studies [75,107] have described improperly labeled samples in published research and have developed semi-supervised learning methods to evaluate personality traits by using unlabeled samples. Nie et al. [75] extracted 47 features for each user in the categories of personal profile, social circles, social activities, and social habits. Next, stepwise regression was used to select the important features for each personality trait. The study then conducted a small survey and calculated the score for each personality trait from completed Big Five personality questionnaires. Two datasets were developed: (1) a small labeled dataset $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, where $X_i$ is the feature vector and $Y_i$ is the personality score for user $i$, and (2) the main unlabeled dataset $\{X_{n+1}, \ldots, X_m\}$. Finally, a local linear kernel regression algorithm and a local linear semi-supervised regression algorithm were used to predict personality in the unlabeled dataset.
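To make the idea of local linear kernel regression concrete, the following sketch predicts a score at a query point by fitting a kernel-weighted straight line to the labeled samples around it. It is an illustration under strong assumptions rather than the method of Nie et al. [75]: the feature is a single scalar instead of a feature vector, the Gaussian kernel and the `bandwidth` parameter are choices made here, and the semi-supervised variant that also exploits the unlabeled samples is not reproduced.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the normalizing constant cancels out."""
    return math.exp(-0.5 * u * u)


def local_linear_predict(x0, xs, ys, bandwidth=1.0):
    """Fit a kernel-weighted line around x0 and return its value at x0.

    Minimizes sum_i w_i * (y_i - a - b * (x_i - x0))**2 and returns a.
    """
    w = [gaussian_kernel((x - x0) / bandwidth) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    denom = s0 * s2 - s1 * s1
    if abs(denom) < 1e-12:
        # Degenerate design (e.g. all x equal): fall back to the weighted mean.
        return t0 / s0
    # Closed-form intercept of the weighted least-squares line.
    return (s2 * t0 - s1 * t1) / denom


def predict_unlabeled(labeled, unlabeled, bandwidth=1.0):
    """Predict a score for every unlabeled feature value."""
    xs = [x for x, _ in labeled]
    ys = [y for _, y in labeled]
    return [local_linear_predict(x0, xs, ys, bandwidth) for x0 in unlabeled]
```

Here `labeled` plays the role of the small labeled dataset and `unlabeled` the role of the main unlabeled dataset, for example `predict_unlabeled([(0, 1), (1, 3), (2, 5)], [0.5, 1.5])`. Extending the sketch to feature vectors would require solving a small weighted least-squares system at each query point.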
Most included articles have developed supervised machine learning-based models to predict human behavior from unstructured textual data. Although many published articles have predicted aspects of human behavior from data extracted from user profiles, texts, and tweets, several studies have attempted to predict human behavior from groups of labeled texts and tweets without taking user profiles into account. For example, Lima and de Castro [69,70] have developed a multi-label classifier algorithm based on the naïve Bayes algorithm to predict personality in texts and tweets [69].
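One common way to obtain a multi-label classifier from naïve Bayes is to train one binary model per label, an approach often called binary relevance. The sketch below follows that route with add-one smoothing. It is an assumption-laden illustration and not the algorithm of Lima and de Castro [69,70], whose details are not given in this review; the class `BinaryNB`, the helper functions, and the toy data in the usage note are hypothetical.

```python
import math
from collections import Counter


class BinaryNB:
    """Multinomial naive Bayes for one yes/no label, with add-one smoothing."""

    def fit(self, docs, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
            self.n_docs[label] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def log_score(self, tokens, label):
        total_docs = self.n_docs[True] + self.n_docs[False]
        # Smoothed class prior.
        score = math.log((self.n_docs[label] + 1) / (total_docs + 2))
        total_words = sum(self.counts[label].values())
        for t in tokens:
            if t in self.vocab:  # ignore words never seen in training
                score += math.log(
                    (self.counts[label][t] + 1) / (total_words + len(self.vocab))
                )
        return score

    def predict(self, tokens):
        return self.log_score(tokens, True) > self.log_score(tokens, False)


def fit_multilabel(docs, label_sets, traits):
    """Train one binary model per trait (binary relevance)."""
    return {t: BinaryNB().fit(docs, [t in s for s in label_sets]) for t in traits}


def predict_multilabel(models, tokens):
    """Return the set of traits whose binary model votes 'yes'."""
    return {t for t, m in models.items() if m.predict(tokens)}
```

As a usage example, the models can be trained on tokenized texts such as `[["party", "friends", "fun"], ["worry", "afraid", "alone"]]` with label sets `[{"extraversion"}, {"neuroticism"}]`, after which `predict_multilabel(models, tokens)` returns every trait assigned to a new text. Because each trait is modeled in isolation, this design ignores dependencies between traits.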
Most studies have predicted personality traits through independent binary classification (modeling each personality trait in isolation). However, Iacobelli and Culotta [58] have used conditional random fields and structured classification to model the dependencies between personality traits. The authors have concluded that there is a correlation between neuroticism and agreeableness traits and that considering this correlation in a classification model can help improve accuracy for classification of the agreeableness trait [58].
Studies have additionally developed supervised learning methods to predict human behavior on the basis of labeled data and user information, without considering dependencies between personality traits. For example, Park et al. [81], noting the small sample sizes of previous articles, developed a model based on a sample of 66,000 participants. The authors collected data from Facebook users together with their Big Five personality trait questionnaires [81]. The study generated thousands of linguistic features, including multiword phrases, single words, and clusters of semantically related words [81]. After applying a variety of dimensionality-reduction methods, the authors used a regression model to predict personality traits [81]. The main differences between articles were (1) feature selection and extraction, and (2) classification method, as shown in Table 3.

Conclusions
Identification of human behavior can provide valuable information across multiple job spectra, including sales (developing recommendation systems), hiring (predetermining potential from resumes and writing samples), marketing (improving stakeholder management and enhancing individuals' communication strategies), negotiations (analyzing a rival or head of a successful organization), detecting terrorists and criminals, discovering depression and disorders, jury selection, and creating personal and professional relationships. This article provides a systematic review of the published research relevant to identifying and predicting human behavior through the mining of unstructured text data. A total of 87 published articles that met the predefined inclusion criteria were included in the review. The following research question has been explored: RQ. What are the main approaches to identify and predict human behavior through the mining of unstructured textual data?
Based on collected data, all reviewed articles were divided into two categories: (1) articles that attempted to establish a clear connection between textual data features and aspects of human behavior, and (2) articles that focused on developing more accurate data-based approaches to predict human behavior. In the first category, data-based approaches had three main parts: (1) collecting self-reported survey data, (2) collecting data from social media and extracting different textual features, and (3) using correlation analysis to evaluate the connection between two sets of data. In the second category, different data-based approaches were used to predict human behavior in unstructured textual data. These data-based approaches can be divided into two categories: (1) methods based on labeled unstructured textual data and (2) methods based on unlabeled unstructured textual data. In methods based on labeled data, human behavior can be predicted according to data extracted from user profiles or from groups of texts and tweets. In addition, methods based on unlabeled data can be divided into semi-supervised learning methods and unsupervised learning methods. Extracted features in selected articles include Facebook's pre-defined features, Twitter's pre-defined features, pre-defined features of other social media, LIWC features, NLTK features, word frequency-based features, character frequency-based features, TFIDF features, LDA features, and part-of-speech tagging features. The main sources of input data include Facebook, Twitter, LinkedIn, Myspace, Foursquare, and mobile phone data.
This systematic literature review has some limitations. One of the main limitations is the timeframe for article discovery and the publication window of the included articles. Article discovery was finished at the end of August 2019, and only articles published between 2000 and August 2019 were included. The second limitation is the inability to discover and include some relevant papers because of the inclusion and exclusion criteria, the limited number of keywords, and the limited number of databases searched. Consequently, under the developed research strategy, some highly cited articles that applied deep learning-based personality detection were not included in this literature review. For example, Majumder et al. [111] used a deep convolutional neural network to extract features from essays. Sun et al. [112] developed a model that fuses bidirectional long short-term memory networks with a convolutional neural network to predict users' personality from the structure of texts. Su et al. [113] attempted to infer personality in a dyadic conversation by using a recurrent neural network (for modeling the short-term temporal evolution of a dialog) and a coupled hidden Markov model for predicting the personalities of the two speakers [114].
It should be noted that deep learning techniques have performed effectively in predicting human behavior from textual data. Because vast amounts of unstructured textual data are being generated, these techniques, with their new and complex architectures, are likely to gain momentum in the near future. In addition, numerous other fertile research areas can be studied through the mining of textual data. Such areas include human behavior during disasters and the impact of behavioral signatures [115], human behavior concerning resource management [116], patient behavior [117], and big data retrieval in social action [118].