CRANK: A Hybrid Model for User and Content Sentiment Classification Using Social Context and Community Detection

Abstract: Recent works have shown that sentiment analysis on social media can be improved by fusing text with social context information. Social context is information such as relationships between users and interactions of users with content. Although existing works have already exploited the networked structure of social context by using graphical models or techniques such as label propagation, more advanced techniques from social network analysis remain unexplored. Our hypothesis is that these techniques can reveal underlying features that help with the analysis. In this work, we present a sentiment classification model (CRANK) that leverages community partitions to improve both user and content classification. We evaluated this model on existing datasets and compared it to other approaches.


Introduction
The state-of-the-art in the field of sentiment analysis has improved considerably in recent years, partly due to the advent of social media. Social media text imposes several limitations that are hard to overcome even for human annotators, such as the extensive use of annotations, jargon, and heavy reliance on context. Moreover, understanding a piece of content often requires following a conversation (i.e., a thread of replies) or the style and stance of the author of the content.
To overcome these limitations, new approaches are starting to combine text with additional information from the social network, such as links between users and previous posts by each user. The blend of all this information can be referred to as social context. A recent work [1] analyzed the use of social context in the sentiment analysis literature, and it showed that context-based approaches performed better than traditional analysis without social context (i.e., contextless approaches). It also provided a taxonomy of approaches based on the types of features included in the context: contextless approaches do not use social context at all; micro approaches only use features from the users and their content; meso approaches include features from other users and content, as well as connections between different users and content; and macro approaches also exploit other sources such as knowledge graphs. Meso approaches are further divided into three categories: $meso_r$ only uses relations (e.g., follower-followee); $meso_i$ adds interactions (e.g., replies and likes); and $meso_e$ uses Social Network Analysis (SNA) techniques to process other elements of the context and generate additional features. Comparing the performance of existing approaches seems to show that more elaborate features provide an advantage over simpler features. Simpler features are those directly extracted from the network, such as follower-followee relations ($meso_r$). More complex features can be obtained from applying further processing, typically through filtering and aggregating information from the network.
[...] complexity lies in extracting useful features from the text, curating them, and applying them with the appropriate predictor [5].
Lexicon-based approaches are heavily limited by the quality of the lexicon at hand, and creating consistent and reliable lexicons for a domain is an onerous task [6]. As a consequence, pure lexicon techniques are seldom used. Instead, lexicons are typically combined with machine learning techniques [7][8][9][10][11]. Hence, machine learning techniques and hybrid approaches dominate the state-of-the-art [12][13][14]. Machine learning techniques can use different types of features for their predictions. These features are manually crafted and picked for the specific application. The simplest types of features, which rely solely on lexical and syntactical information (e.g., bag-of-words, syntactic trees), are often referred to as surface forms. Surface forms can also be combined with other prior information, such as lexicons with word sentiment polarity [7][8][9][10][11]. Some lexicons also include non-words such as emoticons [15,16] and emoji [17]. The combination of the resulting features is fed into a classifier, which can be trained on a known dataset or part of it.
The main disadvantage of these approaches is that each feature needs to be conceived of and added by an operator. Although there are processes to select the most informative (i.e., best) features for a given combination of dataset and classifier, the problem of finding and calculating new features still remains.
In contrast, deep learning techniques can automatically learn complex features from data. New approaches based on deep learning have shown excellent performance in sentiment analysis in recent years [18,19]. The downside is that they usually require large amounts of data, which is not always available. They also raise other concerns such as interpretability [20,21] or the inability of a model to adapt to deal with edge cases [20]. In the realm of Natural Language Processing (NLP), most of the focus is on learning fixed-length word vector representations using neural language models [22]. These representations, also known as word embeddings, can then be fed into a deep learning classifier or used with more traditional methods. One of the most popular approaches in this area is word2vec [23]. Although training these models requires enormous amounts of data and fair amounts of computation, there are several publicly available models that have already been trained on large corpora such as Wikipedia.
Lastly, it is also possible to combine independent predictors to achieve a more accurate and reliable model than any of the predictors on their own. This approach is known as ensemble learning. Many ensemble methods have been previously used for sentiment analysis. An exciting new application of ensemble methods is the combination of traditional classifiers based on feature selection and deep learning approaches [12].

Social Network Analysis
Social Network Analysis (SNA) is the investigation of social structures through a combination of social science and graph theory [24]. It provides techniques to characterize and study the connections and interactions between people, using any kind of social (human) network. The mathematical analysis of social networks using graph theory predates the appearance of Online Social Networks (OSNs) by more than a hundred years. The same techniques have been applied successfully to other types of social networks, such as citation networks in academia and call records in mobile networks.
Through SNA techniques, it is possible to extract useful information from a social network, such as chains of influence between users, groups of like-minded users, or metrics of user importance. This information may be useful for many applications, including sentiment analysis. There are several ways in which SNA techniques can be exploited in sentiment analysis, but the analysis of current approaches [1] shows that they can be grouped into one of two categories: those that transform the network into metrics or features that can be used to inform a classifier and those that limit the analysis to certain groups or partitions of the network.
A simple example of metrics provided by SNA is a user's follower in-degree (number of users that follow the user) and out-degree (number of users followed by the user), which could be used as features for each user [25]. However, these metrics are not very rich, as they only cover users directly connected to a user, and they do so in a very naive way: all connections are treated equally. Other more sophisticated metrics could be used instead of in-/out-degree, such as centrality, a measure of the importance of a node within a network topology, or PageRank, an iterative algorithm that weights connections by the importance of the originating user. Several works have introduced alternative metrics for user and content influence in a network [26,27].
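The difference between degree-based features and PageRank can be illustrated with a short sketch. It is only an illustration in plain Python: the function names, the damping factor of 0.85, and the fixed number of iterations are assumptions made for the example, not choices taken from the cited works.

```python
from collections import defaultdict


def degree_features(follows):
    """Return {user: (in_degree, out_degree)} from (follower, followee) pairs."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for follower, followee in follows:
        out_deg[follower] += 1
        in_deg[followee] += 1
    users = set(in_deg) | set(out_deg)
    return {u: (in_deg[u], out_deg[u]) for u in users}


def pagerank(follows, damping=0.85, iterations=50):
    """Power-iteration PageRank; damping and iterations are illustrative defaults."""
    follows = list(follows)
    users = sorted({u for edge in follows for u in edge})
    out_links = defaultdict(list)
    for follower, followee in follows:
        out_links[follower].append(followee)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        # Rank held by users who follow nobody is spread uniformly.
        dangling = sum(rank[u] for u in users if not out_links[u])
        new_rank = {u: (1 - damping) / n + damping * dangling / n for u in users}
        for u in users:
            for v in out_links[u]:
                # A follow from an important user transfers more rank.
                new_rank[v] += damping * rank[u] / len(out_links[u])
        rank = new_rank
    return rank
```

In-degree counts every follower equally, whereas the PageRank value of a user grows with the rank of the users that follow them.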
The second category of approaches is what is known either as network partition or as community detection, depending on whether the groupings may overlap. Intuitively, community detection aims to find subgroups within a larger group. This grouping can be used to inform a classifier or to limit the analysis to relevant groups only. More precisely, community detection identifies groups of vertices that are more densely connected to each other than to the rest of the network [28]. The motivation is to reduce the network into smaller parts that still retain some of the features of the bigger network. These communities may be formed due to different factors, depending on the type of link used to connect users, and the technique used to detect the communities. Each definition has its own set of characteristics and shortcomings. For instance, if users are connected after messaging each other, community detection may reveal groups of users that communicate with each other often [29]. By using friendship relations, community detection may also provide the groups of contacts of a user [30].
Other publications [28,31] cover further details of the different definitions of community and algorithms to detect them.
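The notion of groups that are "more densely connected to each other than to the rest of the network" is commonly quantified with modularity, which compares the fraction of edges inside communities with the fraction expected if edges were placed at random while keeping node degrees. The following sketch computes it for an undirected, unweighted network; the function name and input format are assumptions made for illustration.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping each node to its community id.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Observed fraction of edges with both endpoints in the same community.
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # Expected fraction if edges were rewired at random, preserving degrees.
    totals = {}
    for node, k in degree.items():
        totals[community[node]] = totals.get(community[node], 0) + k
    expected = sum((k / (2 * m)) ** 2 for k in totals.values())
    return inside - expected
```

A partition that places every node in a single community scores zero, while a partition that separates two tightly knit groups joined by a single edge scores well above zero.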

Social Context
Social context [1] is the collection of users, content, relations, and interactions that describe the environment in which social activity takes place. It encapsulates the frame in which communication in social media takes place.
Social context is used in sentiment analysis for two reasons that are subtly different. First, it can be used to compensate for implicit elements in the text. An example of this is how slang, abbreviations, or semantic variations can be detected and accounted for in the classification. Humans apply a similar process when trying to understand content. Content authors also unconsciously rely on this fact, and they assume certain prior knowledge. The second motivation to add social context is that it may help correct ambiguity or situations where textual cues are lacking. For example, a classifier may use the sentiment of earlier posts by the user and similar users on the same topic.
For the sake of clarity and for ease of comparison with other works, we will employ the following general definition of social context [1]:

$$SocialContext = \langle C, U, R, I \rangle$$

where: $C$ is the set of content generated; $U$ is the set of users; $I$ is the set of interactions between users, and of users with content; $R$ is the set of relations between users, between pieces of content, and between users and content. Figure 1 provides a graphical representation of the possible links between entities of the two available types. Users may interact ($I$) with other users ($I_u$) or with content ($I_{uc}$). Relations ($R$) can link any two elements: two users ($R_u$), a user with content ($R_{uc}$), or two pieces of content ($R_c$).
Here, $T_{a,b}$ denotes the types of elements $A_b$; e.g., $T_{i,uc}$ are the types of interactions between users and content ($I_{uc}$). From these definitions, it follows that interactions and relations are very similar, and a network of users and content can be created using either one or both of them. In the parts of the model where a relation ($R$) or an interaction ($I$) can be used, the term edge ($E$) can be used instead.
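As a rough illustration, the four components of a social context could be held in a small container such as the one below. The class, its field names, and the `(source, target, type)` triple format are assumptions made for this sketch and are not part of the definition in [1].

```python
from dataclasses import dataclass, field


@dataclass
class SocialContext:
    content: set = field(default_factory=set)         # C: pieces of content
    users: set = field(default_factory=set)           # U: users
    relations: list = field(default_factory=list)     # R: (source, target, type)
    interactions: list = field(default_factory=list)  # I: (source, target, type)

    def user_edges(self, kind="relations"):
        """Edges E whose two endpoints are users, taken from R or from I."""
        pool = self.relations if kind == "relations" else self.interactions
        return [(a, b) for a, b, _ in pool if a in self.users and b in self.users]
```

Selecting the edge source with a single argument mirrors the observation that relations and interactions are interchangeable wherever the model only needs an edge.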
There are countless ways to construct a social context for a piece of text, depending on the types of information included and how it is gathered. The richness of context influences the type of analysis that can be performed. For the sake of comparison, the ways in which social context is constructed and analyzed can be grouped into one of several categories, according to a taxonomy of approaches [1]. The categories are, from simpler to more complex: micro approaches, in which only one user is included along with the content he or she created; meso approaches, which also add other users and relations or interactions with them; and macro approaches, which include information from outside the OSN, such as facts or encyclopedic knowledge. The meso level is further divided: $meso_r$ only uses relations; $meso_i$ also includes interactions; and $meso_e$ adds information from social network analysis, such as partitions, modularity, or betweenness.

Sentiment Analysis Using Social Context
This section provides a brief summary of works that have leveraged social context for sentiment analysis, following the taxonomy of approaches by Sánchez-Rada and Iglesias [1].
Tan et al. [32] was one of the first works to incorporate social context information, which the authors called heterogeneous graph on topic, to infer (user) sentiment. The underlying ideas behind that work were user consistency and homophily. A function to measure each of those attributes was provided, and the model tried to maximize the overall value. The authors compared alternative ways to construct the user network, using variations of follower-followee relations and direct replies (interactions). However, the approach could be categorized as $meso_r$, for two reasons. Firstly, in their work, relations and interactions yielded similar results. Secondly, in the original formulation, edges (relations or interactions) were not weighted, so users were influenced equally by all their neighbors. Interactions were bound to be noisy, and aggregating them in this fashion was likely to provide little or no advantage over a simple relation. The SANT model [33] follows similar ideas, but for content classification. It is also a $meso_r$ approach that combines sentiment consistency, emotion contagion, and a unigram model in a classifier.
Pozzi et al. [2] extended the model by Tan et al. [32]. Their model used what they called an approval network, which effectively added weights for edges between users. The rationale for that change was that friendship did not imply approval and that a weighted network of interactions should better capture emotion contagion. This addition invalidated the two reasons for not considering it a $meso_i$ approach.
Other models have exploited community detection, which places them in the $meso_e$ category. An example is Xiaomei et al. [34], which incorporated weak dependencies between microblogs, using community detection (different algorithms) on a network of microblogs. In their work, microblogs were connected if their authors were (i.e., there was a follower-followee relation).

Sentiment Classification
The sentiment classification task consists of finding all the sentiment labels for users ($L_u$) and content ($L_c$) in a given social context, where the labels of a subset of users ($B_u$) and a subset of content ($B_c$) are known in advance. The social context is made up of a set of content ($C$), a set of users ($U$), relations between both users and content ($R$), and interactions between users and content ($I$). This is illustrated in Figure 2, where relations and interactions are simplified as undirected edges between nodes (i.e., users and content). For the sake of simplicity, we will only consider two possible labels: positive and negative. However, the model can be used with an arbitrary number of labels.
To solve the classification problem, we propose a classification model that uses a combination of a probability model for a given configuration of user and content labels and a classification algorithm that finds the set of labels with the highest probability. In other words, we define a metric that, based on a given social context, estimates the likelihood that users and content are labeled in a specific configuration. The metric incorporates homophily and consistency assumptions. It also involves several parameters that need to be adjusted or trained. We propose a classification method that estimates the parameters and the labels at the same time, by employing a modified version of SampleRank [35], an algorithm to estimate parameters in complex graphical models. Both the probability model and the classification algorithm are based on two earlier works [2,32], which are described in Section 2.4. However, this section does not assume prior knowledge of these models.
Figure 2. Problem definition. The task is to predict the missing labels. (Legend: unlabeled node, labeled node, positive node, negative node.)

Probability Model
In order to find the best configuration of user and content labels, the classification model uses a probability model that estimates the likelihood of a given distribution of user and content labels. This probability model is based on the Markov assumption that the sentiment of user $u_i$ ($l_{u_i}$) is influenced only by the sentiment of every piece of content $c_j$ ($l_{c_j}$) authored by the user ($P_i$) and the sentiment labels of its neighbors in the network ($N_i$). Likewise, the sentiment of a piece of content $c_i$ ($l_{c_i}$) is influenced by the sentiment label of its author. The label of a node (i.e., user or piece of content) may or may not be known in advance. If a label for a node is known, that node is said to be labeled. Labeled users ($B_u$) and content ($B_c$) are assigned a higher weight or influence on the global probability. The model is defined as follows. Let $l_{u_i}$ be the label for user $u_i$, and let $L_u$ be the vector of labels for all users. Let $l_{c_i}$ be the label for content $c_i$, and let $L_c$ be the vector of labels for all content. To simplify our notation, we will also use $P_i$ as the subset of content that has been authored by user $u_i$ and $N_i$ as the subset of users who are connected to user $u_i$ in the social context graph. Two users are connected when there is an edge between them, which can be chosen from the different types of relations and interactions available in the context, i.e., $\{u_i, u_j\} \in E$, $E \in \{R, I\}$. The probability of a configuration of labels ($L_u$, $L_c$) is given by Equation (9), where $\rho_{neigh}$ is a constant that controls the weight of the effect of neighboring users, $\rho_u$ and $\rho_c$ determine the weight of each user and each piece of content, respectively, and $e_{i,j}$ is the weight of the edge between neighboring users $u_i$ and $u_j$. The values of $\mu(\alpha, \beta)$ and $\lambda(\alpha, \beta)$ model how a node labeled $\beta$ affects a node labeled $\alpha$ ($\alpha, \beta \in Polarities$).
For the typical case, where $Polarities = \{positive, negative\}$, $\mu$ and $\lambda$ can each be thought of as an array with four values, one per combination of the two polarities. For instance, the value of $\mu_{positive,positive}$ is the weight given to positive content by positive users. The weight of a specific user is controlled through $\rho_u$ (Equation (10)), and $\rho_c$ (Equation (11)) controls the weight of each piece of content. The values of both functions depend on whether the label for the specific user or content is known a priori. For users or content with a known sentiment, the weight is $\rho_{labeled}$, and for unknown values, it is $\rho_{unlabeled}$. Based on previous works [2,32], we use the following values: $\rho_{u,labeled} = \rho_{c,labeled} = 1$ and $\rho_{u,unlabeled} = \rho_{c,unlabeled} = 0.2$. Once again, $e_{i,j}$ is the weight of the edge between users $u_i$ and $u_j$. Intuitively, this allows for some specific edges to represent stronger bonds and, hence, have a bigger impact on the result. The influence of neighboring agents, $\rho_{neigh}$, is a parameter that can be adjusted.
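One way to read the description above is as a weighted sum over user-content pairs (consistency) and user-user pairs (homophily). The sketch below computes such an unnormalized score. How the weights are multiplied together is an assumption made for illustration and may differ from the exact form of Equation (9); the function and argument names are hypothetical, and only the default values of 1 and 0.2 for labeled and unlabeled nodes come from the text.

```python
def score(user_labels, content_labels, authored, neighbors, mu, lam,
          known_users=frozenset(), known_content=frozenset(),
          rho_neigh=1.0, rho_labeled=1.0, rho_unlabeled=0.2):
    """Unnormalized log-score of a label configuration (L_u, L_c).

    authored:  dict user -> list of content ids (P_i)
    neighbors: dict user -> dict neighbor -> edge weight e_ij (N_i)
    mu, lam:   dicts keyed by (label_a, label_b) pairs
    """
    def rho_u(user):
        return rho_labeled if user in known_users else rho_unlabeled

    def rho_c(item):
        return rho_labeled if item in known_content else rho_unlabeled

    total = 0.0
    for u, l_u in user_labels.items():
        # Consistency term: the user against each piece of authored content.
        for c in authored.get(u, ()):
            total += rho_u(u) * rho_c(c) * mu[(l_u, content_labels[c])]
        # Homophily term: the user against each weighted neighbor (assumed form).
        for v, e_uv in neighbors.get(u, {}).items():
            total += rho_neigh * rho_u(v) * e_uv * lam[(l_u, user_labels[v])]
    return total
```

With $\mu$ and $\lambda$ values that reward matching labels, a configuration in which users agree with their content and their neighbors receives a higher score than one in which they disagree.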

Parameter Estimation and Classification
Some parameters in the probability model in the previous section were manually set, such as $\rho_{neigh}$ or $\rho_{u,labeled}$, whereas other values have to be calculated. More specifically, the classification process consists of calculating the values for $\mu$ and $\lambda$ and then maximizing the log-likelihood of a given distribution of labels ($L_u$ and $L_c$).
In order to explain the classification process, it is useful to decompose the log-likelihood into a dot product of a matrix of constants and a function of the set of labels, where $\phi$ (Equation (13)) is constant and the value of $\psi$ (Equation (14)) only depends on the labels and the pre-set parameters. In Equation (13), the $\mu$ and $\lambda$ functions are represented as matrices, where $\mu_{\alpha,\beta} = \mu(\alpha, \beta)$. In Equation (14), we simply introduce an auxiliary function, $\gamma$ (Equation (15)), to separate the summations into components, just like $\mu$ and $\lambda$.
The model is thus trained by inferring the values of $\phi$ and the $Z$ constant. As we explained earlier, the value of $\phi$ roughly encodes the expected likelihood of finding a given combination of labels for two nodes. For instance, $\mu_{positive,positive}$ is the likelihood of positive content by positive users, which is expected to be higher than $\mu_{negative,positive}$, under the assumption of consistency. Once these parameters are calculated for a given domain, the classification consists of maximizing the log-likelihood of a given distribution of labels.
SampleRank can be used to determine the value of $\phi$, which is divided into $\mu_{\alpha,\beta}$ and $\lambda_{\alpha,\beta}$. Ideally, the value of $Z$ could be obtained through normalization, but in practice, this can be costly. This need can be circumvented by using other methods that calculate the labels for all unknown elements, such as loopy belief propagation. Alternatively, some works exploit the fact that SampleRank can also output the set of labels in addition to the value for $\phi$ [2]. When used in this manner, training can be interpreted as a search in the space of possible labels, and the log-likelihood function is a heuristic that guides the search. This method has been used successfully for user classification [2], and its main advantage is that it is simpler than using an additional layer of label propagation.
Our proposed classification algorithm (Algorithm 1) is a modified version of SampleRank, which returns the labels for both users and content.

In this algorithm, the $Random(L_u, L_c)$ function returns a random set of user and content labels (within the range of $Polarities$, which in a simple case would just be negative and positive). $E_u$ represents edges between users, i.e., either relations or interactions. The $CD(E_u)$ function performs community detection given a set of edges and returns the set of edges between all users within the same community. In particular, we are using the Louvain method [36]. The $Sample(L_u, L_c)$ function changes one of the labels from either $L_u$ or $L_c$, at random. Since the SampleRank algorithm is inherently stochastic, the model should be run several times, and the results of each run should be aggregated.
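Two of the steps described above lend themselves to a short sketch: turning a community partition into user edges, and aggregating the labels of several stochastic runs. The partition is assumed to come from a community detection algorithm such as Louvain and is passed in as a plain dictionary; the function names and input formats are hypothetical.

```python
from collections import Counter
from itertools import combinations


def community_edges(partition):
    """Connect every pair of users that share a community.

    partition: dict user -> community id (e.g., the output of Louvain).
    """
    members = {}
    for user, comm in partition.items():
        members.setdefault(comm, []).append(user)
    edges = set()
    for group in members.values():
        # Every pair inside a community becomes an undirected edge.
        edges.update(combinations(sorted(group), 2))
    return edges


def majority_vote(runs):
    """Aggregate per-run labels (list of dicts node -> label) by simple majority."""
    votes = {}
    for labels in runs:
        for node, label in labels.items():
            votes.setdefault(node, Counter())[label] += 1
    return {node: counter.most_common(1)[0][0] for node, counter in votes.items()}
```

With two possible labels, an odd number of runs guarantees that the majority vote never ends in a tie.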
In our case, we used 21 iterations, based on earlier works [32], and a simple majority over all iterations.

Datasets
Table 1 provides basic information about the datasets used in the evaluation. Since the model used in this work requires a social context with interactions or relations, the list is limited to datasets that either contained this information or that could be extended using other sources (Section 4.2). The OMD dataset (Obama-McCain debate) [37] contains tweets about the televised debate between Senator John McCain and then-Senator Barack Obama. The tweets were detected by following three hashtags: #current, #tweetdebate, and #debate08. The dataset contained tweets captured during the 97-minute debate and the 53 minutes after it, for a total of 2.5 hours. The dataset included tweet IDs, publication date, text, author name and nickname, and individual annotations of up to seven annotators.
The Health Care Reform (HCR) [38] dataset contained tweets about the run-up to the signing of the health care bill in the USA on March 23, 2010. It was collected using the #hcr hashtag, from early 2010. A subset of the collected tweets was annotated with polarity (positive, negative, neutral, and irrelevant) and polarity targets (health care reform, Obama, Democrats, Republicans, Tea Party, conservatives, liberals, and Stupak) by Speriosu et al. [38]. The tweets were separated into training, dev (HCR-DEV), and test (HCR-TEST) sets. The dataset contained the tweet ID, user ID and username, text of the tweet, sentiment, target of the sentiment, and the annotator and annotator ID.
RT Mind [2] contained a set of 62 users and 159 tweets, with positive or negative annotations. To collect this dataset, Pozzi et al. [2] crawled 2500 Twitter users who tweeted about Obama during two days in May 2013. For each user, their recent tweets (up to 3200, the limit of the API) were collected. At that point, only users that tweeted at least 50 times about Obama were considered. The tweets from those users that relate to Obama were kept and manually labeled by three annotators. Then, a synthetic network of following relations was generated based on a homophily criterion, i.e., users with a similar sentiment were more likely to be connected. The dataset contained the ID of the tweet, the ID of the author, the text of the tweet, the creation time, and the sentiment (positive or negative).

Gathering and Analyzing Social Context
The proposed model needs access to the network of users. Since all datasets provide both tweet and user IDs, it would be possible to access Twitter's public API to retrieve the network. However, that approach has several disadvantages, which stem from the fact that these datasets were originally captured circa 2010 [1]: the relationships between users have likely changed, and many of the original tweets and users have been deleted or made private, making it impossible to fetch them. Instead, we decided to retrieve the follower network from a snapshot of the whole Twitter network in the summer of 2009 [39]. Since the datasets used were gathered around the same time period as the snapshot, this should provide a more reliable list of followers than other methods. We refer to the resulting network as relations.
Upon realizing that the relations network was rather sparse for the OMD and HCR datasets, we investigated an alternative to find hidden links between users: connecting users that followed similar people. To do so, we extracted the list of users followed by each author and we compared the list of followees for each pair of users in the dataset. Users that shared at least a given ratio of their followees were considered similar, and an edge between them was drawn. After evaluating different values for the threshold ratio, it was set to 15%, as it resulted in a degree similar to the RT Mind dataset. We refer to this network as common.
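The construction of the common network can be sketched as follows. The text does not specify how the shared ratio is computed, so the sketch assumes the Jaccard index (shared followees divided by the union of both followee lists); the function name and input format are likewise assumptions.

```python
from itertools import combinations


def common_network(followees, threshold=0.15):
    """Connect users whose followee lists overlap by at least `threshold`.

    followees: dict user -> set of account ids followed by that user.
    The overlap is measured with the Jaccard index (an assumption).
    """
    edges = set()
    for a, b in combinations(sorted(followees), 2):
        union = followees[a] | followees[b]
        if not union:
            continue  # neither user follows anyone, so there is nothing to compare
        shared = len(followees[a] & followees[b]) / len(union)
        if shared >= threshold:
            edges.add((a, b))
    return edges
```

Raising the threshold yields a sparser network, which is the knob that was tuned until the degree was comparable to that of the RT Mind dataset.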
To compare the two network variants, relations and common, we used some basic statistics of each network, shown in Table 2. The table includes the average degree of each node in the network (i.e., mean number of edges per node), the ratio of users that have the same label as the majority of their neighbors in the network (majority agreement), the ratio of users that have the same label as all their neighbors (total agreement), and the ratio of users that do not have any neighbors (isolation ratio). The degree measures the density of the network. The majority and total agreement metrics are a measure of homophily in the network. The table also includes two measures of the balance in labels for user (user label ratio) and content (content label ratio). These two metrics were calculated by dividing the number of elements (i.e., users and content) with the most common label by the total number of elements.
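These statistics can be computed directly from the edge list and the user labels, as in the sketch below. Two details are assumptions, since the text does not fix them: all ratios are taken over the full set of users (so isolated users count as not agreeing), and majority agreement requires strictly more than half of the neighbors to share the user's label.

```python
from collections import Counter


def network_stats(users, edges, labels):
    """Basic density, homophily, and balance statistics of a user network."""
    neigh = {u: set() for u in users}
    for a, b in edges:
        neigh[a].add(b)
        neigh[b].add(a)
    n = len(users)
    majority = total = isolated = 0
    for u in users:
        if not neigh[u]:
            isolated += 1
            continue
        same = sum(1 for v in neigh[u] if labels[v] == labels[u])
        majority += same * 2 > len(neigh[u])  # strict majority shares the label
        total += same == len(neigh[u])        # every neighbor shares the label
    top_count = Counter(labels[u] for u in users).most_common(1)[0][1]
    return {
        "avg_degree": sum(len(s) for s in neigh.values()) / n,
        "majority_agreement": majority / n,
        "total_agreement": total / n,
        "isolation_ratio": isolated / n,
        "user_label_ratio": top_count / n,
    }
```

The content label ratio follows the same pattern as the user label ratio, applied to the content labels instead.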
We observed that the RT Mind dataset was the most promising of all the networks, as its labels were balanced, it had high density and homophily and a higher content count per user, and all of its users were connected. The OMD networks were the densest, but their agreement was very low and a fourth of their users were not connected to others. Moreover, we observed that the common extension of this dataset had a lower agreement ratio and fewer edges, whereas the isolation ratio remained the same as in the relations network. Lastly, the HCR dataset showed the lowest agreement of the datasets, and the relations network was almost non-existent. Although the common network significantly improved every metric, the majority agreement was still very low (0.29). This meant that the additional links were connecting users that were dissimilar, which contradicted the homophily assumption.
In summary, we concluded that this particular strategy to extend social context did not work for these datasets. The statistics for the RT Mind dataset made it ideal for the evaluation of our proposed model. The results for the OMD dataset may indicate how the model would work in scenarios with a higher degree, but relatively low homophily. In that scenario, the meso features may interfere with micro features. Lastly, the HCR dataset could show how the model would work with an almost complete lack of meso features.

Evaluation
The sentiment classification task can be divided into two sub-tasks: user-level classification, which only focuses on predicting user labels ($L_u$), and content-level classification, which focuses on content labels ($L_c$). Since these two tasks are seldom tackled at the same time, we will evaluate how the model performs in each of them independently. The datasets used are described in Section 4.
First, we focus on user-level classification (Section 5.1). The main goal was to evaluate the effect of adding community detection to the SampleRank algorithm and to compare the performance of the model to others. Then, we evaluated the content-level classification (Section 5.2) with varying levels of certainty about user and content labels.
We will compare the performance of CRANK to other classifiers that will serve as the baseline and to the results of other works in the state-of-the-art. Each model will be evaluated on different scenarios, i.e., different social contexts. The ratio of labeled (i.e., known) users and content had a significant impact on the performance of the model. Thus, we evaluated each model with different ratios of known labels for both users ($ratio_u$) and content ($ratio_c$). In each scenario, a random set of labels was kept, according to $ratio_u$ and $ratio_c$. This process was repeated several times to ensure that the results were not too biased by the random partition. For each combination of model, dataset, $ratio_u$, and $ratio_c$, the results were aggregated and the mean accuracy and its standard deviation calculated. Accuracy was chosen over other metrics because it is commonly used in the field [1].
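The evaluation protocol can be summarized in a few lines. The sketch below is simplified to a single set of nodes and a single ratio, whereas the evaluation uses separate ratios for users and content. It also assumes that accuracy is measured only on the nodes whose labels were hidden. The `classify` callback and all names are hypothetical.

```python
import random
from statistics import mean, stdev


def evaluate(classify, true_labels, ratio, repetitions=10, seed=0):
    """Mean and standard deviation of accuracy over random label partitions.

    classify(visible, nodes) must return a dict node -> predicted label.
    Requires ratio < 1 and repetitions >= 2.
    """
    rng = random.Random(seed)
    nodes = sorted(true_labels)
    scores = []
    for _ in range(repetitions):
        # Keep a random subset of labels; the rest must be predicted.
        known = set(rng.sample(nodes, int(ratio * len(nodes))))
        visible = {n: true_labels[n] for n in known}
        predicted = classify(visible, nodes)
        hidden = [n for n in nodes if n not in known]
        correct = sum(predicted[n] == true_labels[n] for n in hidden)
        scores.append(correct / len(hidden))
    return mean(scores), stdev(scores)
```

Fixing the seed makes the random partitions reproducible, so that different models can be compared on the same sets of known labels.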

User-Level Classification
For the evaluation of user classification, we wanted to test whether Hypothesis 1 (meso features improve user classification in the absence of micro features) and Hypothesis 4 ($meso_e$, and community detection in particular, can improve classification compared to only using $meso_i$ and $meso_r$ features) held true. In our case, Hypothesis 1 was tested by comparing the accuracy of the CRANK model to a simpler model that labeled each user using the majority label of his/her content. Hypothesis 4 was tested by comparing the CRANK model to CRANK without community detection.
The following models were compared:
• Average content (AvgContent) (micro): Content was applied the same label as the majority of content by the same user, and users were labeled according to the majority label of their content.
• Naive majority (AvgNeigh) (meso i or meso r , depending on the context): Users were labeled with the majority label in their group of neighbors in the network. Unlabeled content was given the label of its creator.
• Majority in the community (AvgComm) (meso e ): Users were grouped into communities, and each user was given the majority label of the users in their community. Content was given the label of its creator.
• CRANK without community detection (meso r or meso i , depending on the context): The CRANK model described in Algorithm 1, but using original edges instead of applying community detection.
• CRANK (meso e ): Before applying Algorithm 1, the communities between users were extracted and converted to user edges, i.e., users in the same community were connected by an edge.
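The two majority baselines and the conversion of communities into user edges can be illustrated with a short sketch. It assumes that the community partition has already been computed by some community detection algorithm and is passed in as a list of sets of users; the function names and data layout are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations


def majority(labels):
    """Most common label in an iterable, or None if there are no labels."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0] if counts else None


def avg_neigh(user, edges, known):
    """AvgNeigh: majority label among the labeled neighbors of `user`."""
    neighbors = {b for a, b in edges if a == user} | {a for a, b in edges if b == user}
    return majority(known[n] for n in neighbors if n in known)


def avg_comm(user, communities, known):
    """AvgComm: majority label among the other labeled members of the user's community."""
    for members in communities:
        if user in members:
            return majority(known[m] for m in members if m in known and m != user)
    return None


def communities_to_edges(communities):
    """Connect every pair of users that share a community.

    This mirrors the step applied before Algorithm 1 in the CRANK (meso e) variant.
    """
    return {tuple(sorted(pair))
            for members in communities
            for pair in combinations(members, 2)}
```

Note that a community of n users produces n(n-1)/2 edges, so large communities make the resulting user graph considerably denser than the original network.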
The results of the evaluation are shown in Table 3, where the highest value for each row is presented in bold. It also highlights in grey the highest value when the average content was ignored.
If we focus on the results for the RT Mind dataset, we could conclude that CRANK significantly improved the classification in all scenarios, especially with lower ratio c values. In other datasets, where the network of users was sparser and less cohesive, CRANK outperformed all the models, except for the average of content. This was expected, since meso features in these datasets were rather weak, and the mean and median amount of content per user were close to one. In particular, the difference between the CRANK model and the baseline in the HCR dataset was relatively small (0.02). That indicated that there was little penalty to using CRANK even when there were few meso edges between users. In the OMD dataset, which had low agreement between neighbors, the difference between CRANK and the baseline was higher, and it did not decrease with higher values of ratio u . This confirmed our suspicions that the meso features in this dataset were not useful for our purposes.
Regarding Hypothesis 4, we observed that CRANK outperformed its variant without community detection in most of the cases. The exceptions were cases where most of the user labels were known. In those cases, the accuracy of both methods was extremely high (above 0.95). This difference could be explained by interpreting community detection as an aggregate over several users. In general, all the users in a community shared the same sentiment. However, some members would have a different label from the majority in their community (i.e., outliers). Often, those outliers were users that were connected to users of other communities with a different sentiment. That information was lost when aggregating, so for those outliers, community detection was actually detrimental. The fewer users that were left unlabeled, the higher the effect of those outliers would be. Aggregating in those cases presented a higher variance, which combined with the high accuracy values also lowered the mean compared to not aggregating. Nevertheless, we could conclude that meso e features improved user classification in most cases.

Content-Level Classification
For the evaluation of content classification, we wanted to test whether Hypothesis 2 (micro features improve content classification over pure contextless features), Hypothesis 3 (meso features improve content classification in the absence of micro features), and Hypothesis 4 (meso e , and community detection in particular, can improve classification compared to only using meso i and meso r features) held true. To do so, we compared the performance of the following classifiers:
• Simon [40] (contextless): A sentiment analysis model based on semantic similarity. The model can be trained with different datasets. In our evaluation, we compared with the Simon model trained on different datasets: STS, Vader, Sentiment140, and a combination of all three.
• Sentiment140 (https://www.sentiment140.com) service (contextless): This is a public sentiment analysis service, tailored to Twitter. It outputs three labels: positive, negative, and neutral. This results in lower accuracy for the negative and positive labels. In fact, of all the models tested, this was the one with the lowest accuracy. If all tweets labeled neutral by the service are ignored, its accuracy reaches standard levels (around 60%). Unfortunately, this means that around 80% of tweets have to be ignored.
• Meaningcloud (https://www.meaningcloud.com/) Sentiment Analysis (contextless): An enterprise service that provides several types of text analysis, including sentiment analysis. It poses the same restrictions for evaluation as Sentiment140, as it provides positive, negative, and neutral labels. Fortunately, the subjectivity detection of this service for our datasets was better than that of Sentiment140.
• Average Content (AvgContent) (micro): Content is applied the same label as the majority of content by the same user, and users are labeled according to the majority label of their content.
• Naive majority (AvgNeigh) (meso i or meso r , depending on the context): Users are labeled with the majority label in their group of neighbors in the network. Unlabeled content is given the label of its creator.
• Majority in the community (AvgComm) (meso e ): Users are grouped into communities, and each user is given the majority label of the users in their community. Content is given the label of its creator.
• CRANK without community detection (meso r or meso i , depending on the context): The CRANK model described in Algorithm 1, but using original edges instead of applying community detection.
• CRANK (meso e ): Before applying Algorithm 1, the communities between users are extracted and converted to user edges, i.e., users in the same community are connected by an edge.
• Label propagation [38] (Speriosu): Based on the results reported in the original paper for these datasets.
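The effect of the neutral label on the two three-label services can be made concrete with a small sketch. It assumes gold labels that are only positive or negative, and it reports accuracy over all items, accuracy over the items that were not predicted neutral, and the fraction of items that remain after discarding neutral predictions. The function name and the label strings are assumptions made for this example.

```python
def accuracy_with_neutral(predicted, gold):
    """Return (overall accuracy, accuracy ignoring neutral predictions, coverage).

    `gold` maps item ids to "pos"/"neg"; `predicted` may also contain "neutral".
    Items missing from `predicted` are treated as neutral (an assumption).
    """
    total = len(gold)
    answered = [i for i in gold if predicted.get(i, "neutral") != "neutral"]
    correct = sum(predicted[i] == gold[i] for i in answered)
    overall = correct / total
    filtered = correct / len(answered) if answered else 0.0
    coverage = len(answered) / total
    return overall, filtered, coverage
```

A service with a high filtered accuracy but low coverage, as described for Sentiment140 above, would therefore still obtain a low overall accuracy on a two-label dataset.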
We compared the accuracy of each of these models for several combinations of known content and user labels (ratio c and ratio u ). Table 4 shows a summary of the mean accuracy for each combination. We also provide graphs of the mean accuracy and standard deviation of each model (Figures 3-5).
Similarly to the user-classification case, if we focus on the RT Mind dataset, the CRANK algorithm outperformed all other models by a wide margin. In general, the baseline models that used social context had higher accuracy in this dataset than any contextless approach. This was more obvious when either more content was known (better micro features) or more users were known (better meso features). This evidence supported Hypotheses 2 and 3.
In this case, averaging the content of a user yielded poor results for all datasets, due to the low content count per user. If we look at all the results, we observe once again that the version of CRANK with community detection had consistently better accuracy, supporting Hypothesis 4. It should be noted that the Simon model [40] achieved the best performance among the contextless models and the overall best in the OMD dataset. Unfortunately, the results for that dataset were very similar for all the models, and the margins were small, so we could not draw any conclusions from that dataset.

Statistical Analysis
In order to assess the value of the comparison of the models, a statistical test was performed on the experimental results. More specifically, we used a combination of Friedman's test with the corresponding Bonferroni-Dunn post-hoc test, which is oriented toward the comparison of several classifiers on multiple datasets [41].
First of all, in Section 5.1, we claimed that the version of CRANK with community detection outperformed the version without it. To assess that claim, we compared all the user- and content-level classification cases for both models. Friedman's test revealed that the difference between both models was statistically significant, with a chi-squared of 104 and a p-value of 2.9e-5. The post-hoc Bonferroni-Dunn test also passed with a calculated difference of 0.63, which was above a critical difference of 0.27.
Secondly, we compared all the user-level models, ignoring the Average Content classifier. In that case, Friedman's test also rejected the null hypothesis, with a chi-squared of 27.4 and a p-value of 0.0006. In this case, we performed the Bonferroni-Dunn test, with the average of neighbors as the baseline, and both CRANK and CRANK without communities passed it. The results for average in community and average of neighbors were not conclusive.
Thirdly, we performed a similar comparison for content-level classification. We compared the content-level approaches to the Sentiment140 baseline. The calculated critical difference for this case was 3.299. The results were that only CRANK, CRANK without communities, and Simon trained with the STS dataset were better than the baseline (Table 5). Unfortunately, we could not reject the null hypothesis for CRANK and Simon STS alone at the desired level of confidence, given the number of datasets. Nevertheless, if we reduced our test to the scenarios with the RT Mind dataset at different values of ratio u and ratio c , the null hypothesis could be rejected with α = 0.1.
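For reference, the quantities involved in these tests can be computed as in the following sketch, which follows the usual formulation for comparing classifiers over multiple datasets [41]. It is an illustration under stated assumptions, not the script used to obtain the values above: the Friedman statistic is given without tie correction, and the critical value `q_alpha` for the Bonferroni-Dunn test must be taken from published tables for the chosen α and number of models.

```python
import math


def average_ranks(scores):
    """Average rank of each model (rank 1 = highest accuracy), with ties sharing ranks.

    `scores` is a list of rows, one per scenario, each holding one accuracy per model.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]


def friedman_chi2(avg_ranks, n):
    """Friedman chi-squared statistic for k models over n scenarios (no tie correction)."""
    k = len(avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)


def critical_difference(k, n, q_alpha):
    """Bonferroni-Dunn critical difference between a model's and the baseline's average rank."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))
```

A model is considered significantly different from the baseline when the absolute difference between their average ranks exceeds the critical difference.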

Conclusions and Future Work
In this work, we proposed a model that united features from different levels of social context (micro, meso, and meso e ). This model was an extension of earlier models that were limited to user-level classification. Moreover, it employed community detection, which found weak relationships between users that were not directly connected in the network. We expected the combination to have an advantage at different levels of certainty about the labels in the context and with varying degrees of sparsity in the social network. The proposed model was shown to work for both types of classification in different scenarios.
To evaluate the model, we looked at different datasets. The need for a social context restricted the number of datasets that could be used in the evaluation. Of the three datasets included, the RT Mind dataset seemed to be the most appropriate, as it contained a more densely connected network of users. The results of evaluating CRANK with other baseline models in that dataset provided limited support for Hypothesis 1 (meso features improve user classification). Moreover, the evidence from evaluating all the datasets supported Hypotheses 2 (micro features improve content classification) and 3 (meso features improve content classification). By comparing the two versions of CRANK (with and without community detection) in both user- and content-level classification, we also validated Hypothesis 4 (meso e features improve user and content classification). Nonetheless, the analysis of the datasets in Section 4.2 revealed the need for better datasets, which could be enriched with context, i.e., datasets with inter-connected users and more content per user. Hence, further evaluation would be needed, once richer datasets become available.
In addition to evaluating more domains and datasets, there are several lines of future research. In this work, we used a random user and content selection strategy to generate the evaluation datasets. A random sampling strategy for users and content led to higher sparsity. Since the performance of the model depended on having a densely connected graph, it would be interesting to evaluate the effect of different sampling algorithms, such as random walk, breadth-first search, and depth-first search. In particular, Breadth-First Search (BFS) sampling may be more appropriate for this scenario [42].