3.1. Proposed Model and Structure Investigation
In this section, we introduce the model underlying our framework. The purpose of this model is to represent, in a rich yet simple way, a scenario of discording communities, specifically the users involved, their membership to communities, and their interactions.
Let $U$ be a set of users interacting on a social platform where discording communities are present, and let $\mathcal{C}$ be a set of communities on that platform. A user, $u_i \in U$, is involved in a community, $C_j \in \mathcal{C}$, if they post at least one comment on $C_j$. In the Introduction, we saw that $\mathcal{C}$ is a set of discording communities at a certain instant, $t$, if, at that instant, each pair of communities of $\mathcal{C}$ has a discordance degree greater than a certain threshold, $th$. We also introduced a discordance function, $\delta(\cdot)$, that receives two communities, $C_j$ and $C_k$, $C_j, C_k \in \mathcal{C}$, and a time instant, $t$, and returns a value in the real interval $[0,1]$, representing the normalized discordance degree between $C_j$ and $C_k$ at time $t$. In the Introduction, we intentionally left this function generic, since we believe it is appropriate to define different versions of $\delta(\cdot)$ for different scenarios, for example based on the characteristics of the social platform involved and the goals we want to pursue. Here, we propose some examples:
Membership-based $\delta(\cdot)$: On some social platforms, we can leverage information such as users' participation in groups (think, for instance, of subreddits in Reddit or groups in Facebook) discussing a topic from a specific point of view (e.g., the topic may be the COVID-19 vaccine and the groups might be pro-vaxxers and no-vaxxers). These groups represent the communities of $\mathcal{C}$. In this case, the value of $\delta(\cdot)$ related to two communities, $C_j$ and $C_k$, at a certain time instant, $t$, is high if the two communities treat the topic from two very different perspectives, while it is low if they treat the topic from similar perspectives. This version of $\delta(\cdot)$ can be used on those few social platforms that record users' membership in groups or communities (e.g., Reddit). It provides excellent results when opinions on a topic can be easily separated (e.g., people in favor of or against climate change). An example of a membership-based $\delta(\cdot)$ is reported in Section 4.1.
Hashtag-based $\delta(\cdot)$: The adoption of certain hashtags reveals a user's opinion on a certain topic [49]. Therefore, given a topic, we can think of defining $\delta(\cdot)$ based on the hashtags users employed in their posts. In this case, the value of $\delta(\cdot)$ relative to two communities, $C_j$ and $C_k$, at time $t$ is high when the hashtags of the comments posted by the users of $C_j$ and $C_k$ until $t$ reveal different perspectives on the same topic. On the other hand, if the hashtags reveal similar perspectives, the value of $\delta(\cdot)$ is low. This version of $\delta(\cdot)$ can be employed on those social platforms where hashtags are heavily used in comments (e.g., X). In contrast, it is not suitable for platforms that do not involve the use of hashtags (e.g., Reddit) or for cases where the hashtags employed are too generic, and therefore ineffective in describing specific views. An example of a hashtag-based $\delta(\cdot)$ is reported in Section 4.5.
Embedding-based $\delta(\cdot)$: In many cases, it is possible to compute embeddings of user comments employing Natural Language Processing models (e.g., BERT, T5, etc.). Therefore, given a topic, we can think of defining $\delta(\cdot)$ based on a measure of (dis)similarity (e.g., cosine similarity) computed on the embeddings of the comments that users posted on that topic until $t$. In this case, the value of $\delta(\cdot)$ relative to two communities, $C_j$ and $C_k$, at time $t$ is high when the average dissimilarity between the embeddings of the comments posted by the users of $C_j$ and $C_k$ until $t$ is high. On the other hand, if the average dissimilarity is low, the value of $\delta(\cdot)$ will be low. This version of $\delta(\cdot)$ is particularly useful when we only have user comments at our disposal. In this case, the discordance degree is determined only by machine learning models, rather than being directly inferred from user actions like explicit community memberships or hashtag usage (a minimal sketch of this idea is given after this list of examples).
Influencer-based $\delta(\cdot)$: Influencers express their opinions on many topics of interest and can easily polarize their communities. So, we can think of leveraging influencers and their opinions about a topic to define a new version of $\delta(\cdot)$. In this case, the value of $\delta(\cdot)$ relative to two communities, $C_j$ and $C_k$, at time $t$ is high when the users of $C_j$ and $C_k$ followed influencers having different opinions about the topic. On the other hand, if users followed the same influencers, or at least influencers with similar opinions, the value returned by $\delta(\cdot)$ is low. This version of $\delta(\cdot)$ can be used when, given a topic, we can identify the presence of influencers on that topic (e.g., on Instagram or X) and can define their opinions about it. Instead, it cannot be applied to social platforms without influencers (e.g., Reddit).
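As an example, the following sketch outlines how an embedding-based version of $\delta(\cdot)$ might be computed. It is only an illustration, not the definition prescribed by our framework: it assumes that the embeddings of the comments posted in the two communities up to time $t$ are already available as NumPy arrays, and the aggregation adopted (average pairwise cosine dissimilarity, rescaled to $[0,1]$) is just one possible choice.

```python
import numpy as np

def embedding_based_discordance(emb_a, emb_b):
    """Illustrative embedding-based discordance between two communities.

    emb_a, emb_b: 2-D NumPy arrays whose rows are the embeddings of the
    comments posted in the two communities up to time t (assumed given).
    Returns a value in [0, 1]: the average cosine dissimilarity between
    the two sets of comment embeddings.
    """
    # Normalize rows so that dot products become cosine similarities.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    mean_cos_sim = (a @ b.T).mean()      # average similarity, in [-1, 1]
    return (1.0 - mean_cos_sim) / 2.0    # map dissimilarity to [0, 1]
```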
While each discordance function is effective within its specific context, it may struggle to capture the finer nuances of disagreement across various types of discussion. For topics with clearly defined opposing sides (e.g., political or ethical debates), functions based on membership or influencers provide more straightforward measurements of discordance. However, for more complex debates, such as those involving multiple viewpoints, a combination of approaches may be required to capture the subtleties. Whatever discordance function, $\delta(\cdot)$, is chosen, if the set $\mathcal{C}$ turns out to be discording based on the definition specified above, we will use the symbol $\mathcal{C}_D$ to denote it.
Let $P$ be the set of comments posted by the users of $U$ on the social platform. We assume that each comment can be published by a user in response to a post published on the social platform or in response to another comment already published by another user. We also assume that each comment consists of simple text and that we can always refer to it distinctly, i.e., there is a unique identifier for each comment. Furthermore, we assume that each comment can have one or more features. Given a comment $p \in P$, we denote with $u(p)$ the user who posted it and with $C(p)$ the community of $\mathcal{C}_D$ in which it was posted. As a feature of $p$, we consider its score, $s(p)$; this is a non-negative number indicating how much $p$ was appreciated.
Users of $U$ can interact with each other through comments. An interaction is the action a user takes to reply, through a comment, to another user's comment. Let $I$ be the set of interactions. Each interaction, $i = \langle p_a, p_p \rangle \in I$, consists of an ordered pair of comments and indicates that $p_a$ replies to $p_p$. We call comment $p_a$ "active" and comment $p_p$ "passive". Furthermore, we call $u(p_a)$ (resp., $u(p_p)$) the active (resp., passive) part of $i$ and say that $u(p_a)$ (resp., $u(p_p)$) is involved in $i$ as the active (resp., passive) user. It is worth pointing out that, based on what we said above, not all comments posted on the social platform are part of an interaction. In fact, the comments that are published directly in response to a post and receive no comments from other users are not part of any interaction. In other words, for a comment to be part of an interaction, there must be at least one other comment in response to it. Clearly, if a comment receives several comments in response, it will participate as the passive part of as many interactions, one for each comment posted in response to it. Finally, a comment can participate as the active part in at most one interaction; the latter will have as its passive part the comment it was intended to respond to.
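To fix ideas, the following sketch shows one possible in-memory representation of comments and interactions consistent with the definitions above. It is only an illustration: all class and attribute names (Comment, Interaction, etc.) are our own choices and are not part of the model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Comment:
    comment_id: str   # unique identifier of the comment
    user: str         # u(p): the user who posted the comment
    community: str    # C(p): the community in which it was posted
    score: float      # s(p): non-negative appreciation score
    text: str         # the textual content of the comment

@dataclass(frozen=True)
class Interaction:
    active: Comment   # p_a: the comment that replies (active part)
    passive: Comment  # p_p: the comment being replied to (passive part)
```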
Having defined the basic sets of our model, namely $U$, $\mathcal{C}_D$, $P$, and $I$, we observe that, taken together, they contain all the information our framework needs to achieve its goals. However, their set-based representation does not make such achievement easy. In fact, the goals to be pursued are strongly related to the analysis of interactions, and it is well known that the most advantageous representation for studying relationships between different entities is the network-based one [50]. Therefore, to represent the context of interest, we introduce a network-based model, $\mathcal{N}$, defined as:

$$\mathcal{N} = \langle V, A, w \rangle$$
Here:
V is the set of nodes of $\mathcal{N}$. There is a node $v_i \in V$ for each user $u_i \in U$, and vice versa. Since there exists a one-to-one correspondence between the users of $U$ and the nodes of $V$, we will employ the terms "user" and "node" interchangeably in the following.
A is the set of arcs of $\mathcal{N}$. An arc $a_{jk} = (v_j, v_k) \in A$ indicates that the node $v_j$ has interacted at least once as an active user with the node $v_k$, which, in turn, behaved as a passive user. $\mathcal{N}$ is a weighted network; in fact, each arc is associated with a weight.
$w(\cdot)$ is the weight function, which assigns a weight to each arc $a_{jk} \in A$. $w(\cdot)$ returns a non-negative value. Specifically, we chose as $w(\cdot)$ the function that receives an arc, $a_{jk}$, and returns the number of interactions between $v_j$ and $v_k$ in which $v_j$ acted as an active user and $v_k$ behaved as a passive user, i.e., $w(a_{jk}) = |\{ \langle p_a, p_p \rangle \in I \mid u(p_a) = u_j, u(p_p) = u_k \}|$.
Intuitively, the network modeled by $\mathcal{N}$ represents the interactions between users. Each node denotes a user; an arc exists between two nodes if the corresponding users are involved in at least one interaction. Each arc $a_{jk}$ has a weight that indicates the actual number of interactions in which $u_j$ (resp., $u_k$) interacted as an active (resp., passive) user.
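As an illustration, a minimal sketch of how $\mathcal{N}$ could be materialized with the networkx library is reported below. It assumes that the interactions of $I$ are available as (active user, passive user) pairs; all identifiers are illustrative and not part of our framework.

```python
import networkx as nx

def build_interaction_network(interactions):
    """Build the directed, weighted network N from a list of interactions.

    interactions: iterable of (active_user, passive_user) pairs, one per
    interaction i = <p_a, p_p>, with active_user = u(p_a) and
    passive_user = u(p_p).
    """
    network = nx.DiGraph()
    for active_user, passive_user in interactions:
        if network.has_edge(active_user, passive_user):
            # w(a_jk): number of interactions from u_j (active) to u_k (passive)
            network[active_user][passive_user]["weight"] += 1
        else:
            network.add_edge(active_user, passive_user, weight=1)
    return network
```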
Having a network-based model allows for a number of analyses related to network topology. Of great interest in this regard are centrality measures, investigated in social network analysis [51], which allow us to define the importance of users within a social network. Thanks to them, it is possible to construct multiple rankings of the users in a network. Each ranking is associated with a different centrality measure, which, in turn, reflects a certain property that we want to investigate. By having user rankings available, it is possible to introduce the concept of top users. In fact, given a ranking and an integer, $t$, the first $t$ users in that ranking represent its top $t$ users. We apply the concept of top users to our model to identify the most important users with respect to a given property. In fact, studying the top users of a network allows for a more detailed analysis of the properties and interactions of its core members. Now, it is well known that almost all phenomena involving social networks follow a power law distribution [50]. Therefore, knowing the properties and interactions of the core members of a network is equivalent to knowing most of the properties and interactions of the network as a whole. In our case, as we will see in Section 4, the analysis of top users allowed us to derive important insights.
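As an illustration of the notion of top users, the following sketch ranks the nodes of the network according to a centrality measure and keeps the first $t$ of them. The choice of in-degree centrality is only an example; any other centrality measure investigated in social network analysis could be plugged in.

```python
import networkx as nx

def top_users(network, t, centrality=nx.in_degree_centrality):
    """Return the top-t users of the network according to a centrality measure."""
    scores = centrality(network)                         # dict: node -> centrality value
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking[:t]
```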
The model that we introduced may seem one-sided, since it apparently disregards a passive user's reaction to a comment posted by an active user. Actually, our model is two-sided, although this happens indirectly. In fact, there are two popular mechanisms capturing the two-sidedness of interactions, namely likes and reposts. Our model associates each comment, $p$, with a score, $s(p)$, which performs a function similar to that performed by likes in many social media. Furthermore, the repost mechanism is handled indirectly when a passive user receives a comment, $p$, and becomes the active user of another comment, $p'$, having the same content as $p$.
The network $\mathcal{N}$, as it is structured, allows for a range of structural analyses of the interactions that occurred between users belonging to discording communities. However, while it allows for the representation and management of the "who", i.e., the interacting users, it is unable to model and manage the "what", i.e., the reasons why users interacted. In order to handle the latter aspect, it is necessary to consider the corresponding content. To address this issue, in the next section we propose an approach to integrate content into our framework.
3.2. Content Investigation
In this section, we show how it is possible to augment our network, $\mathcal{N}$, with information derived from the content exchanged by users during their interactions. In doing so, we focus on the following goals: (i) we want to integrate content seamlessly into $\mathcal{N}$, i.e., to keep the representation of $\mathcal{N}$ while augmenting it so that it allows for the analysis of both interaction structure and content; (ii) we want to maintain a low complexity, which implies that the content representation must be lightweight.
Based on these goals, we divided our network augmentation process into two parts. The first uses a representation learning approach to generate an embedding for each piece of content (e.g., for each comment). The second uses sentiment analysis algorithms to enrich comments with three quantitative values. In the following, we explain each of these two parts in detail.
Representation learning is a fundamental concept in machine learning and artificial intelligence. Its approaches aim to transform input data into a more informative and compact representation space. Generally, such transformation generates embeddings, which are low-dimensional vector representations of the elements of a given dataset. Embeddings are designed to capture syntactic and semantic information from a text.
We define an embedding function, $emb(\cdot)$, on the set $P$ of comments. It receives a comment, $p$, and returns a vector, $e_p$, representing $p$; $e_p$ is called the embedding of $p$. Several approaches exist in the literature to generate embeddings from text, such as word2vec [38], GloVe [39], and BERTopic [52]. In our experimental campaign, we used the last one.
Given two embeddings, $e_{p_1}$ and $e_{p_2}$, it is useful to have a measure of similarity between them. In our case, we use the cosine similarity [53]. It returns 1 if the two vectors point in the same direction, –1 if they have opposite directions, and 0 if they have no correlation. Intermediate values indicate intermediate situations of similarity or dissimilarity. In the following, we use the notation $sim(e_{p_1}, e_{p_2})$ to indicate the cosine similarity between $e_{p_1}$ and $e_{p_2}$.
Once the embeddings are computed, our framework can proceed with the second step, i.e., annotation. Specifically, it enriches each comment with three quantitative values based on Natural Language Processing (NLP) and sentiment analysis techniques. The three quantitative values are obtained through the following functions:
$sent(\cdot)$: it receives a comment, $p$, and assigns to it the corresponding sentiment value. In the literature, there are several approaches that could be used to implement $sent(\cdot)$. In our experimental campaign, we used VADER (Valence Aware Dictionary and sEntiment Reasoner) [54], a lexicon- and rule-based model specifically designed to evaluate sentiments expressed in social media. We chose VADER because it is highly accurate for short, informal texts such as the comments and posts commonly found on social platforms. Furthermore, it does not require any training data, which simplifies implementation and ensures consistent performance across different datasets. It computes the so-called compound score [23,55,56]. The latter ranges within the real interval $[-1, 1]$; its value is obtained by summing the scores returned by VADER for each word in the lexicon, adjusted based on certain rules (describing common social media content), and normalized between –1 (most negative extreme) and 1 (most positive extreme). A sentiment value tending to 1 indicates that the author made an extremely positive comment; conversely, a sentiment value tending to –1 indicates that the comment is extremely negative. Finally, a sentiment value tending to 0 means that the comment is neutral. Any sentiment value, even zero, is worth considering and provides interesting information for our analysis. For example, extreme values (i.e., very high or very low ones) indicate that the corresponding comment contributes to increasing the level of polarization (and thus the level of discordance) of communities. Conversely, a null value indicates a comment that helps to moderate, and thus reduce, the level of polarization (and thus the level of discordance) of communities. Since we are interested in studying discording communities as thoroughly and broadly as possible, it is clear that the mechanisms that dampen polarization and discordance are also worth investigating.
$subj(\cdot)$: it receives a comment and assigns to it a value called subjectivity. Its values range in the real interval $[0, 1]$, where 0 indicates that the comment is very objective, while 1 denotes that it is extremely subjective. In the literature, there are several approaches that can be used to implement $subj(\cdot)$ [57,58]. In our experiments, we employed the algorithms provided by TextBlob [59]. We chose TextBlob because it is a simple yet effective tool that leverages a lightweight rule-based approach to calculate subjectivity, which makes it both efficient and interpretable. Additionally, TextBlob's pre-built functionality allows us to quickly and reliably compute subjectivity without needing to construct custom models or train on domain-specific data.
$ent(\cdot)$: it receives a comment and returns the number of entities mentioned in its textual content. In fact, it implements a Named Entity Recognition (NER) task. This is an NLP task that involves identifying and categorizing named entities, e.g., names of people, organizations, locations, dates, and other specific terms within a text [60]. In our experiments, for the implementation of $ent(\cdot)$ we used the algorithm provided by the SpaCy (https://spacy.io/) library of Python 3.8, which is based on a machine learning algorithm known as Conditional Random Field (CRF) [61]. We chose CRF because it is well suited for sequence tagging tasks such as Named Entity Recognition. CRF effectively models the relationships between adjacent words in a sequence. As a result, it is able to take into account the context of a word and make more accurate predictions. This results in improved precision and recall in identifying named entities, which is critical for ensuring the quality and reliability of the information extracted from comments. A sketch of the three annotation functions is reported below.
Finally, in Table 1 we present examples of the annotation process for one of the datasets (specifically, the climate change dataset) that we used in our experimental campaign.
Having explained the technical aspects of content investigation, let us now examine its time complexity. The content investigation process can be divided into two parts, namely content embedding and content annotation.
The time complexity of the first part essentially depends on how the embeddings are computed. In our case, we use BERTopic, whose inference consists of embedding the input documents, applying dimensionality reduction techniques to project the document embeddings into a lower-dimensional space, and then assigning the topic based on the cluster to which each document belongs. The time complexity of these steps is $O(N^2 \cdot D + S)$, where $N$ is the number of tokens in a document, $D$ is the dimensionality of the transformer model used in the process, and $S$ is the complexity of the projection used in the embedding process [52].
Instead, the complexity of the content annotation process depends on the functions $sent(\cdot)$, $subj(\cdot)$, and $ent(\cdot)$, and thus on the time complexity of the methods used to implement them. To implement the function $sent(\cdot)$, we used VADER. As discussed in the introductory paper [54] and supported by the official website (https://github.com/cjhutto/vaderSentiment, accessed on 1 January 2025), the time complexity of executing VADER is $O(N)$, where $N$ is the length of the analyzed text. To implement the function $subj(\cdot)$, we used TextBlob. As can be seen from its official documentation, the subjectivity computation is performed through a rule-based mechanism that, in this case, is $O(N)$, where $N$ is the length of the text. Finally, to implement the function $ent(\cdot)$, we used the NER algorithm provided in the SpaCy library, which exploits a Conditional Random Field to label the tokens. Although it is not specified in the official SpaCy documentation, it is reasonable to assume that the algorithm is based on the linear-chain Conditional Random Field typically used when dealing with text elements [61]. The inference time of such an algorithm is $O(N \cdot L^2)$, where $N$ is the number of tokens in the input text and $L$ is the number of possible labels per token. In conclusion, the total time complexity of the content annotation process can be represented by the dominant complexity among the above functions, i.e., $O(N \cdot L^2)$.
Therefore, the overall time complexity of content investigation is equal to $O(N^2 \cdot D + S + N \cdot L^2)$.
3.3. Integrating Structure and Content
During this phase, our framework analyzes discording communities, investigating both their structure and their content. The study of structure can be carried out by analyzing the nodes, arcs, and weights of the network $\mathcal{N}$, while the study of content is performed using the functions $emb(\cdot)$, $sent(\cdot)$, $subj(\cdot)$, and $ent(\cdot)$.
The separate study of structure and content is interesting in itself, but their combined study is even more compelling. In fact, it allows for a series of analyses that take both points of view into account, thus enabling a more holistic investigation.
To formally integrate the properties of $\mathcal{N}$, $emb(\cdot)$, $sent(\cdot)$, $subj(\cdot)$, and $ent(\cdot)$, we introduce an extension $\mathcal{N}'$ of $\mathcal{N}$, defined as follows:

$$\mathcal{N}' = \langle V, A, w, \xi \rangle$$
$\mathcal{N}'$ is constructed on top of $\mathcal{N}$. It has the same set $V$ of nodes and the same set $A$ of arcs, as well as the same weight function, $w(\cdot)$, as $\mathcal{N}$. Therefore, when we refer to the nodes and arcs of $\mathcal{N}'$, we will employ the same sets, $V$ and $A$, used for $\mathcal{N}$.
$\xi(\cdot)$ is an arc augmentation function, which associates each arc of $\mathcal{N}'$ with a set of features that accounts for interactions, embeddings, sentiment, subjectivity, and mentioned entities. Formally speaking, given an arc $a_{jk} \in A$, $\xi(a_{jk})$ can be defined as follows:

$$\xi(a_{jk}) = \langle n_{jk},\ simM_{jk},\ simm_{jk},\ sent_{jk},\ score_{jk},\ subj_{jk},\ ent_{jk},\ kde_{jk} \rangle$$
Here:
$n_{jk}$ is the number of interactions between $u_j$ and $u_k$, i.e., $n_{jk} = w(a_{jk})$.
$simM_{jk}$ is the maximum similarity between the embeddings of the comments in the interactions involving $u_j$ and $u_k$, i.e., $simM_{jk} = \max_{\langle p_a, p_p \rangle \in I_{jk}} sim(e_{p_a}, e_{p_p})$, where $I_{jk}$ denotes the set of interactions in which $u_j$ acted as the active user and $u_k$ as the passive one.
$simm_{jk}$ is the minimum similarity between the embeddings of the comments in the interactions involving $u_j$ and $u_k$, i.e., $simm_{jk} = \min_{\langle p_a, p_p \rangle \in I_{jk}} sim(e_{p_a}, e_{p_p})$.
$sent_{jk}$ is the average sentiment value of all the comments made by $u_j$ in the interactions involving $u_j$ and $u_k$, i.e., $sent_{jk} = \frac{1}{|I_{jk}|} \sum_{\langle p_a, p_p \rangle \in I_{jk}} sent(p_a)$.
$score_{jk}$ is the average score of all the comments made by $u_j$ in the interactions involving $u_j$ and $u_k$, i.e., $score_{jk} = \frac{1}{|I_{jk}|} \sum_{\langle p_a, p_p \rangle \in I_{jk}} s(p_a)$.
$subj_{jk}$ is the average subjectivity value of all the comments made by $u_j$ in the interactions involving $u_j$ and $u_k$, that is, $subj_{jk} = \frac{1}{|I_{jk}|} \sum_{\langle p_a, p_p \rangle \in I_{jk}} subj(p_a)$.
$ent_{jk}$ is the average number of entities mentioned in all the comments made by $u_j$ in the interactions involving $u_j$ and $u_k$, that is, $ent_{jk} = \frac{1}{|I_{jk}|} \sum_{\langle p_a, p_p \rangle \in I_{jk}} ent(p_a)$.
$kde_{jk}$ is the value obtained by applying a Kernel Density Estimation (KDE) [62] on the values of the features $n_{jk}$, $simM_{jk}$, $simm_{jk}$, $sent_{jk}$, $score_{jk}$, $subj_{jk}$, and $ent_{jk}$. KDE is a non-parametric statistical technique used to estimate the probability distribution of one or more continuous variables. Given a dataset of $t$ observations, $\{y_1, y_2, \ldots, y_t\}$, KDE estimates the probability density function, $\hat{f}(y)$, as:

$$\hat{f}(y) = \frac{1}{t \cdot h} \sum_{i=1}^{t} K\!\left(\frac{y - y_i}{h}\right)$$
Here:
- $y_i$ is a single data point in the dataset of observations; it consists of a vector that has a value for each of the features $n_{jk}$, $simM_{jk}$, $simm_{jk}$, $sent_{jk}$, $score_{jk}$, $subj_{jk}$, and $ent_{jk}$ mentioned above.
- $\hat{f}(y)$ is the estimated probability density for a data point $y$ in the dataset of observations.
- $t$ is the number of data points in the dataset.
- $h$ is a smoothing parameter called the bandwidth, which controls the width of the kernel function.
- $K$ is the kernel function. Typically, it is a symmetric, non-negative function centered at zero. Common choices of $K$ include the Gaussian kernel and the linear kernel.
KDE is widely used in various fields, including statistics, data analysis, and machine learning [63,64]. It provides a flexible and powerful tool for understanding the underlying structure of a set of observations.
$kde_{jk}$ represents the probability that the writing styles and the opinions characterizing the comments of $u_j$ and $u_k$ are concordant. Specifically, a high value of $kde_{jk}$ indicates that $u_j$ replied to the comments of $u_k$ with a similar writing style and/or showing concordant opinions. In contrast, a low value of $kde_{jk}$ indicates that the writing styles of the comments of $u_j$ and $u_k$ are dissimilar and/or that the opinions expressed in the comments are discordant. A sketch of the computation of $\xi(\cdot)$ and $kde_{jk}$ is reported below.
To the best of our knowledge, no method to integrate structure and content has been proposed in the past literature; therefore, the method proposed in this section is the first one addressing this issue. For this reason, it is legitimate, and indeed proper, to raise the question of the rationality and reliability of this method. Regarding rationality, we observe that all the parameters composing the tuple returned by the function $\xi(\cdot)$, when it is applied to an arc, $a_{jk}$, are well known in the literature. As for reliability, this can only be determined experimentally. In this regard, it should be pointed out that the experiments described in Section 4 confirm the correctness and usefulness of the parameters characterizing the integration method.
Finally, we have seen above that top users play a key role in the study of $\mathcal{N}$. The same is true for the network $\mathcal{N}'$. In fact, by comparing the top users related to the same property in the networks $\mathcal{N}$ and $\mathcal{N}'$, it is possible to extract insights on the different roles of structure and content in the user dynamics of discording communities.
Let us now examine the time complexity associated with the structure and content integration activities. We start this characterization by assuming that all embeddings are already pre-computed and accessible in $O(1)$. Indeed, this is exactly the case in our framework, where the previous phase, i.e., content investigation, is essentially performed only once for the dataset under investigation. Therefore, we can store these data in a hash table, which allows us to access them in constant time. This is also true for all the sets defined in Section 3.1, which can be represented by data structures such as matrices or hash tables, allowing access to a single piece of information in $O(1)$. Furthermore, the similarity between two embeddings can be computed only once and stored in a matrix whose access has time complexity $O(1)$.
The core of this phase is the construction of the network $\mathcal{N}'$. In this case, we can represent it as a simple adjacency matrix with $|A|$ non-empty entries, and thus the time complexity of its construction is $O(|V|^2)$, where $V$ is the set of nodes. Nevertheless, $\mathcal{N}'$ also has an arc augmentation function, $\xi(a_{jk})$, where $a_{jk}$ is an arc of $\mathcal{N}'$, whose computation can be performed during the construction of $\mathcal{N}'$. The time complexity of $\xi(\cdot)$ coincides with the maximum time complexity needed to compute the features with which the arc is associated. Thanks to the pre-computation of the embeddings and similarity values, almost all features can be computed in linear time, in particular in $O(|I_{jk}|)$, where $I_{jk}$ is the set of interactions between the users represented by the nodes $v_j$ and $v_k$. The value of the feature $kde_{jk}$ is obtained by applying a KDE to the values of all the other features calculated in $O(|I_{jk}|)$; its time complexity is linear in the number of features considered. Wrapping up, the final time complexity of this phase is $O(|V|^2 \cdot I_{max})$, where $I_{max}$ is the maximum number of interactions between two users.