3.1. Proposed Model and Structure Investigation
In this section, we introduce the model underlying our framework. The purpose of this model is to represent, in a rich yet simple way, a scenario of discording communities, specifically the users involved, their membership to communities, and their interactions.
Let $U$ be a set of users interacting on a social platform where discording communities are present, and let $\mathcal{C}$ be a set of communities on that platform. A user, $u_i \in U$, is involved in a community, $C_j \in \mathcal{C}$, if they post at least one comment on $C_j$. In the Introduction, we saw that $\mathcal{C}$ is a set of discording communities at a certain instant, $t$, if, at that instant, each pair of communities of $\mathcal{C}$ has a discordance degree greater than a certain threshold, $th$. We also introduced a discordance function, $\delta(\cdot)$, that receives two communities, $C_j$ and $C_k$, $C_j, C_k \in \mathcal{C}$, and a time instant, $t$, and returns a value in the real interval $[0,1]$, representing the normalized discordance degree between $C_j$ and $C_k$ at time $t$. In the Introduction, we intentionally left this function generic, since we believe it is appropriate to define different versions of $\delta(\cdot)$ for different scenarios, for example based on the characteristics of the social platform involved and the goals we want to pursue. Here, we propose some examples:
Membership-based $\delta(\cdot)$: On some social platforms, we can leverage information such as users' participation in groups (think, for instance, of subreddits in Reddit or groups in Facebook) discussing a topic from a specific point of view (e.g., the topic may be the COVID-19 vaccine and the groups might be pro-vaxxers and no-vaxxers). These groups represent the communities of $\mathcal{C}$. In this case, the value of $\delta(\cdot)$ related to two communities, $C_j$ and $C_k$, at a certain time instant, $t$, is high if the two communities treat the topic from two very different perspectives, while it is low if they treat the topic from similar perspectives. This version of $\delta(\cdot)$ can be used on those few social platforms that record users' membership in groups or communities (e.g., Reddit). It provides excellent results when opinions on a topic can be easily separated (e.g., people in favor of or against climate change). An example of a membership-based $\delta(\cdot)$ is reported in Section 4.1.
Hashtag-based $\delta(\cdot)$: The adoption of certain hashtags reveals a user's opinion on a certain topic [49]. Therefore, given a topic, we can think of defining $\delta(\cdot)$ based on the hashtags users employed in their posts. In this case, the value of $\delta(\cdot)$ relative to two communities, $C_j$ and $C_k$, at time $t$ is high when the hashtags of the comments posted by the users of $C_j$ and $C_k$ until $t$ reveal different perspectives on the same topic. On the other hand, if the hashtags reveal similar perspectives, the value of $\delta(\cdot)$ is low. This version of $\delta(\cdot)$ can be employed on those social platforms where hashtags are heavily used in comments (e.g., X). In contrast, it is not suitable for platforms that do not involve the use of hashtags (e.g., Reddit) or for cases where the hashtags employed are too generic, and therefore ineffective in describing specific views. An example of a hashtag-based $\delta(\cdot)$ is reported in Section 4.5.
Embedding-based $\delta(\cdot)$: In many cases, it is possible to compute embeddings of user comments employing Natural Language Processing models (e.g., BERT, T5, etc.). Therefore, given a topic, we can think of defining $\delta(\cdot)$ based on a measure of (dis)similarity (e.g., cosine similarity) computed on the embeddings of the comments that users posted on that topic until $t$. In this case, the value of $\delta(\cdot)$ relative to two communities, $C_j$ and $C_k$, at time $t$ is high when the average dissimilarity between the embeddings of the comments posted by the users of $C_j$ and $C_k$ until $t$ is high. On the other hand, if the average dissimilarity is low, the value of $\delta(\cdot)$ will be low. This version of $\delta(\cdot)$ is particularly useful when we only have user comments at our disposal. In this case, the discordance degree is determined only by machine learning models, rather than being directly inferred from user actions like explicit community memberships or hashtag usage (a minimal sketch of this idea is given after this list of examples).
Influencer-based $\delta(\cdot)$: Influencers express their opinions on many topics of interest and can easily polarize their communities. So, we can think of leveraging influencers and their opinions about a topic to define a new version of $\delta(\cdot)$. In this case, the value of $\delta(\cdot)$ relative to two communities, $C_j$ and $C_k$, at time $t$ is high when the users of $C_j$ and $C_k$ followed influencers having different opinions about the topic. On the other hand, if users followed the same influencers, or at least influencers with similar opinions, the value returned by $\delta(\cdot)$ is low. This version of $\delta(\cdot)$ can be used when, given a topic, we can identify the presence of influencers on that topic (e.g., on Instagram or X) and can define their opinions about it. Instead, it cannot be applied to social platforms without influencers (e.g., Reddit).
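As an example, the following sketch outlines how an embedding-based version of $\delta(\cdot)$ might be computed. It is only an illustration, not the definition prescribed by our framework: it assumes that the embeddings of the comments posted in the two communities up to time $t$ are already available as NumPy arrays, and the aggregation adopted (average pairwise cosine dissimilarity, rescaled to $[0,1]$) is just one possible choice.

```python
import numpy as np

def embedding_based_discordance(emb_a, emb_b):
    """Illustrative embedding-based discordance between two communities.

    emb_a, emb_b: 2-D NumPy arrays whose rows are the embeddings of the
    comments posted in the two communities up to time t (assumed given).
    Returns a value in [0, 1]: the average cosine dissimilarity between
    the two sets of comment embeddings.
    """
    # Normalize rows so that dot products become cosine similarities.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    mean_cos_sim = (a @ b.T).mean()      # average similarity, in [-1, 1]
    return (1.0 - mean_cos_sim) / 2.0    # map dissimilarity to [0, 1]
```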
While each discordance function is effective within its specific context, it may struggle to capture the finer nuances of disagreement across various types of discussion. For topics with clearly defined opposing sides (e.g., political or ethical debates), functions based on membership or influencers provide more straightforward measurements of discordance. However, for more complex debates, such as those involving multiple viewpoints, a combination of approaches may be required to capture the subtleties. Whatever discordance function, $\delta(\cdot)$, is chosen, if the set $\mathcal{C}$ turns out to be discording based on the definition specified above, we will use the symbol $\mathcal{C}_D$ to denote it.
Let $P$ be the set of comments posted by the users of $U$ on the social platform. We assume that each comment can be published by a user in response to a post published on the social platform or in response to another comment already published by another user. We also assume that each comment consists of simple text and that we can always refer to it distinctly, i.e., there is a unique identifier for each comment. Furthermore, we assume that each comment can have one or more features. Given a comment $p \in P$, we denote with $u(p)$ the user who posted it and with $C(p)$ the community of $\mathcal{C}_D$ in which it was posted. As a feature of $p$, we consider its score, $s(p)$; this is a non-negative number indicating how much $p$ was appreciated.
Users of $U$ can interact with each other through comments. An interaction is the action a user takes to reply, through a comment, to another user's comment. Let $I$ be the set of interactions. Each interaction, $i = \langle p_a, p_p \rangle \in I$, consists of an ordered pair of comments and indicates that $p_a$ replies to $p_p$. We call comment $p_a$ "active" and comment $p_p$ "passive". Furthermore, we call $u(p_a)$ (resp., $u(p_p)$) the active (resp., passive) part of $i$ and say that $u(p_a)$ (resp., $u(p_p)$) is involved in $i$ as the active (resp., passive) user. It is worth pointing out that, based on what we said above, not all comments posted on the social platform are part of an interaction. In fact, the comments that are published directly in response to a post and receive no comments from other users are not part of any interaction. In other words, for a comment to be part of an interaction, there must be at least one other comment in response to it. Clearly, if a comment receives several comments in response, it will participate as the passive part of as many interactions, one for each comment posted in response to it. Finally, a comment can participate as the active part in at most one interaction; the latter will have as its passive part the comment it was intended to respond to.
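To fix ideas, the following sketch shows one possible in-memory representation of comments and interactions consistent with the definitions above. It is only an illustration: all class and attribute names (Comment, Interaction, etc.) are our own choices and are not part of the model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Comment:
    comment_id: str   # unique identifier of the comment
    user: str         # u(p): the user who posted the comment
    community: str    # C(p): the community in which it was posted
    score: float      # s(p): non-negative appreciation score
    text: str         # the textual content of the comment

@dataclass(frozen=True)
class Interaction:
    active: Comment   # p_a: the comment that replies (active part)
    passive: Comment  # p_p: the comment being replied to (passive part)
```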
Having defined the basic sets of our model, namely $U$, $\mathcal{C}_D$, $P$, and $I$, we observe that, taken together, they contain all the information our framework needs to achieve its goals. However, their set-based representation does not make such achievement easy. In fact, the goals to be pursued are strongly related to the analysis of interactions, and it is well known that the most advantageous representation for studying relationships between different entities is the network-based one [50]. Therefore, to represent the context of interest, we introduce a network-based model, $\mathcal{N}$, defined as:

$$\mathcal{N} = \langle V, A, w \rangle$$
Here:
V is the set of nodes of $\mathcal{N}$. There is a node $v_i \in V$ for each user $u_i \in U$, and vice versa. Since there exists a one-to-one correspondence between the users of $U$ and the nodes of $V$, we will employ the terms "user" and "node" interchangeably in the following.
A is the set of arcs of $\mathcal{N}$. An arc $a_{jk} = (v_j, v_k) \in A$ indicates that the node $v_j$ has interacted at least once as an active user with the node $v_k$, which, in turn, behaved as a passive user. $\mathcal{N}$ is a weighted network; in fact, each arc is associated with a weight.
$w(\cdot)$ is the weight function, which assigns a weight to each arc $a_{jk} \in A$. $w(\cdot)$ returns a non-negative value. Specifically, we chose as $w(\cdot)$ the function that receives an arc, $a_{jk}$, and returns the number of interactions between $v_j$ and $v_k$ in which $v_j$ acted as an active user and $v_k$ behaved as a passive user, i.e., $w(a_{jk}) = |\{ \langle p_a, p_p \rangle \in I \mid u(p_a) = u_j, u(p_p) = u_k \}|$.
Intuitively, the network modeled by $\mathcal{N}$ represents the interactions between users. Each node denotes a user; an arc exists between two nodes if the corresponding users are involved in at least one interaction. Each arc $a_{jk}$ has a weight that indicates the actual number of interactions in which $u_j$ (resp., $u_k$) interacted as an active (resp., passive) user.
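As an illustration, a minimal sketch of how $\mathcal{N}$ could be materialized with the networkx library is reported below. It assumes that the interactions of $I$ are available as (active user, passive user) pairs; all identifiers are illustrative and not part of our framework.

```python
import networkx as nx

def build_interaction_network(interactions):
    """Build the directed, weighted network N from a list of interactions.

    interactions: iterable of (active_user, passive_user) pairs, one per
    interaction i = <p_a, p_p>, with active_user = u(p_a) and
    passive_user = u(p_p).
    """
    network = nx.DiGraph()
    for active_user, passive_user in interactions:
        if network.has_edge(active_user, passive_user):
            # w(a_jk): number of interactions from u_j (active) to u_k (passive)
            network[active_user][passive_user]["weight"] += 1
        else:
            network.add_edge(active_user, passive_user, weight=1)
    return network
```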
Having a network-based model allows for a number of analyses related to network topology. Of great interest in this regard are centrality measures, investigated in social network analysis [51], which allow us to define the importance of users within a social network. Thanks to them, it is possible to construct multiple rankings of the users in a network. Each ranking is associated with a different centrality measure, which, in turn, reflects a certain property that we want to investigate. By having user rankings available, it is possible to introduce the concept of top users. In fact, given a ranking and an integer, $t$, the first $t$ users in that ranking represent its top $t$ users. We apply the concept of top users to our model to identify the most important users with respect to a given property. In fact, studying the top users of a network allows for a more detailed analysis of the properties and interactions of its core members. Now, it is well known that almost all phenomena involving social networks follow a power law distribution [50]. Therefore, knowing the properties and interactions of the core members of a network is equivalent to knowing most of the properties and interactions of the network as a whole. In our case, as we will see in Section 4, the analysis of top users allowed us to derive important insights.
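As an illustration of the notion of top users, the following sketch ranks the nodes of the network according to a centrality measure and keeps the first $t$ of them. The choice of in-degree centrality is only an example; any other centrality measure investigated in social network analysis could be plugged in.

```python
import networkx as nx

def top_users(network, t, centrality=nx.in_degree_centrality):
    """Return the top-t users of the network according to a centrality measure."""
    scores = centrality(network)                         # dict: node -> centrality value
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking[:t]
```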
The model that we introduced may seem one-sided, since it apparently disregards a passive user's reaction to a comment posted by an active user. Actually, our model is two-sided, although this happens indirectly. In fact, there are two popular mechanisms capturing the two-sidedness of interactions, namely likes and reposts. Our model associates each comment, $p$, with a score, $s(p)$, which performs a function similar to that performed by likes in many social media. Furthermore, the repost mechanism is handled indirectly when a passive user receives a comment, $p$, and becomes the active user of another comment, $p'$, having the same content as $p$.
The network $\mathcal{N}$, as it is structured, allows for a range of structural analyses of the interactions that occurred between users belonging to discording communities. However, while it allows for the representation and management of the "who", i.e., the interacting users, it is unable to model and manage the "what", i.e., the reasons why users interacted. In order to handle the latter aspect, it is necessary to consider the corresponding content. To address this issue, in the next section we propose an approach to integrate content into our framework.
3.2. Content Investigation
In this section, we show how it is possible to augment our network, $\mathcal{N}$, with information derived from the content exchanged by users during their interactions. In doing so, we focus on the following goals: (i) we want to integrate content seamlessly into $\mathcal{N}$, i.e., to keep the representation of $\mathcal{N}$ while augmenting it so that it allows for the analysis of both interaction structure and content; (ii) we want to maintain a low complexity, which implies that the content representation must be lightweight.
Based on these goals, we divided our network augmentation process into two parts. The first uses a representation learning approach to generate an embedding for each piece of content (e.g., for each comment). The second uses sentiment analysis algorithms to enrich comments with three quantitative values. In the following, we explain each of these two parts in detail.
Representation learning is a fundamental concept in machine learning and artificial intelligence. Its approaches aim to transform input data into a more informative and compact representation space. Generally, such transformation generates embeddings, which are low-dimensional vector representations of the elements of a given dataset. Embeddings are designed to capture syntactic and semantic information from a text.
We define an embedding function, $emb(\cdot)$, on the set $P$ of comments. It receives a comment, $p$, and returns a vector, $e_p$, representing $p$; $e_p$ is called the embedding of $p$. Several approaches exist in the literature to generate embeddings from text, such as word2vec [38], GloVe [39], and BERTopic [52]. In our experimental campaign, we used the last one.
Given two embeddings, $e_{p_1}$ and $e_{p_2}$, it is useful to have a measure of similarity between them. In our case, we use the cosine similarity [53]. It returns 1 if the two vectors point in the same direction, –1 if they have opposite directions, and 0 if they have no correlation. Intermediate values indicate intermediate situations of similarity or dissimilarity. In the following, we use the notation $sim(e_{p_1}, e_{p_2})$ to indicate the cosine similarity between $e_{p_1}$ and $e_{p_2}$.
Once the embeddings are computed, our framework can proceed with the second step, i.e., annotation. Specifically, it enriches each comment with three quantitative values based on Natural Language Processing (NLP) and sentiment analysis techniques. The three quantitative values are obtained through the following functions:
$sent(\cdot)$: it receives a comment, $p$, and assigns to it the corresponding sentiment value. In the literature, there are several approaches that could be used to implement $sent(\cdot)$. In our experimental campaign, we used VADER (Valence Aware Dictionary and sEntiment Reasoner) [54], a lexicon- and rule-based model specifically designed to evaluate sentiments expressed in social media. We chose VADER because it is highly accurate for short, informal texts such as the comments and posts commonly found on social platforms. Furthermore, it does not require any training data, which simplifies implementation and ensures consistent performance across different datasets. It computes the so-called compound score [23,55,56]. The latter ranges within the real interval $[-1, 1]$; its value is obtained by summing the scores returned by VADER for each word in the lexicon, adjusted based on certain rules (describing common social media content), and normalized between –1 (most negative extreme) and 1 (most positive extreme). A sentiment value tending to 1 indicates that the author made an extremely positive comment; conversely, a sentiment value tending to –1 indicates that the comment is extremely negative. Finally, a sentiment value tending to 0 means that the comment is neutral. Any sentiment value, even zero, is worth considering and provides interesting information for our analysis. For example, extreme values (i.e., very high or very low ones) indicate that the corresponding comment contributes to increasing the level of polarization (and thus the level of discordance) of communities. Conversely, a null value indicates a comment that helps to moderate, and thus reduce, the level of polarization (and thus the level of discordance) of communities. Since we are interested in studying discording communities as thoroughly and broadly as possible, it is clear that the mechanisms that dampen polarization and discordance are also worth investigating.
$subj(\cdot)$: it receives a comment and assigns to it a value called subjectivity. Its values range in the real interval $[0, 1]$, where 0 indicates that the comment is very objective, while 1 denotes that it is extremely subjective. In the literature, there are several approaches that can be used to implement $subj(\cdot)$ [57,58]. In our experiments, we employed the algorithms provided by TextBlob [59]. We chose TextBlob because it is a simple yet effective tool that leverages a lightweight rule-based approach to calculate subjectivity, which makes it both efficient and interpretable. Additionally, TextBlob's pre-built functionality allows us to quickly and reliably compute subjectivity without needing to construct custom models or train on domain-specific data.
$ent(\cdot)$: it receives a comment and returns the number of entities mentioned in its textual content. In fact, it implements a Named Entity Recognition (NER) task. This is an NLP task that involves identifying and categorizing named entities, e.g., names of people, organizations, locations, dates, and other specific terms within a text [60]. In our experiments, for the implementation of $ent(\cdot)$ we used the algorithm provided by the SpaCy (https://spacy.io/) library of Python 3.8, which is based on a machine learning algorithm known as Conditional Random Field (CRF) [61]. We chose CRF because it is well suited for sequence tagging tasks such as Named Entity Recognition. CRF effectively models the relationships between adjacent words in a sequence. As a result, it is able to take into account the context of a word and make more accurate predictions. This results in improved precision and recall in identifying named entities, which is critical for ensuring the quality and reliability of the information extracted from comments. A sketch of the three annotation functions is reported below.
Finally, in Table 1 we present examples of the annotation process for one of the datasets (specifically, the climate change dataset) that we used in our experimental campaign.
Having explained the technical aspects of content investigation, let us now examine its time complexity. The content investigation process can be divided into two parts, namely content embedding and content annotation.
The time complexity of the first part essentially depends on how the embeddings are computed. In our case, we use BERTopic, whose inference consists of embedding the input documents, applying dimensionality reduction techniques to project the document embeddings into a lower-dimensional space, and then assigning the topic based on the cluster to which each document belongs. The time complexity of these steps is $O(N^2 \cdot D + S)$, where $N$ is the number of tokens in a document, $D$ is the dimensionality of the transformer model used in the process, and $S$ is the complexity of the projection used in the embedding process [52].
Instead, the complexity of the content annotation process depends on the functions $sent(\cdot)$, $subj(\cdot)$, and $ent(\cdot)$, and thus on the time complexity of the methods used to implement them. To implement the function $sent(\cdot)$, we used VADER. As discussed in the introductory paper [54] and supported by the official website (https://github.com/cjhutto/vaderSentiment, accessed on 1 January 2025), the time complexity of executing VADER is $O(N)$, where $N$ is the length of the analyzed text. To implement the function $subj(\cdot)$, we used TextBlob. As can be seen from its official documentation, the subjectivity computation is performed through a rule-based mechanism that, in this case, is $O(N)$, where $N$ is the length of the text. Finally, to implement the function $ent(\cdot)$, we used the NER algorithm provided in the SpaCy library, which exploits a Conditional Random Field to label the tokens. Although it is not specified in the official SpaCy documentation, it is reasonable to assume that the algorithm is based on the linear-chain Conditional Random Field typically used when dealing with text elements [61]. The inference time of such an algorithm is $O(N \cdot L^2)$, where $N$ is the number of tokens in the input text and $L$ is the number of possible labels per token. In conclusion, the total time complexity of the content annotation process can be represented by the dominant complexity among the above functions, i.e., $O(N \cdot L^2)$.
Therefore, the overall time complexity of content investigation is equal to $O(N^2 \cdot D + S + N \cdot L^2)$.
3.3. Integrating Structure and Content
During this phase, our framework analyzes discording communities, investigating both their structure and their content. The study of structure can be carried out by analyzing the nodes, arcs, and weights of the network $\mathcal{N}$, while the study of content is performed using the functions $emb(\cdot)$, $sent(\cdot)$, $subj(\cdot)$, and $ent(\cdot)$.
The separate study of structure and content is interesting in itself, but their combined study is even more compelling. In fact, it allows for a series of analyses that take both points of view into account, thus enabling a more holistic investigation.
To formally integrate the properties of $\mathcal{N}$, $emb(\cdot)$, $sent(\cdot)$, $subj(\cdot)$, and $ent(\cdot)$, we introduce an extension $\mathcal{N}'$ of $\mathcal{N}$, defined as follows:

$$\mathcal{N}' = \langle V, A, w, \xi \rangle$$
$\mathcal{N}'$ is constructed on top of $\mathcal{N}$. It has the same set $V$ of nodes and the same set $A$ of arcs, as well as the same weight function, $w(\cdot)$, as $\mathcal{N}$. Therefore, when we refer to the nodes and arcs of $\mathcal{N}'$, we will employ the same sets, $V$ and $A$, used for $\mathcal{N}$.
$\xi(\cdot)$ is an arc augmentation function, which associates each arc of $\mathcal{N}'$ with a set of features that accounts for interactions, embeddings, sentiment, subjectivity, and mentioned entities. Formally speaking, given an arc $a_{jk} \in A$, $\xi(a_{jk})$ can be defined as follows:

$$\xi(a_{jk}) = \langle n_{jk},\ simM_{jk},\ simm_{jk},\ sent_{jk},\ score_{jk},\ subj_{jk},\ ent_{jk},\ kde_{jk} \rangle$$
Here:
$n_{jk}$ is the number of interactions between $u_j$ and $u_k$, i.e., $n_{jk} = w(a_{jk})$.
$simM_{jk}$ is the maximum similarity between the embeddings of the comments in the interactions involving $u_j$ and $u_k$, i.e., $simM_{jk} = \max_{\langle p_a, p_p \rangle \in I_{jk}} sim(e_{p_a}, e_{p_p})$, where $I_{jk}$ denotes the set of interactions in which $u_j$ acted as the active user and $u_k$ as the passive one.
$simm_{jk}$ is the minimum similarity between the embeddings of the comments in the interactions involving $u_j$ and $u_k$, i.e., $simm_{jk} = \min_{\langle p_a, p_p \rangle \in I_{jk}} sim(e_{p_a}, e_{p_p})$.
$sent_{jk}$ is the average sentiment value of all the comments made by $u_j$ in the interactions involving $u_j$ and $u_k$, i.e., $sent_{jk} = \frac{1}{|I_{jk}|} \sum_{\langle p_a, p_p \rangle \in I_{jk}} sent(p_a)$.
$score_{jk}$ is the average score of all the comments made by $u_j$ in the interactions involving $u_j$ and $u_k$, i.e., $score_{jk} = \frac{1}{|I_{jk}|} \sum_{\langle p_a, p_p \rangle \in I_{jk}} s(p_a)$.
$subj_{jk}$ is the average subjectivity value of all the comments made by $u_j$ in the interactions involving $u_j$ and $u_k$, that is, $subj_{jk} = \frac{1}{|I_{jk}|} \sum_{\langle p_a, p_p \rangle \in I_{jk}} subj(p_a)$.
$ent_{jk}$ is the average number of entities mentioned in all the comments made by $u_j$ in the interactions involving $u_j$ and $u_k$, that is, $ent_{jk} = \frac{1}{|I_{jk}|} \sum_{\langle p_a, p_p \rangle \in I_{jk}} ent(p_a)$.
$kde_{jk}$ is the value obtained by applying a Kernel Density Estimation (KDE) [62] on the values of the features $n_{jk}$, $simM_{jk}$, $simm_{jk}$, $sent_{jk}$, $score_{jk}$, $subj_{jk}$, and $ent_{jk}$. KDE is a non-parametric statistical technique used to estimate the probability distribution of one or more continuous variables. Given a dataset of $t$ observations, $\{y_1, y_2, \ldots, y_t\}$, KDE estimates the probability density function, $\hat{f}(y)$, as:

$$\hat{f}(y) = \frac{1}{t \cdot h} \sum_{i=1}^{t} K\!\left(\frac{y - y_i}{h}\right)$$
Here:
- $y_i$ is a single data point in the dataset of observations; it consists of a vector that has a value for each of the features $n_{jk}$, $simM_{jk}$, $simm_{jk}$, $sent_{jk}$, $score_{jk}$, $subj_{jk}$, and $ent_{jk}$ mentioned above.
- $\hat{f}(y)$ is the estimated probability density for a data point $y$ in the dataset of observations.
- $t$ is the number of data points in the dataset.
- $h$ is a smoothing parameter called the bandwidth, which controls the width of the kernel function.
- $K$ is the kernel function. Typically, it is a symmetric, non-negative function centered at zero. Common choices of $K$ include the Gaussian kernel and the linear kernel.
KDE is widely used in various fields, including statistics, data analysis, and machine learning [63,64]. It provides a flexible and powerful tool for understanding the underlying structure of a set of observations.
$kde_{jk}$ represents the probability that the writing styles and the opinions characterizing the comments of $u_j$ and $u_k$ are concordant. Specifically, a high value of $kde_{jk}$ indicates that $u_j$ replied to the comments of $u_k$ with a similar writing style and/or showing concordant opinions. In contrast, a low value of $kde_{jk}$ indicates that the writing styles of the comments of $u_j$ and $u_k$ are dissimilar and/or that the opinions expressed in the comments are discordant. A sketch of the computation of $\xi(\cdot)$ and $kde_{jk}$ is reported below.
To the best of our knowledge, no method to integrate structure and content has been proposed in the past literature; therefore, the method proposed in this section is the first one addressing this issue. For this reason, it is legitimate, and indeed proper, to raise the question of the rationality and reliability of this method. Regarding rationality, we observe that all the parameters composing the tuple returned by the function $\xi(\cdot)$, when it is applied to an arc, $a_{jk}$, are well known in the literature. As for reliability, this can only be determined experimentally. In this regard, it should be pointed out that the experiments described in Section 4 confirm the correctness and usefulness of the parameters characterizing the integration method.
Finally, we have seen above that top users play a key role in the study of $\mathcal{N}$. The same is true for the network $\mathcal{N}'$. In fact, by comparing the top users related to the same property in the networks $\mathcal{N}$ and $\mathcal{N}'$, it is possible to extract insights on the different roles of structure and content in the user dynamics of discording communities.
Let us now examine the time complexity associated with the structure and content integration activities. We start this characterization by assuming that all embeddings are already pre-computed and accessible in $O(1)$. Indeed, this is exactly the case in our framework, where the previous phase, i.e., content investigation, is essentially performed only once for the dataset under investigation. Therefore, we can store these data in a hash table, which allows us to access them in constant time. This is also true for all the sets defined in Section 3.1, which can be represented by data structures such as matrices or hash tables, allowing access to a single piece of information in $O(1)$. Furthermore, the similarity between two embeddings can be computed only once and stored in a matrix whose access has time complexity $O(1)$.
The core of this phase is the construction of the network $\mathcal{N}'$. In this case, we can represent it as a simple adjacency matrix with $|A|$ non-empty entries, and thus the time complexity of its construction is $O(|V|^2)$, where $V$ is the set of nodes. Nevertheless, $\mathcal{N}'$ also has an arc augmentation function, $\xi(a_{jk})$, where $a_{jk}$ is an arc of $\mathcal{N}'$, whose computation can be performed during the construction of $\mathcal{N}'$. The time complexity of $\xi(\cdot)$ coincides with the maximum time complexity needed to compute the features with which the arc is associated. Thanks to the pre-computation of the embeddings and similarity values, almost all features can be computed in linear time, in particular in $O(|I_{jk}|)$, where $I_{jk}$ is the set of interactions between the users represented by the nodes $v_j$ and $v_k$. The value of the feature $kde_{jk}$ is obtained by applying a KDE to the values of all the other features calculated in $O(|I_{jk}|)$; its time complexity is linear in the number of features considered. Wrapping up, the final time complexity of this phase is $O(|V|^2 \cdot I_{max})$, where $I_{max}$ is the maximum number of interactions between two users.