1. Introduction
Reddit is a very large social network, with information divided into topical fora called subreddits. In March 2023, Reddit had about 52 million active users (
https://backlinko.com/reddit-users, accessed on 13 February 2024) and over 3.4 million subreddits (
https://www.businessdit.com/how-many-subreddits-are-there, accessed on 13 February 2024). Analysis of pertinent literature shows that there are two main approaches to using Reddit-extracted data in research. The first focuses on a
single subreddit and analyzes various phenomena within it. Here, existing studies have been focused, among other topics, on politics [
1], gaming [
2], online harassment [
3], mental health [
4,
5], suicide prevention [
6], dermatology [
7], parenting [
8] or teaching [
9], and others [
10]. On the other hand, Reddit-based data,
treated as a single dataset, have been used in natural language [
11] or image processing [
12]. Here, the Reddit structure(s) was (were) ignored, and all considered posts were treated as a single large collection of texts. However, recent research [
13] suggested that additional large-scale studies are needed to properly capture the information structure of Reddit.
In this context, to the best of our knowledge, no research has used a large Reddit dataset while taking into account the existence and caveats (such as crossposts) of thematic subfora. This is somewhat surprising, given that it is precisely the existence of topical subfora that distinguishes Reddit from other social networks. Therefore, the aim of this contribution is to deliver a more comprehensive understanding of the information structure of Reddit. In particular, the focus of the reported work is on automatically establishing common topics of interest shared by readers of individual subreddits.
Here, we note that there exists a Reddit-only phenomenon that is almost completely absent from Reddit research: crossposts, i.e., posts that were posted to one subreddit and later linked in another one. In other words, crossposts represent the situation when individual users of a specific subreddit believe that selected posts could be of interest to readers of another subreddit. As such, crossposts explicitly capture how individual users conceptualize similarities between subreddits. However, it has to be acknowledged that crossposts are relatively infrequent (vis-à-vis the volume of data posted regularly on Reddit). This means that their appearance can be used only as corroboration of the existence of common interests between readers of different subreddits (i.e., it cannot be treated as ground truth for subreddit similarity). Nevertheless, it is our belief that they are worthy of closer examination.
In this context, the aim of this contribution was to establish (1) ways to automatically uncover topical similarities between subreddits and (2) the role that crossposts can play in this process. The proposed approach to reach these goals is based on natural language processing (NLP), Named Entity Recognition (NER), and graph networks. Here, NER is used to find entities which are then used to build graph networks. The goal of the process is to capture and elaborate on similarities between found named entities.
We note that this work can have practical application. Let us assume that researchers are interested in topics related to mental health (as they are represented in the Reddit posts). In this case, the most natural subreddit for their work would be r/mentalhealth. However, if it would be possible to establish which other subreddits involve topics similar to these discussed in r/mentalhealth, then such subreddits could be also included in the research. Similarly, researchers interested in former President Donald Trump could consider all subreddits where Donald Trump is the “common topic”, instead of focusing only on the r/The_Donald.
The remaining parts of this contribution are organized as follows.
Section 2 presents the state of the art pertinent to the area of interest of this contribution, i.e., research related to use of natural language processing and graph networks as applied to Reddit. Next,
Section 3 describes the collected dataset. The proposed approach is introduced in
Section 4 and followed by analysis of experimental results presented in
Section 5. Finally, the research is summarized in
Section 6, where future research directions are also elaborated. In addition, this work is accompanied by supplementary information in
Appendix A.
2. Pertinent State of the Art of Reddit-Related Research
Two recent overviews of Reddit-related research [
14,
15] suggested that natural language processing and graph networks are the most popular techniques used to analyze various aspects of Reddit-derived datasets. The following sections discuss the state of the art of the NLP methods, graph networks, and performance evaluation found in Reddit-related work.
2.1. Reddit and Natural Language Processing
One of the main NLP-anchored research directions found in Reddit-related literature concerns extracting the main point(s) of a text [
16]. The umbrella term for these methods is
topic modeling. In such research, one typically attempts to extract “main topics” from the textual data. Next, the output is further modeled, e.g., with graph networks. In this context, we note that relations in graph network models are easier to find for a small space of unambiguous features that clearly point to specific things, phenomena, people, etc. Hence, methods with a reasonably small output space are of particular value. Such methods also deliver results that are easier for humans to comprehend. Therefore, let us now briefly discuss selected popular and recent topic modeling methods [
17,
18,
19] and justify the choice of the method used in this work.
The prime example of a “classical” method is Non-negative Matrix Factorization (NMF). This method utilizes algebraic tools for matrix factorization to turn the word-document matrix into matrices of document-topic and topic-word dimensions. Algebraically speaking, given a set of text documents (D) and words (W), a matrix V of dimensions |D| × |W| is built, where each row contains the word frequencies of one document (i.e., a word-frequency matrix). The resulting matrix is always non-negative, since word frequencies cannot be negative (a word can only appear in a document zero or more times). Next, using one of the methods for matrix factorization, two matrices are calculated whose product approximates the original matrix. The two matrices have dimensions |D| × |T| and |T| × |W|, where T is the set of topics, and provide information regarding which topic belongs to which document and which word belongs to which topic, respectively. Hence, the output (the topics) consists of combinations of the words from the input documents.
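The construction of the non-negative document-word matrix described above can be sketched in a few lines (a toy corpus of our own choosing; the factorization itself would typically be performed by a library such as scikit-learn):

```python
# Sketch (not the paper's implementation): building the non-negative
# |D| x |W| word-frequency matrix V that NMF would factorize into a
# |D| x |T| document-topic matrix and a |T| x |W| topic-word matrix.
docs = [
    "cats chase mice",
    "dogs chase cats",
    "stars orbit galaxies",
]
vocab = sorted({w for d in docs for w in d.split()})

# V[i][j] = frequency of word j in document i (always >= 0)
V = [[d.split().count(w) for w in vocab] for d in docs]

print(len(V), "x", len(V[0]))  # |D| x |W|
```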
The second approach is Latent Dirichlet Allocation [
20] (LDA). LDA is a generative model that uses Dirichlet distribution to assign documents their topics based on the words within them. This method is outlined in Algorithm 1.
Algorithm 1: LDA pseudocode (simplified).
In LDA, there are two hyperparameters not mentioned in the simplified version of Algorithm 1. These are:
α (alpha), a document density factor (i.e., the weight of a topic in a document) controlling the number of topics expected in the document (the higher the value, the more topics are envisioned to exist in a document);
β (beta), a topic word density factor (i.e., the weight of a word in a topic) controlling the distribution of words per topic in the document (the higher the value, the larger the number of words that may belong to each topic).
Similarly to NMF, the output of this approach consists of a probability that a given word is a part of a specific topic in a document. The output space consists, again, of different combinations of words from the input documents. We note that the two hyperparameters provide some form of “manual control” of the expected size of the output space.
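The generative story behind LDA, including the role of the two hyperparameters, can be illustrated with a toy sketch (all names, values, and the vocabulary below are our own illustrative choices, not a trained model):

```python
import random

# Toy sketch of LDA's generative process: a low alpha yields documents
# concentrated on few topics; beta shapes how words spread across a topic.
random.seed(0)

def dirichlet(concentration, k):
    # Sample from a symmetric Dirichlet via normalized Gamma draws
    xs = [random.gammavariate(concentration, 1.0) for _ in range(k)]
    s = sum(xs)
    return [x / s for x in xs]

n_topics = 3
vocab = ["reddit", "graph", "entity", "post", "node"]
theta = dirichlet(0.1, n_topics)                          # document-topic mixture (sparse)
phi = [dirichlet(0.5, len(vocab)) for _ in range(n_topics)]  # topic-word distributions

# Generate one short document: pick a topic per word, then a word from it
doc = []
for _ in range(5):
    z = random.choices(range(n_topics), weights=theta)[0]
    doc.append(random.choices(vocab, weights=phi[z])[0])
print(doc)
```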
Other methods that can be used for statistical topic modeling are, for instance, the correlated topic model (CTM) [
21] and the Pachinko Allocation Model (PAM) [
22]. However, these two (and many others) share the same problem, namely the size of the output space. Since the output topic model is built from the words within the documents, the number of possible topics is enormous. Even though the number of topics and the number of words in a topic can be manipulated (as described in the case of LDA), the final range of topics remains gigantic as, theoretically, any word from any document can be included in a topic. Therefore, methods from this area are not easily applicable to the analysis of large-scale Reddit-derived dataset(s).
Moving to more recent approaches, it is well known that the transformer [
23] architecture revolutionized the Natural Language Processing field [
24]. It can also be used for topic modeling. Here, BERT-like [
24] models use a pre-trained model which is capable of capturing the general semantic meaning of text in low-dimensional vectors (most commonly reported size is 768 or 1024). These models are then fine-tuned on a particular task, e.g., topic modeling. Among many fine-tuned BERT-like models, in the context of this contribution, the most popular would be BERTopic [
25], which has shown its applicability to Reddit-derived data in previous studies [
26]. BERTopic’s process can be explained in the following simplified steps:
Conversion of text to vectors using a pre-trained BERT model.
Dimensionality reduction with the UMAP model [
27].
Clustering using the HDBSCAN method [
28].
Conversion from vectors to topics using TF-IDF (i.e., extraction of most meaningful words for each cluster).
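The last step of this pipeline can be approximated with a small TF-IDF-style sketch (our simplification, with hypothetical clusters; BERTopic's actual c-TF-IDF weighting differs in its details):

```python
import math
from collections import Counter

# Sketch: after clustering, extract the most characteristic words of each
# cluster by scoring term frequency against cross-cluster document frequency,
# treating each cluster as one concatenated document.
clusters = {
    0: "cat dog pet cat vet dog",
    1: "stock market trade market price",
}
n = len(clusters)
df = Counter()                      # in how many clusters each word appears
for text in clusters.values():
    df.update(set(text.split()))

def top_words(text, k=3):
    tf = Counter(text.split())
    scores = {w: c * math.log(1 + n / df[w]) for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

topics = {cid: top_words(t) for cid, t in clusters.items()}
print(topics)
```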
Even with its highly accurate results in various applications [
29,
30,
31], this approach, again, is problematic due to the fact that the topics are built from the whole space of words in the documents. Moreover, there are no simple mechanisms to control the size of the output space.
Depending on the definition, one may also consider text summarization [
32] as part of text modeling methods. There are two main categories of text summarizing models [
33]: extractive (which extract particular sentences or sub-text) and abstractive (which generate the summary not necessarily using the words from the input text).
An example of an extractive method is Latent Semantic Analysis (LSA) [
34]. As in many other extractive methods, instead of selecting words and assigning them to topics, LSA looks for the most meaningful sentences (or other “larger units” of text) and uses them to construct a summary. This, however, produces an even larger space of possible topics extracted from the text because each extracted topic is basically a set of sentences.
Now, let us consider exemplary methods from the abstractive text summarization category. These methods generate a new text (the summary) based on the input text. The output does not necessarily need to include any of the sentences (or even words) from the original text. Here, the most recent research used deep neural networks, with an encoder–decoder architecture as the backbone of summarization models [
35]. As representatives of this approach, the most popular models are BERTSUM [
36], Pegasus [
37] (further challenged by SimCLS [
38]) or Prophetnet [
39].
Apart from the neural networks approach, there are also probability-based methods such as BRIO [
40]. BRIO challenges the deterministic approach to modeling and uses a novel training paradigm assuming a non-deterministic distribution. We note that the abstractive approach is still subject to a gigantic topic (summary) space, whose vocabulary can easily extend to any word similar, in terms of text embeddings, to the input text.
Finally, feature extraction has been approached using Named Entity Recognition (NER) [
41,
42,
43]. Although NER is not strictly a topic modeling method, it allows extracting crucial features—named entities. A named entity is a phrase that clearly identifies a person (PER), location/place (LOC), organization (ORG), or others (MISC). The most recent NER models are based on BERT-like [
24] transformers with the attention mechanism. The most popular NER model found in the literature is dslim/bert-base-NER (
https://huggingface.co/dslim/bert-base-NER, accessed 1 February 2024). This approach is based on a pre-trained model for general language modeling.
It has been further fine-tuned in the context of named entity recognition (which is a subset of feature extraction). Specifically, dslim/bert-base-NER has been trained on a single NVIDIA V100 GPU with recommended hyperparameters from the original BERT paper [
24] using CoNLL-2003 [
44]. The reported performance on the test dataset was a 91.3% F1 score, 90.7% precision, and 91.9% recall. Although not a necessity for this contribution, the model also returns the classes of extracted named entities.
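As a quick sanity check, the reported F1 score is consistent with the reported precision and recall, F1 being their harmonic mean:

```python
# Consistency check of the reported dslim/bert-base-NER metrics:
# F1 = 2 * P * R / (P + R)
precision, recall = 0.907, 0.919
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.913, matching the reported F1 score
```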
The main advantage of the NER models over the previously covered topic modeling methods is their output. They return results from the domain of named entities, which is far smaller than previously mentioned techniques. This reduction in the size of the output set is very beneficial, considering the next step of the method—building graph networks. Furthermore, building graph networks and finding similarities between them is much easier and unambiguous if the nodes (named entities) refer to particular real-life entities instead of sets of words (like in the previous topic modeling methods). Obviously, it is possible to build networks from topics [
45,
46] or even key phrases [
47]. However, in what follows, the aim is to build simpler and easily cross-referenceable graphs.
In this context, we consider an example of two pairs of graphs (two graphs made with named entities and two graphs made from keywords/topics). A graph built from named entities NASA, Stephen Hawking and The Sun can be easily matched with the second graph, with nodes NASA, USA and United States Congress. Here, NASA is unambiguously the common ground topic (entity). If one graph had topics built from keywords, e.g., national-astronomy-organization, famous-science-people, astronomy-planets and the other captured concepts american-organizations, countries-national and government-organization, finding the correspondence would require use of additional techniques such as semantic metrics, text embedding similarity, or others, which would effectively hinder the methods’ accuracy and readability. Obviously, this is by no means the best metaphor for comparing topic modeling and NER, rather a visualization of an important issue that has to be considered when developing data processing pipelines.
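The advantage illustrated by this example can be made concrete: with named entities as nodes, finding the common ground between two graphs reduces to a plain set intersection of node labels, with no semantic-similarity machinery required (a toy sketch using the entities from the example):

```python
# Two toy subreddit graphs stored as adjacency dicts, with named entities
# as node labels (illustrative data from the example in the text).
graph_a = {"NASA": ["Stephen Hawking", "The Sun"],
           "Stephen Hawking": ["NASA"],
           "The Sun": ["NASA"]}
graph_b = {"NASA": ["USA", "United States Congress"],
           "USA": ["NASA"],
           "United States Congress": ["NASA"]}

# The shared "common ground" entities are simply the intersection of node sets
common = set(graph_a) & set(graph_b)
print(common)  # {'NASA'}
```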
For completeness, let us note that, in the case of ambiguity, there are also models for Named Entity Linking (NEL) and/or Named Entity Disambiguation (NED) [
48] which determine whether two or more named entities refer to the same entity. However, at this stage, these approaches are out of scope of the reported results. Nevertheless, their application may be worth investigating.
2.2. Graph Networks
The second part of the literature review concerns graph networks and their construction and utilization in social media research. Specifically, in the context of this work, after topics are extracted, graph networks can be formed and analyzed from the perspectives of selected aspects of the data [
49]. Since in this contribution the relations that are captured are bidirectional, and since their significance varies (as described in
Section 4), weighted undirected graphs are of particular interest.
First, let us discuss the most common metrics used in the majority of graph analyses. Here, let us assume that a node of the graph is denoted as v.
Degree—the number of edges connected to a node v. For weighted graphs, the sum of weights of all edges connected to a node v may also be used.
Degree centrality (normalized)—the fraction of all nodes that a given node v is connected to.
Average neighbor degree [
50]—the average of the degrees of all nodes that node v is connected to: k_nn(v) = (1/|N(v)|) · Σ_{u ∈ N(v)} deg(u), where N(v) is the set of neighbors of v.
Clustering—the fraction of all possible triangles (a K3 graph, i.e., a complete graph of size 3) that pass through a node v. A triangle is a set of three nodes that are all connected to each other. The formula is c(v) = 2 · T(v) / (deg(v) · (deg(v) − 1)), where T(v) is the actual number of triangles that pass through node v [
51].
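These metrics can be computed directly for a small graph stored as an adjacency dict (a sketch with our own helper names, mirroring the definitions above):

```python
from itertools import combinations

# Small undirected toy graph as an adjacency dict
G = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

def degree(v):
    return len(G[v])

def degree_centrality(v):
    # Fraction of all other nodes that v is connected to
    return degree(v) / (len(G) - 1)

def avg_neighbor_degree(v):
    # k_nn(v) = (1/|N(v)|) * sum of neighbor degrees
    return sum(degree(u) for u in G[v]) / len(G[v])

def clustering(v):
    # c(v) = 2*T(v) / (deg(v)*(deg(v)-1)), T(v) = triangles through v
    k = degree(v)
    if k < 2:
        return 0.0
    t = sum(1 for u, w in combinations(G[v], 2) if w in G[u])
    return 2 * t / (k * (k - 1))

print(degree("c"), degree_centrality("c"), clustering("a"))
```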
Let us now shortly outline the relevant graph network methods. There are many applications of building graphs from textual features: from general ontologies [
52], modeling the real world with graph relations and knowledge graphs [
53] enabling reasoning methods, to particular applications in biology [
54] and social graph networks [
55]. The last one is of particular interest, provided that the dataset under study is derived from Reddit, which is a social network.
In the literature, graph networks modeling social media most commonly apply the user-as-node [
56,
57,
58], and community-as-node [
59,
60,
61,
62] representations. There, connections (edges) are built on the basis of interactions (users commenting/talking) or common membership (e.g., users subscribing to the same groups/subreddits). However, in this work, a different approach is undertaken. Specifically, the features (mined from the posts) are used to find similarities between communities. In this context, the closest domain covered in the literature is work related to broadly understood recommender systems. Therefore, while appreciating the difference, works related to such systems were reviewed to find general methods, performance evaluation approaches, etc.
First, there are works concerned with user recommendation. For example, there is a user-to-user recommendation studied in a work about X (formerly Twitter) [
63]. Interestingly, this work is based on application of LDA. In particular, the topics extracted using LDA are used to rank users as potential recommendation targets and to build the final recommendation set. The results were tested for a 2010 dataset of tweets. Depending on the subset of the dataset, the recommending systems achieved 20–50% recall. Since the specific
recall metric was not defined, the general definition, recall = TP / (TP + FN) (also known as hit rate, hit ratio, sensitivity, or power), was assumed. As is clear from multiple related works, recall is the main metric for evaluating the rankings of recommendation systems. Additionally, the 2010 contribution claims that “graph-based methods have high precision, since graph information is known to be a reliable estimator of social influence, but also have low recall due to possible low connectivity”. This claim is further addressed when presenting obtained results in
Section 5.
Moving closer to the core topic of this work, there are hashtag recommendation systems. Hashtags are conceptually similar to named entities, since they point to a specific phenomenon, person, event, organization, etc. One of the works [
64] builds the graphs using hashtags, mentions, following information, and topics. Next, with a graph community detection algorithm (the Clique percolation method [
65], the Louvain algorithm [
66], and the label propagation algorithm [
67]), the existing communities are determined. Finally, the original algorithm produces top
N hashtags to be used in the recommendation. The method achieved better results in terms of hit rate than previous similar research (43% vs. state-of-the-art 37%).
Another work [
68] notices the sparsity of hashtags in tweets and focuses on utilizing external sources to build the tweet–tweet similarity. This method employs disambiguation based on Wikipedia and The Guardian searches, word embeddings, and translation to extract the most significant hashtags for recommendation. This work provided results in terms of
precision@k and
recall@k, where
k is the number of top results returned by the model. The state-of-the-art results provided in the work reach recall in the range of 30–40% and precision in the range of 20–35% (exact numbers were not provided). The proposed method performed better, and reached recall in the range of 40–50% and precision of about 40–50% (again, exact numbers were not provided).
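For reference, the two ranking metrics used in these works can be stated in a few lines (hypothetical data; the helper names are ours):

```python
# precision@k = relevant items among the top-k results / k
# recall@k    = relevant items among the top-k results / all relevant items
def precision_at_k(ranked, relevant, k):
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

# Toy ranking of recommended hashtags against a hypothetical relevant set
ranked = ["#nasa", "#cats", "#space", "#dogs", "#mars"]
relevant = {"#nasa", "#space", "#mars", "#moon"}
print(precision_at_k(ranked, relevant, 5), recall_at_k(ranked, relevant, 5))
```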
Moving away from hashtags and towards Reddit, there were studies where Reddit was used together with other sources. For example, a study used Reddit and Pinterest for ownership recommendation [
69]. Since this topic is far from the research reported here, the specific methodology is skipped. However, an interesting part is performance evaluation. This recommender system uses hit rate (recall) similarly to the previous study. In addition, it also employs
area under the receiver operating characteristic curve (ROC AUC). The use of the ROC AUC metric is particularly important because it prevents the algorithm from maximizing recall by recommending all potential choices.
Another approach, dealing with two data sources, is a study based on Reddit and Twitter [
70]. There, the goal is to cross-reference the user’s tweets with potentially interesting Reddit threads (this term was not explicitly explained and is assumed to be equivalent to a post or a post + comments). Applying natural language processing with WordNet to tweets, the authors built models (Naive Bayes, Random Forest and SVM) to generate an interest profile for the user and then to recommend the matching genres. Here, again, the important aspect is the evaluation methodology which accounts for top five results and evaluates them using the
accuracy,
precision and
recall metrics.
Finally, the closest to the topic of recommendation on Reddit is a work from 2019 [
71]. Here, the authors built two networks: a User–subreddit–User (UsU) network, where nodes are users, and edges exist if users have at least seven subreddits in common; and a Subreddit–user–Subreddit (SuS) network, where nodes are subreddits, and edges exist if subreddits have at least one user in common. The UsU network contained 2751 nodes and 845,128 edges, while the SuS network contained 847 nodes and 22,940 edges. Hence, the analysis ultimately focused on 2751 users and 847 subreddits. The features of each node were standard graph metrics: node degree (weighted and unweighted), degree centrality, closeness centrality (weighted and unweighted), betweenness centrality (weighted and unweighted), clustering coefficient (weighted and unweighted), HITS hub score [
72,
73] and PageRank score (weighted and unweighted) [
73,
74]. Additionally, in the node-to-node interaction, two more metrics were included: hop count and weighted distances between the corresponding nodes. Then, modularity-based community detection was used to discover 1965 communities in the UsU network and 292 communities in the SuS network. The community label was also added to the node feature set. To further augment the feature set, based on the network architecture, a node embedding method was employed. Here, Node2Vec [
75] was used. Briefly speaking, Node2Vec is a graph network node embedding model based on the Word2Vec [
76,
77] concept of building embeddings of nodes (words) based on their context. The needed context was generated using random walks. Further, the paper proposed using content-based analysis to extend the features. It used keywords extracted from users’ posts with the TF-IDF method [
78].
Using all these features, a vector representing a user–subreddit relation was built. The negative examples (cases where recommending a subreddit to a user would be incorrect) were sampled from pairs of users and subreddits they do not belong to. This way, the dataset was augmented with negative cases. The evaluated models were logistic regression, a neural network, and a random forest classifier. The best results were achieved using all features with a random forest classifier: 93% accuracy, 93% precision, 93% recall, a 93% F1-score, and a 99% ROC AUC. The research concluded by highlighting the relevance of both network and content linguistic features when fusing different sources of information. The 2019 research is a particular inspiration for the contribution of this paper, where extending the graph network information with the meta-data of posts (see
Section 4) is explored.
Additionally, there appeared a meta study for the recommender systems, which raised an important issue of bias in the domain of recommendation data [
79]. It stated that biases based on gender, ethnicity, race, etc., need to be avoided for ethical reasons. Therefore, in the design of this contribution, manual labeling and any external interference (such as dataset manipulation, subjective subreddit choice) that could introduce bias in the dataset was eliminated. Moreover, it can be claimed that none of the applied methods are known to introduce bias of the type listed above to the resulting models.
To summarize, while there are many works on hashtag, user, topic, and subreddit recommendation, the literature has very few examples of content-based small output space discovery methods. In this context, what follows uses a small output space method, named entity recognition, to build graph networks for the large-scale Reddit dataset. Moreover, the proposed evaluation methods are based on methods that have been used in similar works (reported above).
2.3. Use of Crossposts in Reddit Information Structure and Content Analysis
As noted earlier, one of Reddit’s features that is almost non-existent in the literature are the so-called crossposts. Crossposts were introduced in 2017 (
https://www.reddit.com/r/modnews/comments/7a5ubn/crossposting_coming_soon_to_your_subreddit, accessed on 13 February 2024). These are posts which appeared in one subreddit and were manually linked (crossposted) to a different one by a user. Since crossposts reference the original subreddit, they (1) show which posts (and named entities found in them) are seen by the readers of one subreddit as being of likely interest to the readers of another subreddit; and (2) establish a directional link between subreddits. We note that crossposts are the only user-generated content that explicitly indicates potential existence of shared interests between separate communities (the originating subreddit and the subreddit the post was crossposted to). In this context, authors of [
80] suggested that crossposts may help to better understand the information structure of Reddit [
14].
Obviously, knowledge brought by crossposts needs to be taken with caution. The fact that a user A believes that a post X from one subreddit would be of interest to the readers of another subreddit does not represent a universal truth. It is possible that user A is mistaken (or their crosspost is a result of a deliberate misinformation campaign). However, after manual inspection of a randomly picked sample of 200 crossposts, it was established that fewer than 5% could be seen as potentially malicious or misguided. Hence, the application of crossposts in the analysis of subreddits is further explored in this contribution.
Nevertheless, it is important to note that since crossposts are scarce and may be accidental (rather than a result of in-depth analysis performed by the user), they should be seen more as a hint than as hard evidence. Therefore, their use in data analysis should not rely on standard effectiveness metrics, e.g., accuracy, precision, etc. In the literature, classification problems with an incompletely observed response variable are often addressed with the
positive-unlabeled (PU) approach [
81], which offers two solutions: (1) assumption that that unlabeled cases are negative, or (2) estimation of the distribution of
Y. However, in the case of crosspost-based analysis, missing are features that the PU needs. First, there are no negative samples (i.e., topics that are definitely not interesting to the two subreddit communities). Second, the positives (i.e., topics interesting to the two groups) are captured too rarely. Moreover, estimators of
Y are not known to exist (and cannot be expected to be established in the future). Therefore, instead of PU-evaluation, the use of
recall and
AUC was selected. We note that recall rewards capturing the entities that actually appeared in crossposts, while AUC counters the bias of using recall alone. Let us also recall that this approach was used for ranking evaluation in similar problems described in
Section 2.2, e.g., subreddit recommendation [
82,
83], hashtag recommendation [
64,
68,
84,
85,
86], or in the case of user recommendation systems [
63]. There, quality of rankings was evaluated using hit-rate (i.e., recall@5, or recall@10) and ROC AUC. Taking into account the size of the available dataset, it was decided to apply
recall@10 and
ROC AUC to estimate how well the entities detected with the proposed method as being common to two subreddits fit with the entities found in the crossposts.
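A minimal sketch of the selected evaluation, with our own function names and toy data (using the rank-comparison formulation of ROC AUC, i.e., the probability that a positive is scored above a negative):

```python
# recall@10 over entities observed in crossposts (hit-rate style)
def recall_at_10(ranked_entities, crosspost_entities):
    hits = sum(1 for e in ranked_entities[:10] if e in crosspost_entities)
    return hits / len(crosspost_entities)

# ROC AUC from pairwise rank comparisons; ties count as half a win
def roc_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0]))  # 0.75
```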
4. The Proposed Approach
Let us now describe the method, which is the core contribution of the reported research. For easier understanding,
Figure 1 describes the process step by step, starting with already described data acquisition up to the similarity calculation.
To offer a better understanding of the approach, the following simplified pseudoalgorithm summarizes it as well:
Acquire posts from two subreddits.
Preprocess the posts.
- 2.1
Clean deleted/removed posts;
- 2.2
Filter meaningful posts based on their score;
- 2.3
Extract and store the required metadata for each post (score, number of comments, total awards received, whether it contains an image).
Extract named entities from posts using NLP models and deduplicate them, if needed.
Create a graph network using named entities for each of the subreddits; here,
- 4.1
Named entities become nodes;
- 4.2
An edge joins two named entities if they appear in a post together. Multi-edges (when the entities appear jointly in multiple posts) are merged and (for each metric) their weight becomes the sum of scores originating from the posts in which the two named entities appear jointly.
Calculate graph network characteristics for the two subreddit graphs separately and assign to each node (degree, clustering, average neighbor degree).
Filter out named entities (nodes and their edges) that appear only in a single subreddit from the pair (from here on, only subgraphs with named entities common to both subreddits are considered).
For each measure (metadata-based and network characteristics) independently, apply the following:
- 7.1
For each (common) named entity,
- 7.1.1
Calculate the similarity between the measures for the named entities in the two subgraphs.
- 7.2
Create a ranking of named entities based on their similarity;
- 7.3
Obtain the top entries (named entities).
The final result for two subgraphs representing two subreddits is a set of rankings (consisting of 10 named entities each) obtained independently for each of the metadata-based and network characteristics-based similarities. For better understanding, a step-by-step example is provided in
Appendix A to this work.
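The steps above can be condensed into a toy end-to-end sketch (all entity names, scores, and the concrete similarity function are our illustrative choices, not necessarily those used in the reported implementation):

```python
from itertools import combinations

def build_graph(posts):
    """posts: list of (entities, score). Nodes are named entities; an edge's
    weight is the summed score of the posts where the two entities co-occur
    (multi-edges merged, as in step 4.2)."""
    edges = {}
    for entities, score in posts:
        for u, v in combinations(sorted(set(entities)), 2):
            edges[(u, v)] = edges.get((u, v), 0) + score
    return edges

def weighted_degree(edges):
    deg = {}
    for (u, v), w in edges.items():
        deg[u] = deg.get(u, 0) + w
        deg[v] = deg.get(v, 0) + w
    return deg

sub_a = build_graph([(["NASA", "SpaceX"], 10), (["NASA", "Mars"], 5)])
sub_b = build_graph([(["NASA", "Mars"], 8), (["Mars", "Elon Musk"], 2)])

deg_a, deg_b = weighted_degree(sub_a), weighted_degree(sub_b)
common = set(deg_a) & set(deg_b)          # step 6: keep only shared entities
# step 7: rank common entities by similarity of one measure (weighted degree)
sim = {e: 1 / (1 + abs(deg_a[e] - deg_b[e])) for e in common}
ranking = sorted(sim, key=sim.get, reverse=True)[:10]
print(ranking)
```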
Let us now provide more details of each step and describe the method fully.
The dataset described above was used to create a graph network for each subreddit. As noted, thus far, graph networks have been created with users or subreddits as nodes. As one of the contributions of this work, graph networks were built for each subreddit, with named entities as nodes. Hence, rather than focusing on relations between users and/or communities (as considered in the past), the spotlight was shifted to the content of individual subreddits.
The first step of the approach consists of extracting and preprocessing all posts, as described in
Section 3.
In the second step, a post is assigned aggregated metadata. Specifically, the used metadata were (i)
score—the number of upvotes reduced by the number of downvotes, summed over each post containing a given entity; (ii)
number of comments; (iii)
number of awards received—awards are special, paid badges that users can assign to other users’ posts (
https://www.reddit.com/r/awards, accessed on 13 February 2024); and (iv)
does a post contain an images—number of posts that, in addition to the entity under consideration, contained also an image (in any format).
It should be noted that the following metadata were also tried:
is_meta (mean) (
https://reddit.com/r/NoStupidQuestions/comments/b9xe4d/what_exactly_is_a_meta_post, accessed on 13 February 2024),
is_original_content (mean) (
https://www.reddit.com/r/OutOfTheLoop/comments/1vlegj/what_does_oc_mean, accessed on 13 February 2024),
is_video (mean),
over_18 (mean),
spoiler (mean), and
upvote_ratio (mean). However, they did not visibly influence the results, most likely because they are represented as boolean values (except upvote_ratio). We interpret this fact as follows: boolean information was not sufficient for any of the tried similarity measures to influence the similarity. Zero-or-one values can only be equal or not equal; no more nuanced similarity can be represented, as is the case with, e.g., floating point numbers. However, the joint use of all the metadata, together with all available posts, may be explored in the future.
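The per-entity aggregation described above can be sketched as follows; the post layout and field names (score, num_comments, total_awards_received, has_image, mirroring the Reddit API) are assumptions made for this example:

```python
# Sketch of the per-entity metadata aggregation: for every named entity,
# sum the score, comment count, and awards of the posts containing it,
# and count the posts that also contained an image.
from collections import defaultdict

def aggregate_metadata(posts):
    agg = defaultdict(lambda: {"score": 0, "num_comments": 0,
                               "total_awards_received": 0, "has_image": 0})
    for post in posts:
        for entity in post["entities"]:
            m = agg[entity]
            m["score"] += post["score"]
            m["num_comments"] += post["num_comments"]
            m["total_awards_received"] += post["total_awards_received"]
            m["has_image"] += int(post["has_image"])  # count of image posts
    return dict(agg)

posts = [
    {"entities": {"NASA"}, "score": 10, "num_comments": 3,
     "total_awards_received": 1, "has_image": True},
    {"entities": {"NASA", "Mars"}, "score": 4, "num_comments": 1,
     "total_awards_received": 0, "has_image": False},
]
print(aggregate_metadata(posts)["NASA"]["score"])  # 14
```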
The next step is named entity extraction with dslim/bert-base-NER (
https://huggingface.co/dslim/bert-base-NER, accessed on 13 February 2024) (a pre-trained BERT base model fine-tuned on the CoNLL-2003 [
44] dataset), which was already covered in
Section 2.1. Extracted entities are deduplicated. The result of this step is a set of named entities extracted from posts.
The following step is the creation of graphs representing each subreddit. Here, an edge connects a pair of nodes (recognized named entities) if they appear together in at least one post. Multi-edges are combined into single edges, and their edge weight equals the sum of the scores of the posts in which the connected entities appear. The result is a graph representing the structure of the information content of a given subreddit. It includes all recognized named entities and their features, encoded in nodes, edges, and edge weights.
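Assuming networkx and an illustrative post layout (a set of extracted entities plus a score per post), this construction can be sketched as:

```python
# Sketch of the subreddit graph construction: named entities as nodes, an
# edge when two entities co-occur in a post; multi-edges are merged and the
# weight is the sum of the scores of the posts in which both entities appear.
from itertools import combinations
import networkx as nx

def build_entity_graph(posts):
    G = nx.Graph()
    for post in posts:
        G.add_nodes_from(post["entities"])
        for u, v in combinations(sorted(post["entities"]), 2):
            if G.has_edge(u, v):
                G[u][v]["weight"] += post["score"]   # merge multi-edges
            else:
                G.add_edge(u, v, weight=post["score"])
    return G

posts = [
    {"entities": {"NASA", "Mars"}, "score": 3},
    {"entities": {"NASA", "Mars", "SpaceX"}, "score": 5},
]
G = build_entity_graph(posts)
print(G["NASA"]["Mars"]["weight"])  # 8 = 3 + 5
```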
All individual subreddit graphs are available in a Zenodo repository (
https://zenodo.org/record/8037573, accessed on 13 February 2024).
Table 1 summarizes the basic characteristics of obtained graphs. It also illustrates their scale. What is important to note is the large heterogeneity of obtained graphs. This is clearly visible by the differences between the means and the medians of, for instance, the node count (1919 vs. 918) and the edge count (2397 vs. 495).
Furthermore, the graphs have a nonuniform distribution of periphery sizes [
90]. In 47% of the graphs, 10% of the nodes contribute to the periphery. On the other hand, in 25% of the graphs, over 50% of the nodes belong to the periphery. A similar situation emerges for isolates (nodes with zero degree): a total of 40% of the graphs have over 50% of isolated nodes. This is consistent with the graph density [
91] being, on average, barely 0.002. To further highlight the “1% rule” of the graph structure, it is worth mentioning that in 74% of the graphs, the average shortest path is less than three, meaning that, on average, any node is connected with any other node within a distance of two.
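These statistics can be reproduced on a toy graph with networkx; the periphery is taken as the set of nodes whose eccentricity equals the diameter, computed here on the largest connected component, since eccentricity is undefined for disconnected graphs:

```python
# Toy reproduction of the reported graph statistics, assuming networkx.
import networkx as nx

def graph_stats(G):
    n_isolates = len(list(nx.isolates(G)))  # zero-degree nodes
    density = nx.density(G)
    # periphery: nodes with eccentricity equal to the diameter, computed on
    # the largest connected component (eccentricity needs connectivity)
    core = G.subgraph(max(nx.connected_components(G), key=len))
    periphery_share = len(nx.periphery(core)) / G.number_of_nodes()
    avg_path = nx.average_shortest_path_length(core)
    return density, n_isolates, periphery_share, avg_path

G = nx.path_graph(4)    # 0-1-2-3
G.add_node("isolate")   # one zero-degree node
density, n_isolates, periphery_share, avg_path = graph_stats(G)
print(n_isolates, periphery_share)  # 1 0.4
```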
5. Experimental Results and Their Analysis
The outcomes of applying the proposed method to the collected dataset are now discussed.
Let us start from describing the evaluation process. For each pair of subreddits, named entities from posts and named entities from crossposts are considered separately. We note that named entities that were extracted from crossposts from subreddit A to subreddit B and from B to A are combined into one set (regardless of the direction of a crosspost). In other words, named entities from crossposts are treated as being present in both subreddits. This can be understood in the following way: if a post with entities X, Y, and Z was crossposted from subreddit A to subreddit B, then these three named entities materialized in the set of named entities in crossposts in subreddit B.
We recall that for each individual similarity measure, named entities (from posts appearing in pairs of subreddits) are compared and ordered according to a given similarity measure (rankings resulting from the application of the proposed approach). Next, the set of named entities from posts and the set of named entities from crossposts (only) are compared and evaluated using metrics applied in similar problems (as described in
Section 2.2)—recall@10 and AUC. Specifically, recall@10 and AUC metrics are calculated between the ranking of similarity of named entities originating from posts and crossposts.
Recall is calculated using the following formula: recall = |X ∩ Y| / |X|, where
X is the set of named entities found in crossposts between the two considered subreddits,
Y is the set of named entities found in the two considered subreddits using the proposed method, and
|X| is the size of the set
X. AUC is calculated with
roc_auc_score from the sklearn library [
93] based on the overlap of named entities found in crossposts and those selected from all posts in the two considered subreddits.
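A minimal sketch of this evaluation, with made-up entities and similarity scores (sklearn's roc_auc_score is the only non-stdlib dependency):

```python
# Sketch of the evaluation metrics: recall@k between the crosspost
# entities X and the method's top-k entities Y, plus AUC over the full
# ranking via sklearn's roc_auc_score.
from sklearn.metrics import roc_auc_score

def recall_at_k(crosspost_entities, ranked_entities, k=10):
    X, Y = set(crosspost_entities), set(ranked_entities[:k])
    return len(X & Y) / len(X)  # |X intersect Y| / |X|

# entities ordered by similarity (higher score = more similar)
ranking = [("NASA", 0.9), ("Mars", 0.7), ("SpaceX", 0.4), ("Moon", 0.1)]
crossposts = {"NASA", "Moon"}  # corroboration set from crossposts

entities = [e for e, _ in ranking]
y_true = [int(e in crossposts) for e in entities]  # in crossposts or not
y_score = [s for _, s in ranking]

print(recall_at_k(crossposts, entities, k=2))  # 0.5: only NASA in top 2
print(roc_auc_score(y_true, y_score))          # 0.5
```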
The best results, in terms of the recall and AUC, are achieved for summing the metadata (e.g., summing the aggregated score of an entity in subreddits A and B) and calculating the negative absolute difference of the network characteristics (e.g., calculating −|degree_A − degree_B|).
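As a sketch, these two best-performing formulas are simply:

```python
# The two best-performing similarity formulas as minimal functions:
# metadata values are summed, while network characteristics use the
# negative absolute difference (identical values give the maximum, 0).

def metadata_similarity(value_a, value_b):
    return value_a + value_b           # e.g., score_A + score_B

def network_similarity(value_a, value_b):
    return -abs(value_a - value_b)     # e.g., -|degree_A - degree_B|

print(metadata_similarity(120, 80))    # 200
print(network_similarity(5, 9))        # -4
```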
The aggregated results for all pairs of subreddits are presented in
Table 2. Individual results for each subreddit pair can be found in the Zenodo repository (
https://zenodo.org/record/8037573, accessed on 13 February 2024). The main observation is that metadata metrics achieved better recall, while network characteristics provided better AUC results.
To evaluate the results, a reference point is needed. Although there is no direct comparison, previous works about recommendation systems use similar metrics (recall@10 and AUC) [
63,
64,
68,
69,
70,
79,
82,
85,
94,
95]. These works are discussed in
Section 2.2 and consider problems such as hashtag recommendation, subreddit recommendation, and user recommendation. While these problems are different from the task studied in this contribution, the top results, from studies of similar-scale datasets, range from 25 to 60% for the recall and from 60 to 90% for the AUC. While concerning different problems, they provide some indication of what performance measures could reasonably be expected when dealing with Reddit data. In
Table 2, the best recall results are achieved for the metadata:
score (38%),
number of comments (35%) and
total awards received (35%). Moreover, the best AUC results achieved for the network characteristics are as follows: degree centrality (60%), degree (51%) and clustering (50%). Therefore, the best recall and AUC results are in the middle of the range found in other, similar studies. However, we note also that none of the cited works dealt with a dataset with tens of thousands of pairs of compared subreddits. Separately, the question as to what should be reasonably expected when measuring recall for the node-based metadata and the AUC for the network-based metadata requires further investigation. However, this is out of scope of this contribution.
Since there was no single best similarity metric, additional attempts at combining the rankings to achieve better results were made. Taking the top results from two separate rankings (for example, the top five score-based entities and the top five degree-based entities) did not yield better results. In most cases, the obtained results were simply an average of the two rankings, which makes this approach worse than simply accepting the better of the two.
In the next steps, the obtained results were further explored. First, a working hypothesis was posed that topically close subreddits, such as subreddits about gaming (e.g., r/gaming, r/esports, r/Games, r/pcgaming) or politics and news (e.g., r/politics, r/news, r/worldnews), have better recall and AUC scores. Although there are solitary cases, described in Section Exploration of Particular Subreddit Pairs, no major correlation between the quality of results and subreddit topical areas was observed.
Second, it was postulated that subreddits with similar named entity network structures (measured with network characteristics) may deliver better quality of results. This assumption was correct. Networks with larger peripheries (measured as the set of nodes with eccentricity [
96] equal to the diameter) achieved recall values that were better by approximately 30–60 percentage points. Similarly, networks with a large number of isolates (nodes with a degree equal to zero) achieved results approximately eight percentage points higher than the mean. Upon further reflection, this phenomenon is easy to explain. In most cases, networks with a large periphery or with many isolates happen to also have a low number of high-degree nodes. These high-degree nodes are the focal points of the network (i.e., of discussion). Hence, they achieve a higher
score,
greater numbers of comments, and higher values of other metadata. Moreover, they are very likely to appear in similar network configurations in subreddit named entity networks. This means that they are more often chosen as the results by both the metadata and the network characteristics rankings. Moreover, they are also likely to be crossposted due to their general popularity. Finally, this means that the proposed method works more reliably on non-complex networks, i.e., networks with a large periphery and/or a large number of isolates. As a by-product, this presents an interesting case of the “1% rule” for Internet networks, previously shown in other studies [
88,
97].
Many other approaches to combining metadata and network characteristics were tested, e.g., using the percent difference with the
score or the sum with the degree of centrality, etc. (we note that there are almost infinitely many ways to calculate a similarity between two characteristics (see, for instance, [
98])). Moreover, multiple popular graph metrics from the previous study [
71] were also attempted, i.e., pagerank [
99], voterank, closeness centrality [
100], betweenness centrality [
101], current flow closeness centrality [
102], current flow betweenness centrality [
102,
103]. Node2Vec embeddings with different hyperparameters (p: 1, 2; q: 1, 2; walk length: 10, 100; number of walks: 10, 100; vector dimensions: 16, 32, 64) were also checked. However, all of them failed to obtain results of at least 10% recall@10 or at least 20% AUC, so they were omitted from this discussion. Overall, the final best results in terms of recall and AUC were achieved for the formulas introduced in
Section 5. These are
degree,
degree centrality,
clustering [
51] and
average neighbor degree [
50]. Nevertheless, this issue is not resolved, and further investigation is planned.
Exploration of Particular Subreddit Pairs
Let us now briefly share the most interesting results obtained in the cases where the algorithm reached the highest recall and AUC. These results are summarized in
Table 3. There, when applicable, metric values are given in parentheses, with the following demarcation: r. = recall@10 and a. = AUC.
Let us now summarize the most interesting findings. First, many similarities occur when one of the subreddits has a strictly narrower or wider topic, and it is this topic (named entity) that is the common one for the two subreddits. For example, r/SteamDeck (gaming console) and r/totalwar (video game) share the entities Steam Deck and Total War (score: r. 100%, a. 76%; number of comments: r. 100%, a. 80%). Between r/manga and r/Meika (a manga character), it is Meika that is the most common (degree: r. 81%, a. 78%; average neighbor degree: r. 82%, a. 100%). Between r/PornStars and r/KristyBlack (a porn actress), the entity is Kristy Black (total awards received: r. 89%, a. 80%); for r/MovieDetails and r/UnexpectedMulaney (a subreddit about references to John Mulaney), John Mulaney is the most common topic (degree: r. 63%, a. 98%); r/hqcelebvideos (a subreddit about celebrities) and r/EmmaWatson share Emma Watson as the main similarity (degree centrality: r. 86%, a. 82%).
Second, there exist groups of subreddits that intuitively could have been expected to be similar or have common grounds. For example, r/nvidia and r/AyyMD are both about digital computing companies, and the subreddits share many GPU-related entities, such as AMD, EVGA, NVIDIA GeForce, and RTX (score, number of comments and has image all have recall over 83% and AUC over 73%). Another example is r/carporn (related to beautiful cars) and r/NASCAR, where the common grounds are names of races and cars (e.g., Circuit Zolder, Death Valley, NASCAR Camaro, Suzuki, XB Falcon) (degree centrality: r. 67%, a. 67%). The German (language) subreddit r/German and the Germany (country) subreddit r/germany share German and Deutsch (score, number of comments, total awards received, has image—all r. 100% and a. 100%). r/Borderlands2 and r/Borderlands3 (both related to the video game series) share the name of the game as the most common topic (score, number of comments, total awards received, has image—all r. 100%, a. 100%). r/graphic_design and r/technology share Adobe (the company creating software for, among others, graphic design) (number of comments: r. 100%, a. 100%). r/onions (a subreddit about anonymous access to the Internet) and r/ethereum (cryptocurrency) share Tor, software for anonymous Internet browsing (score, number of comments, total awards received—all r. 100% and a. 100%). r/frugalmalefashion (fashion subreddit) and r/eagles (sports team) share Nike (the sports fashion company) as the main common interest (score, number of comments, total awards received—all r. 100% and a. 100%).
r/Piracy and r/vpnnetwork are both mostly interested in Blackfriday VPN, Disney Plus, and Internet Service Providers (ISPs), which are intuitive common topics between the two (degree centrality: r. 71%, a. 92%). Further, r/vpnnetwork and r/fantasybball (fantasy basketball) share the NBA league, Firefox, and Android (score, number of comments, degree, degree centrality, average neighbor degree—all r. 100% and a. 100%).
Looking from yet another angle, some rather unintuitive similarities between subreddits were found. For example, r/crappyoffbrands and r/AwesomeOffBrands, which discuss bad and great offbrands, both share China as the main similarity (degree centrality: r. 86%, a. 88%). r/UNBGBBIIVCHIDCTIICBG (a subreddit about engaging videos) and r/lingling40hrs (a subreddit about string instruments) share Electric Harp, Kiki Bello (a harpist) and Van Halen (a guitarist). It seems intuitive that these would be the shared named entities, but it is not intuitive that these subreddits share any topic whatsoever.
Separately, it is relatively easy to realize that this work has potential to help other research to expand its scope. For instance, there were studies [
104,
105] on social norms based on the r/AITA subreddit. What is interesting is that the acronym AITA (“Am I The Asshole?”) is the main similarity between r/AITA and r/justdependathings (
score,
number of comments, average neighbor degree—all r. 100% and a. 100%) and also between r/AITA and r/weddingshaming (average neighbor degree: r. 63%, a. 98%), and r/AITA and r/houseplants (
score,
number of comments, average neighbor degree—all r. 100% and a. 100%). In this way, two additional communities could have been worth looking into when studying social norms. Moreover, there was a master's thesis concerning The Legend of Korra [
106] which analyzed the subreddit r/TheLastAirBender. Using the introduced tool, this study could benefit from also considering r/ecchi and r/AndavaArt, two subreddits where The Legend of Korra is one of the main similarities (average neighbor degree: r. 66% and a. 80%). Another example is a study on the subreddit r/NBA [
107]. As it appears, NBA is the main similarity between subreddits r/vpnnetwork and r/fantasybball, the latter being dedicated to fantasy basketball, which also aggregates fans of the league. Finally, various studies on the 4chan Reddit community (r/4chan) [
108,
109,
110] could also look into the subreddit r/justneckbeardthings, because r/4chan and r/justneckbeardthings share 4chan and Anon as the main similarities (
score,
number of comments,
total awards received—all r. 100% and a. 100%).
Finally, a general observation visible in the results is that, most often, metadata metrics are consistent with each other, i.e., when
score achieves high recall,
number of comments and
total awards received achieve high results, too. For example, this is the case for r/German and r/germany, r/Borderlands2 and r/Borderlands3, r/onions and r/ethereum, r/freefromwork and r/wholesomememes (see results in
Table 2). A similar situation occurs with network characteristics, e.g., for r/vpnnetwork and r/fantasybball, r/AITA and r/justdependathings, r/manga and r/Meika.
On the other hand, there is no correlation between the node and network characteristics. As a matter of fact, it is possible that one group of metrics reports high similarity measures, while the other reports very low ones. This observation will be investigated in future research.
6. Concluding Remarks
The aim of this contribution was twofold. First, our goal was to apply named entity recognition and graph networks to propose a method for studying the structure of the information content of subreddits. The proposed method treats subreddits as individual structures and establishes similarities between them. Second, we aimed to explore the potential of crossposts as (1) natural indicators of topics shared by subreddits and (2) a supporting measure of the success of the proposed method. Application of the proposed approach made it possible to capture a number of expected similarities between subreddits. Moreover, some unexpected relations were also found. Finally, when using crossposts, it was shown that the proposed method achieved performance metrics that match those reported for similar problems. Combined, these results deliver a basic level of confidence in the proposed approach.
Since the obtained results are promising, further explorations are planned. Here, a few potential examples based on the use of the already developed tool set are presented. (1) In the text, some suggestions are made as to the ways to make the results more comprehensive. This can include, among other methods, extending the dataset by one more year, e.g., 2021 (or two more years, including 2020). We note also that the use of 2023 data has to wait until after the end of the year, for the posting and crossposting to subside. Adding data would also allow following the evolution of the information content structure for the time frame of 2020–2022. (2) Taking into account that the strength of connections is already measured, it would be possible to start from considering almost all connections and then systematically increase the threshold for keeping connections. In this way, the connection strength-based evolution of the network graphs can be explored. (3) The reported work involved only textual data. Since a large portion of Reddit content is multimedia (mostly images and videos or short clips), there is space for extending the proposed approach with multi-modal image processing. This, however, requires substantially more work.
Finally, future work may involve the generation of crosspostable content. Though this is just a weak hypothesis (outside the scope of this work), the proposed approach could help to generate crosspostable content, i.e., posts of interest to readers of multiple precisely identified subreddits. Here, our solution could help to generate traffic/coverage desired by content creators and bridge gaps between information bubbles [
111] and/or reach out to echo chambers [
112].