Characterizing Topics in Social Media Using Dynamics of Conversation

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Datasets

#### 2.2. Characterizing the Dynamics of Conversation and Discourse

#### 2.2.1. Definitions

#### 2.2.2. Feature Design

- Depth: The path length between the root node and the farthest response node in a branch. We use Dijkstra’s algorithm (DA) to measure this. $$\mathrm{DEP}\left({B}_{i}\right)=\underset{n\in {B}_{i},n\ne {n}_{0}}{\mathrm{max}}\left(\mathrm{DA}({n}_{0},n)\right)$$
- Magnitude: The maximum in-degree centrality in a branch. Given the adjacency matrix A of T, $$\mathrm{MAG}\left({B}_{i}\right)=\underset{n\in {B}_{i}}{\mathrm{max}}\left(\sum _{k}{a}_{k,n}\right),{a}_{k,n}\in A$$
- Engagement: The total number of users involved in a branch, where ${n}_{m}$ is the set of messages generated by user m. $$\mathrm{ENG}\left({B}_{i}\right)=\left|\left\{m\mid \left|{n}_{m}\cap {B}_{i}\right|>0\right\}\right|$$
- Longevity: The time t that elapsed between the creation of an initial node and the latest response node. $$\mathrm{LNG}\left({B}_{i}\right)=\underset{n\in {B}_{i}}{\mathrm{max}}\left(t\left(n\right)\right)-\underset{n\in {B}_{i},n\ne {n}_{0}}{\mathrm{min}}\left(t\left(n\right)\right)$$
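
The four branch features above can be illustrated with a minimal Python sketch. Everything here is hypothetical and not the authors' implementation: the toy `COMMENTS` data, the function names, and the assumption that a branch is the head node plus the subtree under one top-level comment. Because every edge in the comment tree has unit length, a breadth-first search yields the same distances as Dijkstra's algorithm, so BFS is used in its place. The sketch also assumes that the column sum $\sum_k a_{k,n}$ counts the direct replies that comment n receives within the branch, and it follows the longevity formula as written (the earliest timestamp excludes ${n}_{0}$).

```python
from collections import defaultdict, deque

# Hypothetical toy submission: comment id -> (parent id, user, timestamp in seconds).
# "n0" is the head node; every other comment replies to its parent.
COMMENTS = {
    "n0": (None, "alice", 0),
    "c1": ("n0", "bob", 60),
    "c2": ("c1", "carol", 120),
    "c3": ("c1", "alice", 150),
    "c4": ("c2", "bob", 300),
    "c5": ("n0", "dave", 90),
}


def children_of(comments):
    """Map each comment id to the list of comments replying directly to it."""
    kids = defaultdict(list)
    for node, (parent, _user, _t) in comments.items():
        if parent is not None:
            kids[parent].append(node)
    return kids


def branches(comments, root="n0"):
    """Assumed branch definition: the root plus the subtree of one top-level comment."""
    kids = children_of(comments)
    result = []
    for top in kids[root]:
        branch, queue = {root, top}, deque([top])
        while queue:
            for child in kids[queue.popleft()]:
                branch.add(child)
                queue.append(child)
        result.append(branch)
    return result


def branch_features(comments, branch, root="n0"):
    """Compute depth, magnitude, engagement, and longevity for one branch."""
    kids = children_of(comments)

    # Depth: BFS distances from the root, restricted to nodes in the branch.
    dist, queue = {root: 0}, deque([root])
    while queue:
        node = queue.popleft()
        for child in kids[node]:
            if child in branch:
                dist[child] = dist[node] + 1
                queue.append(child)
    depth = max(d for n, d in dist.items() if n != root)

    # Magnitude: largest number of direct replies received by any node in the branch.
    magnitude = max(sum(1 for c in kids[n] if c in branch) for n in branch)

    # Engagement: number of distinct users with at least one message in the branch.
    engagement = len({comments[n][1] for n in branch})

    # Longevity: latest timestamp minus the earliest non-root timestamp.
    latest = max(comments[n][2] for n in branch)
    earliest_reply = min(comments[n][2] for n in branch if n != root)
    longevity = latest - earliest_reply

    return {"DEP": depth, "MAG": magnitude, "ENG": engagement, "LNG": longevity}


if __name__ == "__main__":
    for b in branches(COMMENTS):
        print(sorted(b), branch_features(COMMENTS, b))
```

On the toy data, the branch under `c1` has depth 3, magnitude 2, engagement 3, and longevity 240 seconds, while the single-comment branch under `c5` has depth 1 and longevity 0.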

- First order entropy: The entropy of the distribution ${p}_{1}\left(m\right)$, the probability of some user m within r generating a message n for T. $${\mathrm{E}}_{1}=-\sum _{m\in r}{p}_{1}\left(m\right)\mathrm{ln}\left({p}_{1}\left(m\right)\right)$$
- Second order entropy: The entropy of the distribution ${p}_{2}\left({l}_{ij}\right)$, the probability of a unique edge ${l}_{ij}=({n}_{i},{n}_{j})$ being formed between two messages by two specific users ${m}_{i}$ and ${m}_{j}$ within T. $${\mathrm{E}}_{2}=-\sum _{{l}_{ij}\in L}{p}_{2}\left({l}_{ij}\right)\mathrm{ln}\left({p}_{2}\left({l}_{ij}\right)\right)$$
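
As a rough illustration of the two entropy features, the sketch below estimates both distributions from raw counts. This is an assumption rather than the authors' stated estimator: ${p}_{1}(m)$ is taken as user m's share of the response comments, and ${p}_{2}({l}_{ij})$ as the share of edges that connect a message by user ${m}_{i}$ to a reply by user ${m}_{j}$. The toy `COMMENTS` data and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy submission: comment id -> (parent id, user).
COMMENTS = {
    "n0": (None, "alice"),
    "c1": ("n0", "bob"),
    "c2": ("c1", "carol"),
    "c3": ("c1", "alice"),
    "c4": ("c2", "bob"),
}


def shannon_entropy(counts):
    """Natural-log entropy of the empirical distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def first_order_entropy(comments):
    """E1: entropy over which user authored each response comment (head node excluded)."""
    users = Counter(user for parent, user in comments.values() if parent is not None)
    return shannon_entropy(users)


def second_order_entropy(comments):
    """E2: entropy over (respondee user, responder user) pairs, one pair per edge."""
    pairs = Counter(
        (comments[parent][1], user)
        for parent, user in comments.values()
        if parent is not None
    )
    return shannon_entropy(pairs)


if __name__ == "__main__":
    print("E1 =", first_order_entropy(COMMENTS))
    print("E2 =", second_order_entropy(COMMENTS))
```

In the toy data, bob writes two of the four responses and carol and alice one each, so ${\mathrm{E}}_{1}=1.5\,\mathrm{ln}\,2$; all four edges join distinct user pairs, so ${\mathrm{E}}_{2}=\mathrm{ln}\,4$.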

#### 2.3. Validating Response Features through Genre Classification

#### 2.4. Feature Exploration through Clustering

#### 2.5. Comparing Response Features to Latent Dirichlet Allocation

#### 2.6. Outlier Detection

## 3. Results

#### 3.1. Genre Classification

#### 3.2. Clustering Analysis

#### 3.3. LDA Analysis

#### 3.4. Outlier Analysis

## 4. Discussion

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

**Figure 2.** PCA of response features for 1000 submissions from $r/hockey$, clustered by K-means (K = 2). Word clouds of keywords extracted from the text of the submissions in each cluster are shown above, with the size of each keyword corresponding to the magnitude of its PageRank score.

**Figure 3.** RBO score of top 10 keyword lists versus Euclidean distance of response feature clusters for (**a**) $r/politics$ where $K=5$ and (**b**) $r/atheism$ where $K=3$. RBO’s weight parameter p is set to $0.98$.

**Figure 4.** Comparison of clustering patterns with response feature K-means (**a**,**c**) and LDA (**b**,**d**) on the Subreddits $r/science$ (**a**,**b**) and $r/news$ (**c**,**d**).

**Figure 5.** PCA of response features for 1000 submissions, clustered with K-means, from (**a**) $r/soccer$ where $K=6$ and (**b**) $r/gaming$ where $K=7$.

Symbol | Definition |
---|---|
S | Some Subreddit within Reddit. |
M | All users subscribed to S. |
r | A submission within S, represented as the set of users that responded to the submission, $r=\left\{m\right\},r\subseteq M$. |
$T(N,L)$ | A tree network representing the structure of hierarchically linked comments made by responders to a submission. |
n | A user-generated comment within T, $n\in N$, where N is the total set of comments within T. |
${n}_{0}$ | The head node. This is the text submitted to the Subreddit that triggers the comment cascade. |
l | A directed edge within T, representing the direction of information flowing between comments (from respondee to responder), $l\in L$, where L is the total set of edges within T. |
B | The set of branches found in T. ${B}_{i}=\left\{n\right\},{B}_{i}\subseteq N$, where the ith branch contains some subset of linked comments (including ${n}_{0}$) generated for a submission. |

S | politics, gaming, soccer | politics, gaming | politics, soccer | gaming, soccer | politics, atheism |
---|---|---|---|---|---|
Score | 0.81 | 0.82 | 0.96 | 0.91 | 0.76 |

Keyword | Game Thread | Playoff | Series | Friday | Trash Talk |
---|---|---|---|---|---|
Cluster 1 Frequency | 204 | 76 | 35 | 1 | 1 |
Cluster 2 Frequency | 1 | 1 | 22 | 31 | 32 |

