# Characterizing Topics in Social Media Using Dynamics of Conversation


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Datasets

#### 2.2. Characterizing the Dynamics of Conversation and Discourse

#### 2.2.1. Definitions

**Observation 1.**

**Observation 2.**

#### 2.2.2. Feature Design

1. Depth: The path length between the root node and the farthest response node in a branch. We use Dijkstra’s algorithm (DA) to measure this. $$\mathrm{DEP}(B_i)=\max_{n\in B_i,\,n\ne n_0}\mathrm{DA}(n_0,n)$$
2. Magnitude: The maximum in-degree centrality in a branch. Given the adjacency matrix $A$ of $T$, $$\mathrm{MAG}(B_i)=\max_{n\in B_i}\sum_{k}a_{k,n},\quad a_{k,n}\in A$$
3. Engagement: The total number of users involved in a branch, where $n_m$ is the set of messages generated by user $m$. $$\mathrm{ENG}(B_i)=\left|\{m \mid |n_m\cap B_i|>0\}\right|$$
4. Longevity: The time $t$ that elapsed between the creation of an initial node and the latest response node. $$\mathrm{LNG}(B_i)=\max_{n\in B_i}t(n)-\min_{n\in B_i,\,n\ne n_0}t(n)$$
5. First-order entropy: The entropy of the distribution $p_1(m)$, the probability that some user $m$ within $r$ generates a message $n$ for $T$. $$\mathrm{E}_1=-\sum_{m\in r}p_1(m)\ln p_1(m)$$
6. Second-order entropy: The entropy of the distribution $p_2(l_{ij})$, the probability that a unique edge $l_{ij}=(n_i,n_j)$ is formed between two messages by two specific users $m_i$ and $m_j$ within $T$. $$\mathrm{E}_2=-\sum_{l_{ij}\in L}p_2(l_{ij})\ln p_2(l_{ij})$$
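The six features can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Comment` record and all function names are hypothetical, magnitude is read as the number of direct replies a node receives, and both probabilities are estimated as relative frequencies (per replying user for $p_1$, per ordered pair of replied-to user and replier for $p_2$).

```python
from collections import Counter, deque
from dataclasses import dataclass
from math import log
from typing import Optional


@dataclass(frozen=True)
class Comment:
    """Hypothetical record for one node of T (not a structure from the paper)."""
    id: str
    parent: Optional[str]  # None marks the head node n_0
    user: str
    time: float            # creation time t(n), e.g. seconds since submission


def depth(branch, root_id):
    """DEP: longest path from n_0 to any other node of the branch."""
    children = {}
    for c in branch:
        if c.parent is not None:
            children.setdefault(c.parent, []).append(c.id)
    # Edges are unweighted, so breadth-first search yields the same
    # shortest-path lengths that Dijkstra's algorithm would.
    dist, queue = {root_id: 0}, deque([root_id])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            dist[child] = dist[node] + 1
            queue.append(child)
    return max((d for n, d in dist.items() if n != root_id), default=0)


def magnitude(branch):
    """MAG: assumed here to be the largest number of direct replies to one node."""
    ids = {c.id for c in branch}
    replies = Counter(c.parent for c in branch if c.parent in ids)
    return max(replies.values(), default=0)


def engagement(branch):
    """ENG: number of distinct users with at least one message in the branch."""
    return len({c.user for c in branch})


def longevity(branch, root_id):
    """LNG: latest timestamp minus the earliest timestamp among non-head nodes."""
    times = [c.time for c in branch]
    reply_times = [c.time for c in branch if c.id != root_id]
    return max(times) - min(reply_times) if reply_times else 0.0


def entropy(counts):
    """Shannon entropy (natural log) of the relative frequencies of counts."""
    total = sum(counts)
    return -sum(c / total * log(c / total) for c in counts if c > 0)


def first_order_entropy(tree):
    """E_1, with p_1(m) estimated as user m's share of the replies in T."""
    replies = Counter(c.user for c in tree if c.parent is not None)
    return entropy(replies.values())


def second_order_entropy(tree):
    """E_2, with p_2 estimated as the share of edges per (replied-to, replier) pair."""
    by_id = {c.id: c for c in tree}
    pairs = Counter((by_id[c.parent].user, c.user)
                    for c in tree if c.parent in by_id)
    return entropy(pairs.values())


# Toy thread forming a single branch: n0 <- a <- {b <- d, c}
thread = [
    Comment("n0", None, "op", 0.0),
    Comment("a", "n0", "u1", 10.0),
    Comment("b", "a", "u2", 20.0),
    Comment("c", "a", "op", 30.0),
    Comment("d", "b", "u1", 50.0),
]
print(depth(thread, "n0"), magnitude(thread), engagement(thread),
      longevity(thread, "n0"))
print(first_order_entropy(thread), second_order_entropy(thread))
```

The branch features take one branch $B_i$ (including $n_0$) as input, while the two entropies take the whole tree; how a tree is split into branches is left to the caller.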

#### 2.3. Validating Response Features through Genre Classification

#### 2.4. Feature Exploration through Clustering

#### 2.5. Comparing Response Features to Latent Dirichlet Allocation

#### 2.6. Outlier Detection

## 3. Results

#### 3.1. Genre Classification

#### 3.2. Clustering Analysis

#### 3.3. LDA Analysis

#### 3.4. Outlier Analysis

## 4. Discussion

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest


**Figure 2.** PCA of response features for 1000 submissions from $r/hockey$, clustered by K-means ($K=2$). Word clouds of keywords extracted from the text of the submissions contained in each cluster are shown above, with the size of each keyword corresponding to the magnitude of its PageRank score.

**Figure 3.** RBO score of top 10 keyword lists versus Euclidean distance of response feature clusters for (**a**) $r/politics$ where $K=5$ and (**b**) $r/atheism$ where $K=3$. RBO’s weight parameter $p$ is set to $0.98$.
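For readers unfamiliar with rank-biased overlap (RBO), the score compared in Figure 3 can be sketched as follows. Which RBO variant was used is not stated in the caption; this sketch assumes the extrapolated form for two equal-length rankings without duplicates, and the function name `rbo_ext` is hypothetical.

```python
def rbo_ext(s, t, p=0.98):
    """Extrapolated RBO of two equal-length, duplicate-free rankings.

    Computes (X_k / k) * p**k + ((1 - p) / p) * sum_{d=1..k} (X_d / d) * p**d,
    where X_d is the number of items shared by the two depth-d prefixes.
    """
    if len(s) != len(t):
        raise ValueError("this sketch only handles rankings of equal length")
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    k = len(s)
    if k == 0:
        return 0.0
    seen_s, seen_t = set(), set()
    overlap = 0      # X_d, updated incrementally as the depth d grows
    weighted = 0.0   # running sum of (X_d / d) * p**d
    for d in range(1, k + 1):
        a, b = s[d - 1], t[d - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item adds to the overlap if the other list already had it.
            overlap += (a in seen_t) + (b in seen_s)
        seen_s.add(a)
        seen_t.add(b)
        weighted += (overlap / d) * p ** d
    return (overlap / k) * p ** k + ((1 - p) / p) * weighted


# Two hypothetical top-3 keyword lists that agree except for the first two ranks.
print(rbo_ext(["goal", "trade", "draft"], ["trade", "goal", "draft"], p=0.98))
```

A value of $p$ close to 1, such as the $0.98$ used here, spreads the weight almost evenly over the ten ranks, so agreement deep in the lists counts nearly as much as agreement at the top.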

**Figure 4.** Comparison of clustering patterns with response feature K-means (**a**,**c**) and LDA (**b**,**d**) on the Subreddits $r/science$ (**a**,**b**) and $r/news$ (**c**,**d**).

**Figure 5.** PCA of response features for 1000 submissions, clustered with K-means, from (**a**) $r/soccer$ where $K=6$ and (**b**) $r/gaming$ where $K=7$.

Symbol | Definition |
---|---|
$S$ | Some Subreddit within Reddit. |
$M$ | All users subscribed to $S$. |
$r$ | A submission within $S$, represented as a set of users that responded to the submission, $r=\{m\}, r\subseteq M$. |
$T(N,L)$ | A tree network representing the structure of hierarchically linked comments made by responders of a submission. |
$n$ | A user-generated comment within $T$, $n\in N$ where $N$ is the total set of comments within $T$. |
$n_0$ | The head node. This is the text submitted to the Subreddit that triggers the comment cascade. |
$l$ | A directed edge within $T$, representing the direction of information flowing between comments (from respondee to responder), $l\in L$ where $L$ is the total set of edges within $T$. |
$B$ | The set of branches found in $T$. $B_i=\{n\}, B_i\subseteq N$ where the $i$th branch contains some subset of linked comments (including $n_0$) generated for a submission. |

S | politics, gaming, soccer | politics, gaming | politics, soccer | gaming, soccer | politics, atheism |
---|---|---|---|---|---|
Score | 0.81 | 0.82 | 0.96 | 0.91 | 0.76 |

| | Game Thread | Playoff | Series | Friday | Trash Talk |
|---|---|---|---|---|---|
| Cluster 1 Frequency | 204 | 76 | 35 | 1 | 1 |
| Cluster 2 Frequency | 1 | 1 | 22 | 31 | 32 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Flamino, J.; Gong, B.; Buchanan, F.; Szymanski, B.K.
Characterizing Topics in Social Media Using Dynamics of Conversation. *Entropy* **2021**, *23*, 1642.
https://doi.org/10.3390/e23121642


**Chicago/Turabian Style**

Flamino, James, Bowen Gong, Frederick Buchanan, and Boleslaw K. Szymanski.
2021. "Characterizing Topics in Social Media Using Dynamics of Conversation" *Entropy* 23, no. 12: 1642.
https://doi.org/10.3390/e23121642