Topic Extraction: BERTopic’s Insight into the 117th Congress’s Twitterverse
Abstract
1. Introduction
- R.Q.1.
- How does BERTopic differ from traditional topic extraction techniques? Given the novelty of this algorithm, a deep understanding of its functioning will help clarify the main research question, R.Q.2.
- R.Q.2.
- How can BERTopic be optimized to handle Twitter data? This is the main research question. Given the novelty of BERTopic, few works employ it to date. The optimization process requires a strong awareness of the different stages of the algorithm and a set of evaluation metrics to compare.
- R.Q.3.
- What topics are most prevalent in the 117th US Congress? Is there variation between parties? Are there overarching topics that characterize this Congress?
2. Background
2.1. A Brief Contextualization of U.S. Politics
2.1.1. We the People
2.1.2. The 117th United States Congress
2.2. American Politics and Twitter
3. Review of Short-Text Topic Mining Techniques
3.1. Topic Modeling
- It expands the preprocessing steps. Instead of removing all non-alphanumeric characters, it preserves those that carry meaning in tweets (hashtags, mentions, and URLs) and uses them as additional information (e.g., as potential topic labels).
- Words in a tweet are either topic words or background words, each with its own underlying word distribution. A given user has a topic distribution and follows a Bernoulli process when writing a tweet: first, the user picks a topic based on their own topic distribution; then, when choosing each subsequent word, the user draws either a topic word or a background word.
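The generative process described above can be sketched as a toy simulation; the vocabulary, the user's topic distribution, and the Bernoulli parameter below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: topic-specific words and background words.
topic_words = {0: ["economy", "jobs", "tax"], 1: ["climate", "energy", "solar"]}
background_words = ["today", "great", "thanks"]

user_topic_dist = np.array([0.7, 0.3])  # the user's own topic distribution
p_background = 0.4                       # Bernoulli switch: background vs. topic word

def generate_tweet(n_words=8):
    # Step 1: the user picks one topic for the tweet from their distribution.
    topic = rng.choice(len(user_topic_dist), p=user_topic_dist)
    words = []
    for _ in range(n_words):
        # Step 2: each word is a Bernoulli draw between a background
        # word and a word from the chosen topic's distribution.
        if rng.random() < p_background:
            words.append(rng.choice(background_words))
        else:
            words.append(rng.choice(topic_words[topic]))
    return topic, words
```

Real Twitter-LDA infers the topic and word distributions from data; this sketch only mimics the forward (generative) direction.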
3.2. Topic Classification
- Continuous bag-of-words: predicts a word based on its neighbors. The architecture consists of an input layer (the neighbor words), a hidden layer, and an output layer (the predicted word). The hidden layer learns the vector representation of the words.
- Skip–Gram: predicts neighbors based on a word. The architecture consists of an input layer (the word), a hidden layer, and an output layer (the neighbors). The hidden layer outputs the vector representation of the individual word.
- Distributed Memory: predicts the next word given an input layer of document vector and a ‘context’ vector of current words (those surrounding the target word). The latter consists of a sliding window, which traverses the document as the target word changes. These two vectors are concatenated and passed through a hidden layer, where their relationships are learned. The outputs consist of a softmax layer, predicting the probability distribution of the target word. During training, the hidden and output layer’s weights are adjusted to minimize the prediction error. The document vector is also updated at this stage.
- Distributed Bag-of-Words: predicts a set of words given only the document vector. The difference of this approach to distributed memory is that the input layer only consists of the document vector. There is no consideration of word order or context within the document, which simplifies the prediction task.
- Creating embeddings: generates embedded document and word vectors, using Word2Vec and Doc2Vec.
- Reducing dimensionality: reduces the number of dimensions in the documents using Uniform Manifold Approximation and Projection (UMAP). UMAP is a dimensionality reduction algorithm well suited to complex datasets, with the added advantage of placing similar data points close to each other, helping to identify dense areas. UMAP calculates similarity scores across pairs of data points based on their distances and a user-defined number of neighbors. It then projects the points into a low-dimensional graph and adjusts the distances between them according to their cluster [60]. This step therefore preserves the variability of the embeddings while greatly reducing the dimensionality of the space, and it also helps reveal clusters in the data.
- Segmenting clusters: the model identifies dense areas of documents in the space. If a document belongs to one of these areas, it is given a label; otherwise, it is considered noise. This is performed with HDBSCAN, a hierarchical clustering algorithm that classifies data points as either core points or non-core points. Core points are those with more than a minimum (user-defined) number of neighbors. After classifying all observations, a core point is assigned to cluster 1. All other points in its neighboring region are added to the cluster, as are the neighboring points of these. When no other core points can be assigned to this cluster, a new core point is assigned to cluster 2. This process is repeated until all data points have been assigned to a cluster [61].
- Computing centroids: For each dense area within the document vectors, the centroid is calculated, which serves as the topic vector.
- Finding topic representations: the topic representations consist of the nearest word vectors to the topic vector.
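The five steps above can be sketched end-to-end. This toy example uses random vectors in place of learned embeddings and, to stay self-contained, substitutes scikit-learn's PCA and DBSCAN for UMAP and HDBSCAN; the vocabulary and cluster parameters are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Step 1 (stand-in): a joint space of 20 document vectors and 6 word
# vectors in 32 dimensions, built around two artificial dense areas.
doc_vecs = np.vstack([rng.normal(0, 0.1, (10, 32)),   # dense area 1
                      rng.normal(5, 0.1, (10, 32))])  # dense area 2
vocab = ["economy", "jobs", "tax", "climate", "energy", "solar"]
word_vecs = np.vstack([rng.normal(0, 0.1, (3, 32)),
                       rng.normal(5, 0.1, (3, 32))])

# Step 2: reduce dimensionality (PCA here as a stand-in for UMAP).
low_dim = PCA(n_components=2).fit_transform(doc_vecs)

# Step 3: segment dense areas (DBSCAN as a stand-in for HDBSCAN);
# the label -1 marks noise documents outside any dense area.
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(low_dim)

# Step 4: the centroid of each dense area serves as the topic vector.
topic_vecs = {c: doc_vecs[labels == c].mean(axis=0)
              for c in set(labels) if c != -1}

# Step 5: the topic representation is the nearest word vectors
# (by cosine similarity) to the topic vector.
def top_words(topic_vec, n=3):
    sims = word_vecs @ topic_vec / (np.linalg.norm(word_vecs, axis=1)
                                    * np.linalg.norm(topic_vec))
    return [vocab[i] for i in np.argsort(-sims)[:n]]
```

The real Top2Vec pipeline follows exactly this shape, only with trained Doc2Vec embeddings and the UMAP/HDBSCAN algorithms described above.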
- Encoder–decoder structure: the encoder takes in the input sequence and processes it through multiple layers of self-attention and feed-forward neural networks, capturing contextual information. The decoder then generates the output sequence based on the encoded representation of the input and its self-attention mechanism.
- Self-attention mechanism: self-attention allows each word in a sequence to consider the relationships with all other words in the sequence. This is achieved by calculating the weighted representations of the input words, where the weights are determined by the relevance of each word to the current word being processed. The self-attention mechanism consists of three main components: query, key, and value. These components are used to compute attention scores that determine how much each word contributes to the representation of the current word. Multiple attention heads are used in parallel to capture the different aspects of the relationships.
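The query/key/value computation described above reduces to a few matrix operations. A single-head, scaled dot-product sketch in NumPy; the projection matrices are random stand-ins for learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv  # queries, keys, values
    d_k = K.shape[-1]
    # Attention scores: each row weighs how much every word in the
    # sequence contributes to the representation of the current word.
    scores = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return scores @ V  # weighted combination of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 words, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)               # one vector per input word
```

A multi-head layer simply runs several such heads in parallel with different weight matrices and concatenates their outputs.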
3.3. Other Related Work
4. Materials and Methods
4.1. Data Selection and Extraction
4.2. Data Preprocessing
- Replacing HTML character entities with their corresponding symbols;
- Removing hashtags, hyperlinks, and mentions, storing each in a new tweet feature;
- Removing ’RT’ from tweets where it added no further information;
- Removing punctuation;
- Removing tweets of length 0 (not having associated text), which could result from the previous steps.
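The steps above can be sketched with Python's standard library; the regular expressions and the exact removal order are illustrative, not the study's actual implementation:

```python
import html
import re
import string

def preprocess(tweet):
    # Replace HTML character entities with their symbols (e.g., &amp; -> &).
    text = html.unescape(tweet)
    # Extract hashtags, mentions, and hyperlinks into separate features.
    features = {
        "hashtags": re.findall(r"#\w+", text),
        "mentions": re.findall(r"@\w+", text),
        "urls": re.findall(r"https?://\S+", text),
    }
    # Remove them from the tweet text, along with the 'RT' marker.
    text = re.sub(r"#\w+|@\w+|https?://\S+", "", text)
    text = re.sub(r"\bRT\b", "", text)
    # Remove punctuation and collapse whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = " ".join(text.split())
    return text, features  # tweets with empty text are dropped downstream
```

For example, `preprocess("RT @user: Jobs &amp; growth! #economy https://t.co/x")` keeps only the words `Jobs growth` as text while the hashtag, mention, and URL survive as features.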
4.3. Topic Extraction with BERTopic
4.3.1. Embeddings
4.3.2. Dimensionality Reduction
4.3.3. Clustering
4.3.4. Vectorizer
4.3.5. Topic Representation
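Each of the five stages above maps onto a swappable component of BERTopic's constructor. A configuration sketch; the parameter values are illustrative defaults, not the tuned settings of this study:

```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

# 4.3.1 Embeddings: an SBERT sentence-transformer model.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# 4.3.2 Dimensionality reduction.
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")
# 4.3.3 Clustering.
hdbscan_model = HDBSCAN(min_cluster_size=50, prediction_data=True)
# 4.3.4 Vectorizer.
vectorizer_model = CountVectorizer(ngram_range=(1, 2), min_df=10)
# 4.3.5 Topic representation (class-based TF-IDF).
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)
# topics, probs = topic_model.fit_transform(tweets)
```

Because every stage is a separate object, the optimization in R.Q.2 amounts to swapping components (e.g., PCA for UMAP, k-Means for HDBSCAN) and comparing the evaluation metrics of Section 4.4.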
4.4. Topic Evaluation with BERTopic
- Perplexity measures the model’s capability of generating documents based on the learned topics. It shows how well the model explains the data by analyzing its predictive power, and it can therefore be seen as a measure of model quality.
- Topic coherence is a measure closely related to interpretability. Since a topic is a discrete distribution over words, the words inside each topic must be coherent: they should be similar to one another, making the topic interpretable rather than a mere artifact of statistical inference. Word similarity can be measured with cosine similarity, which ranges from 0 to 1; higher values suggest closely related words.
- Topic diversity examines how different the topics are from each other. Low diversity is the result of redundant topics, which are difficult to distinguish from the remaining ones. A suggested technique for measuring diversity is counting the unique words among the top words of all topics [91].
- Stability in topic modeling refers to whether the results obtained are consistent across runs. Since topic modeling applies concepts from statistical inference, some variation is expected when modeling the same data for several runs. It also follows that more relevant topics will be consistent throughout iterations. One way to measure the stability of a model is to use the Jaccard similarity coefficient of the top words across topics and across iterations. Jaccard similarity ranges from 0 to 1, where identical topics share a coefficient of 1.
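Topic diversity and Jaccard-based stability can be computed directly from the top-word lists; a minimal sketch with toy topics:

```python
def topic_diversity(topics, top_n=10):
    """Fraction of unique words among the top-n words of all topics [91]."""
    top_words = [w for topic in topics for w in topic[:top_n]]
    return len(set(top_words)) / len(top_words)

def jaccard(a, b):
    """Jaccard similarity of two top-word sets (1.0 = identical topics)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Two runs of the same model on the same data (toy top words).
run1 = [["economy", "jobs", "tax"], ["climate", "energy", "solar"]]
run2 = [["economy", "jobs", "growth"], ["climate", "energy", "solar"]]

diversity = topic_diversity(run1)        # 1.0: no word shared between topics
stability = jaccard(run1[0], run2[0])    # 0.5: the first topic is only partly stable
```

Averaging the Jaccard coefficient over matched topic pairs across several runs gives a single stability score for the model.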
5. Results and Discussion
5.1. Topic Results
- DANGER #Putin’s legitimacy built on its image as the strong leader who restored #Russia to superpower after the disasters of the ’90s. Now the economy is in shambles & the military is being humiliated & their only tools to reestablish power balance with the West are cyber & nukes (28 February 2022);
- To force #Ukraine into deal he can claim as a victory #Putin needs battlefield momentum. The danger now is that he has no economic or diplomatic cards to play, their conventional forces are stalled & cyber, chemical, biological & non-strategic nukes are their only escalation options (25 March 2022).
- Happy birthday to my friend @RepDavids from #KS03, a champion for creating economic opportunity and giving working families the tools they need to #MakeItInAmerica as Vice Chair of @TransportDems and as a Member of @HouseSmallBiz. (22 May 2022);
- Wishing a happy birthday to my friend from #IL06, @RepCasten, a champion for climate as Co-Chair of the New Democrat Coalition’s #ClimateChange Task Force and a Member of @ClimateCrisis. (24 November 2022).
- We must always remember and honor those who served our country and those who gave their lives in protecting our freedoms. May we never forget their sacrifice this Memorial Day. (@VP, 30 May 2022);
- On Veterans Day, we honor the brave men and women who answer the call to serve. They represent the best of what America has to offer. We owe them the greatest debt of gratitude. Thank you for your service - and a special thank you to my favorite veteran, my father Ed, now 94. (@RepAdamSchiff, 11 November 2022).
- An economy that is strong. A nation that is safe. A future that is built on freedom. A government that is accountable. This is the Republican Commitment to America. (22 September 2022);
- Republicans have made our #CommitmentToAmerica. Under the leadership of @GOPLeader, we will build: An Economy That is Strong, A Nation That is Safe, A Future That is Built on Freedom, A Government That is Accountable. Now let us get to work. (17 November 2022).
- In today’s episode of @Firebrand_Pod, Rep. Matt Gaetz brings us an exclusive report from the U.S.-Mexico border with @sherifflamb1, and discusses rising gas costs, Russian propaganda, men competing in women’s sports, and more! (24 March 2022);
- RT @Firebrand_Pod: Episode 76 LIVE: Ban TikTok (feat. @GavinWax)—Firebrand with @RepMattGaetz https://t.co/KQdzDcSeHJ (18 November 2022).
- The needs of Utahns have been forefront as I have helped negotiate our bipartisan infrastructure plan. Our plan would provide Utah with funding to expand our physical infrastructure and help fight wildfires without tax increases or adding to the deficit. (11 July 2021);
- From funding water projects like the Central Utah Project to building transportation systems like High Valley Transit and modernizing wildfire policy through an expert commission, our infrastructure bill has been delivering for Utah since it was signed into law 1 year ago today. (15 November 2022).
- #DefendOurDemocracy: Re-elect Democratic Rep. Tom O’Halleran in #AZ02. (1 July 2022);
- Democratic Victory: Congratulations to Congresswoman @MaryPeltola on your re-election in Alaska! -NP (24 November 2022).
- Yesterday I received my second COVID-19 booster shot. We know that getting vaccinated is the best form of protection from this virus and boosters are critical in providing an additional level of protection. If you haven’t received your first booster—do it today (2 April 2022);
- Prepare for a healthy holiday season by getting your updated COVID vaccine. It’s free, safe, and effective. (13 November 2022).
5.2. Visualizing 117th Congress Topics
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
BERT | Bidirectional Encoder Representations from Transformers |
BoW | Bag-of-Words |
LDA | Latent Dirichlet Allocation |
LSA | Latent Semantic Analysis |
ML | Machine Learning |
NLP | Natural Language Processing |
pLSA | Probabilistic Latent Semantic Analysis |
STTM | Short-Text Topic Mining |
SVD | Singular Value Decomposition |
TF-IDF | Term Frequency-Inverse Document Frequency |
VSM | Vector Space Model |
Appendix A
References
- Satterfield, H. How Social Media Affects Politics. 2020. Available online: https://www.meltwater.com/en/blog/social-media-affects-politics (accessed on 1 September 2023).
- Bonney, V. How Social Media Is Shaping Our Political Future. 2018. Available online: https://www.youtube.com/watch?v=9Kd99IIWJUw (accessed on 3 August 2023).
- Center for Humane Technology. How Social Media Polarizes Political Campaigns. 2021. Available online: https://www.youtube.com/watch?v=1GRxORsQhY4 (accessed on 3 August 2023).
- Statista. Social Media and Politics in the United States. 2023. Available online: https://www.statista.com/topics/3723/social-media-and-politics-in-the-united-states/ (accessed on 26 September 2023).
- Statista. X/Twitter: Number of Users Worldwide 2024. Available online: https://www.statista.com/statistics/303681/twitter-users-worldwide/ (accessed on 26 September 2023).
- Reveilhac, M.; Morselli, D. The Impact of Social Media Use for Elected Parliamentarians: Evidence from Politicians’ Use of Twitter During the Last Two Swiss Legislatures. Swiss Political Sci. Rev. 2023, 29, 96–119. [Google Scholar] [CrossRef]
- Anand, A. Timeline of Advances in the Field of NLP that Led to Development of Tools like ChatGPT. 2020. Available online: https://dev.to/amananandrai/recent-advances-in-the-field-of-nlp-33o1 (accessed on 3 September 2023).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar] [CrossRef]
- Hajjej, A. Trump Tweets: Topic Modeling Using Latent Dirichlet Allocation. 2020. Available online: https://medium.datadriveninvestor.com/trump-tweets-topic-modeling-using-latent-dirichlet-allocation-e4f93b90b6fe (accessed on 26 September 2023).
- Abadah, M.S.K.; Keikhosrokiani, P.; Zhao, X. Analytics of Public Reactions to the COVID-19 Vaccine on Twitter Using Sentiment Analysis and Topic Modelling. In Handbook of Research on Applied Artificial Intelligence and Robotics for Government Processes; IGI Global: Hershey, PA, USA, 2023; pp. 156–188. [Google Scholar] [CrossRef]
- Zhou, S.; Kan, P.; Huang, Q.; Silbernagel, J. A guided latent Dirichlet allocation approach to investigate real-time latent topics of Twitter data during Hurricane Laura. J. Inf. Sci. 2023, 49, 465–479. [Google Scholar] [CrossRef]
- Knoll, B. President Obama, the Democratic Party, and Socialism: A Political Science Perspective. Available online: https://www.huffpost.com/entry/obama-romney-economy_b_1615862 (accessed on 3 August 2023).
- DemocraticParty. Where We Stand. Available online: https://democrats.org/where-we-stand/ (accessed on 3 August 2023).
- Republican National Committee. GOP—About Our Party. Available online: https://gop.com/about-our-party/ (accessed on 3 August 2023).
- U.S. Senate. Constitution of the United States. Available online: https://www.senate.gov/about/origins-foundations/senate-and-constitution/constitution.htm (accessed on 3 August 2023).
- Benzine, C. The Bicameral Congress: Crash Course Government and Politics 2. 2015. Available online: https://www.youtube.com/watch?v=n9defOwVWS8 (accessed on 3 August 2023).
- Benzine, C. Congressional Elections: Crash Course Government and Politics 6. 2015. Available online: https://www.youtube.com/watch?v=qxiD9AEX4Hc&list=PL8dPuuaLjXtOfse2ncvffeelTrqvhrz8H&index=6 (accessed on 3 August 2023).
- Binder, S. Goodbye to the 117th Congress, Bookended by Remarkable Events. 2022. Available online: https://www.washingtonpost.com/politics/2022/12/29/congress-year-review/ (accessed on 3 August 2023).
- PressGallery. Members’ Official Twitter Handles. Available online: https://pressgallery.house.gov/ (accessed on 27 August 2023).
- Lee, S.; Panetta, G. Twitter Is the Most Popular Social Media Platform for Members of Congress—However, Prominent Democrats Tweet More Often and Have Larger Followings than Republicans. 2019. Available online: https://www.businessinsider.com/democratic-republican-congress-twitter-followings-political-support-2019-2 (accessed on 27 August 2023).
- Mills, B.R. Take It to Twitter: Social Media Analysis of Members of Congress. 2021. Available online: https://towardsdatascience.com/take-it-to-twitter-sentiment-analysis-of-congressional-twitter-in-r-ee206a5b05bc (accessed on 27 August 2023).
- Marr, B. How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read. 2018. Available online: https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/ (accessed on 3 August 2023).
- Ma, L.; Goharian, N.; Chowdhury, A.; Chung, M. Extracting Unstructured Data from Template Generated Web Documents. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, New York, NY, USA, 3–8 November 2003; pp. 512–515. [Google Scholar] [CrossRef]
- Defined.ai. The Challenge of Building Corpus for NLP Libraries. Available online: https://www.defined.ai/blog/the-challenge-of-building-corpus-for-nlp-libraries/ (accessed on 3 August 2023).
- Murshed, B.A.H.; Mallappa, S.; Abawajy, J.; Saif, M.A.N.; Al-ariki, H.D.E.; Abdulwahab, H.M. Short text topic modelling approaches in the context of big data: Taxonomy, survey, and analysis. Artif. Intell. Rev. 2023, 56, 5133–5260. [Google Scholar] [CrossRef] [PubMed]
- Harris, Z.S. Distributional Structure. WORD 1954, 10, 146–162. [Google Scholar] [CrossRef]
- Jones, K.S. A Statistical Interpretation of Term Specificity and its Application in Retrieval. J. Doc. 1972, 28, 11–21. [Google Scholar] [CrossRef]
- Ward, J.H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236–244. [Google Scholar] [CrossRef]
- MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1967; pp. 281–297. [Google Scholar]
- Xia, L.; Luo, D.; Zhang, C.; Wu, Z. A Survey of Topic Models in Text Classification. In Proceedings of the 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 25–28 May 2019; pp. 244–250. [Google Scholar] [CrossRef]
- Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407. [Google Scholar] [CrossRef]
- Valdez, D.; Pickett, A.; Goodson, P. Topic Modeling: Latent Semantic Analysis for the Social Sciences. Soc. Sci. Q. 2018, 99. [Google Scholar] [CrossRef]
- Sai, T.V.; Lohith, K.; Sai, M.; Tejaswi, K.; Ashok Kumar, P.; Karthikeyan, C. Text Analysis On Twitter Data Using LSA and LDA. In Proceedings of the 2023 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 23–25 January 2023; pp. 1–6. [Google Scholar] [CrossRef]
- Chang, P.; Yu, Y.T.; Sanders, A.; Munasinghe, T. Perceiving the Ukraine-Russia Conflict: Topic Modeling and Clustering on Twitter Data. In Proceedings of the 2023 IEEE Ninth International Conference on Big Data Computing Service and Applications (BigDataService), Athens, Greece, 17–20 July 2023; pp. 147–148. [Google Scholar] [CrossRef]
- Qomariyah, S.; Iriawan, N.; Fithriasari, K. Topic modeling Twitter data using Latent Dirichlet Allocation and Latent Semantic Analysis. Proc. AIP Conf. 2019, 2194, 020093. [Google Scholar] [CrossRef]
- Karami, A.; Gangopadhyay, A.; Zhou, B.; Kharrazi, H. Fuzzy Approach Topic Discovery in Health and Medical Corpora. Int. J. Fuzzy Syst. 2018, 20, 1334–1345. [Google Scholar] [CrossRef]
- Kim, S.; Park, H.; Lee, J. Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: A study on blockchain technology trend analysis. Expert Syst. Appl. 2020, 152, 113401. [Google Scholar] [CrossRef]
- Hofmann, T. Probabilistic Latent Semantic Indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, 15–19 August 1999; pp. 50–57. [Google Scholar] [CrossRef]
- Kumar, P.; Vardhan, M. Aspect-Based Sentiment Analysis of Tweets Using Independent Component Analysis (ICA) and Probabilistic Latent Semantic Analysis (pLSA). In Advances in Data and Information Sciences; Springer: Singapore, 2019; pp. 3–13. [Google Scholar] [CrossRef]
- Shen, Y.; Guo, H. Research on high-performance English translation based on topic model. Digit. Commun. Netw. 2023, 9, 505–511. [Google Scholar] [CrossRef]
- Blei, D.; Ng, A.; Jordan, M.; Lafferty, J. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
- Anastasiu, D.; Tagarelli, A.; Karypis, G. Document Clustering: The Next Frontier. In Wiley StatsRef: Statistics Reference Online; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2013; pp. 305–338. [Google Scholar] [CrossRef]
- Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 2004, 101, 5228–5235. [Google Scholar] [CrossRef] [PubMed]
- Mehrpour, F. Analyzing Twitter Sentiment and Hype on Real Estate Market: A Topic Modeling Approach. 2023. Available online: https://dr.library.brocku.ca/handle/10464/17848 (accessed on 29 September 2023).
- Fakhri, M.I.; Irawan, H. Analyzing Sentiment and Topic Modelling of iPhone Xs Post Launch Event through Twitter Data. AIP Conf. Proc. 2023, 2646, 040030. [Google Scholar] [CrossRef]
- Strydom, I.F.; Grobler, J.; Vermeulen, E. Investigating the Use of Topic Modeling for Social Media Market Research: A South African Case Study. In Proceedings of the 23rd International Conference, Athens, Greece, 3–6 July 2023; pp. 305–320. [Google Scholar] [CrossRef]
- Kaur, J.; Hussain, I.Z.; Lotto, M.; Butt, Z.; Morita, P. Preventing public health crises: An expert system using Big Data and AI in combating the spread of health misinformation. Popul. Med. 2023, 5, A631. [Google Scholar] [CrossRef]
- Praveen, S.V.; Ittamalla, R.; Mahipalan, M.; Mahitha, M.; Priya, D.H. What Do Veterans Discuss the Most about Post-Combat Stress on Social Media?—A Text Analytics Study. J. Loss Trauma 2023, 28, 187–189. [Google Scholar] [CrossRef]
- Lyu, A.; Liu, C.; Ding, Z.; Li, J.; Zhang, W. Analysis of gender sentiment expression in network based on TF-LDA algorithm. Adv. Eng. Technol. Res. 2023, 5, 322. [Google Scholar] [CrossRef]
- Bheema, S.T.; Kotha, S.K. Insights from COVID-19 #Vaccine Twitter analytics. In Proceedings of the 19th Annual Symposium on Graduate Research and Scholarly Projects; Wichita State University: Wichita, KS, USA, 2023. [Google Scholar]
- Comito, C. How Do We Talk and Feel About COVID-19? Sentiment Analysis of Twitter Topics. In Proceedings of the 12th International Conference, Held as Part of the Services Conference Federation, SCF 2023, Honolulu, HI, USA, 23–26 September 2023; pp. 95–107. [Google Scholar] [CrossRef]
- Anchal, N.G.; Sriram, A.; Mathew, J.J.; Iyer, L.S.; Mahara, T. Analyzing the role of Indian media during the second wave of COVID using topic modeling. In Hybrid Computational Intelligent Systems; CRC Press: Boca Raton, FL, USA, 2023; Chapter 11. [Google Scholar]
- Meier, F.; Fugl Eskjær, M. Topic Modelling Three Decades of Climate Change News in Denmark. SSRN 2023. [Google Scholar] [CrossRef]
- Rathod, R.G.; Barve, Y.; Saini, J.R.; Rathod, S. From Data Pre-processing to Hate Speech Detection: An Interdisciplinary Study on Women-targeted Online Abuse. In Proceedings of the 2023 3rd International Conference on Intelligent Technologies (CONIT), Hubli, India, 23–25 June 2023; pp. 1–8. [Google Scholar] [CrossRef]
- Zhao, W.X.; Jiang, J.; Weng, J.; He, J.; Lim, E.P.; Yan, H.; Li, X. Comparing Twitter and Traditional Media Using Topic Models. In Advances in Information Retrieval; Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Mudoch, V., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6611, pp. 338–349. [Google Scholar] [CrossRef]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. 2013. Available online: http://arxiv.org/abs/1301.3781 (accessed on 3 August 2023).
- Le, Q.V.; Mikolov, T. Distributed Representations of Sentences and Documents. 2014. Available online: http://arxiv.org/abs/1405.4053 (accessed on 3 August 2023).
- Angelov, D. Top2Vec: Distributed Representations of Topics. 2020. Available online: http://arxiv.org/abs/2008.09470 (accessed on 3 August 2023).
- StatQuest with Josh Starmer. UMAP Dimension Reduction, Main Ideas!!! 2022. Available online: https://www.youtube.com/watch?v=eN0wFzBA4Sc (accessed on 28 September 2023).
- StatQuest with Josh Starmer. Clustering with DBSCAN, Clearly Explained!!! 2022. Available online: https://www.youtube.com/watch?v=RDZUdRSDOok (accessed on 28 September 2023).
- Karas, B.; Qu, S.; Xu, Y.; Zhu, Q. Experiments with LDA and Top2Vec for embedded topic discovery on social media data—A case study of cystic fibrosis. Front. Artif. Intell. 2022, 5, 948313. [Google Scholar] [CrossRef] [PubMed]
- Zengul, F.D.; Bulut, A.; Oner, N.; Ahmed, A.; Ozaydin, B.; Yadav, M. A Practical and Empirical Comparison of Three Topic Modeling Methods using a COVID-19 Corpus: LSA, LDA, and Top2Vec. In Proceedings of the 56th Hawaii International Conference on System Sciences, Maui, HI, USA, 3–6 January 2023. [Google Scholar]
- Vianna, D.; Silva De Moura, E. Organizing Portuguese Legal Documents through Topic Discovery. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 3388–3392. [Google Scholar] [CrossRef]
- Crijns, A.; Vanhullebusch, V.; Reusens, M.; Reusens, M.; Baesens, B. Topic modelling applied on innovation studies of Flemish companies. J. Bus. Anal. 2023, 6, 243–254. [Google Scholar] [CrossRef]
- Bretsko, D.; Belyi, A.; Sobolevsky, S. Comparative Analysis of Community Detection and Transformer-Based Approaches for Topic Clustering of Scientific Papers. In Proceedings of the 23rd International Conference, Athens, Greece, 3–6 July 2023; pp. 648–660. [Google Scholar] [CrossRef]
- Von Der Mosel, J.; Trautsch, A.; Herbold, S. On the Validity of Pre-Trained Transformers for Natural Language Processing in the Software Engineering Domain. IEEE Trans. Softw. Eng. 2023, 49, 1487–1507. [Google Scholar] [CrossRef]
- Grootendorst, M.P. The Algorithm—BERTopic. Available online: https://maartengr.github.io/BERTopic/algorithm/algorithm.html (accessed on 29 September 2023).
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar]
- Briggs, J. BERTopic Explained. 2022. Available online: https://www.youtube.com/watch?v=fb7LENb9eag (accessed on 3 August 2023).
- Hägglund, M.; Blusi, M.; Bonacina, S. Caring Is Sharing—Exploiting the Value in Data for Health and Innovation: Proceedings of MIE 2023; IOS Press: Amsterdam, The Netherlands, 2023. [Google Scholar]
- Li, Y. Insights from Tweets: Analysing Destination Topics and Sentiments, and Predicting Tourist Arrivals. Doctoral Dissertation, Durham University, Durham, UK, 2023. [Google Scholar]
- Strydom, I.F.; Grobler, J. Topic Modelling for Characterizing COVID-19 Misinformation on Twitter: A South African Case Study. In Proceedings of the 23rd International Conference, Athens, Greece, 3–6 July 2023; pp. 289–304. [Google Scholar] [CrossRef]
- Turner, J.; McDonald, M.; Hu, H. An Interdisciplinary Approach to Misinformation and Concept Drift in Historical Cannabis Tweets. In Proceedings of the 2023 IEEE 17th International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, 1–3 February 2023; pp. 317–322. [Google Scholar] [CrossRef]
- Koonchanok, R.; Pan, Y.; Jang, H. Tracking public attitudes toward ChatGPT on Twitter using sentiment analysis and topic modeling. arXiv 2023, arXiv:2306.12951. [Google Scholar] [CrossRef]
- Grigore, D.N.; Pintilie, I. Transformer-based topic modeling to measure the severity of eating disorder symptoms. In Proceedings of the CLEF 2023: Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 18–21 September 2023; pp. 18–21. [Google Scholar]
- Mekacher, A.; Falkenberg, M.; Baronchelli, A. The Systemic Impact of Deplatforming on Social Media. arXiv 2023, arXiv:2303.11147. [Google Scholar] [CrossRef]
- Schneider, N.; Shouei, S.; Ghantous, S.; Feldman, E. Hate Speech Targets Detection in Parler using BERT. arXiv 2023, arXiv:2304.01179. [Google Scholar] [CrossRef]
- Egger, R.; Yu, J. A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts. Front. Sociol. 2022, 7, 886498. [Google Scholar] [CrossRef]
- Zhou, W.; Zhang, C.; Wu, L.; Shashidhar, M. ChatGPT and marketing: Analyzing public discourse in early Twitter posts. J. Mark. Anal. 2023, 11, 693–706. [Google Scholar] [CrossRef]
- Di Corso, E.; Ventura, F.; Cerquitelli, T. All in a twitter: Self-tuning strategies for a deeper understanding of a crisis tweet collection. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 3722–3726. [Google Scholar] [CrossRef]
- Libit, D. Website that Helped Bring Down Anthony Weiner Is Coming Back. 2016. Updated May 20, 2016. Available online: https://www.cnbc.com/2016/05/19/website-that-helped-bring-down-anthony-weiner-is-coming-back.html (accessed on 28 September 2023).
- de Groot, M.; Aliannejadi, M.; Haas, M.R. Experiments on Generalizability of BERTopic on Multi-Domain Short Text. arXiv 2022, arXiv:2212.08459. [Google Scholar] [CrossRef]
- Gensim: Topic Modelling for Humans. Available online: https://radimrehurek.com/gensim/ (accessed on 28 September 2023).
- Řehůřek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 22 May 2010; pp. 45–50. [Google Scholar]
- Gewers, F.L.; Ferreira, G.R.; Arruda, H.F.; Silva, F.N.; Comin, C.H.; Amancio, D.R.; Costa, L.D. Principal Component Analysis: A Natural Approach to Data Exploration. arXiv 2018, arXiv:1804.02502. [Google Scholar] [CrossRef]
- Shlens, J. A Tutorial on Principal Component Analysis. arXiv 2014, arXiv:1404.1100. [Google Scholar] [CrossRef]
- McInnes, L.; Healy, J.; Astels, S. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2017, 2, 205. [Google Scholar] [CrossRef]
- Jin, X.; Han, J. K-Means Clustering. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2010; pp. 563–564. [Google Scholar] [CrossRef]
- Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2023, 112, 102131. [Google Scholar] [CrossRef]
- Dieng, A.B.; Ruiz, F.J.R.; Blei, D.M. Topic Modeling in Embedding Spaces. Trans. Assoc. Comput. Linguist. 2020, 8, 439–453. [Google Scholar] [CrossRef]
Model | Objective |
---|---|
TM-LDA | Temporal feature-based TM |
RO-LDA | Solve lack of local word co-occurrence |
TSVB-LDA | Increase accuracy |
BR-LDA | Remove background words |
Logistic LDA | Handle arbitrary input (e.g., images) |
TIN-LDA | Explore the interest of microblog users |
TH-LDA | Hierarchical dimensions for high semantics |
FB-LDA | Handle the binary weighting of words |
MBA-LDA | Select best topic representation words |
SS-LDA | Does not require annotated training data |
Twitter-LDA | Achieve high accuracy with Twitter data |
Member Name | Twitter Handle(s) | Chamber | Party |
---|---|---|---|
Adam Schiff | @RepAdamSchiff @AdamSchiff | Representative | D |
Alexandria Ocasio-Cortez | @AOC @RepAOC | Representative | D |
Andy Biggs | @RepAndyBiggsAZ | Representative | R |
Bernie Sanders | @BernieSanders @SenSanders | Senator | I |
Charles Schumer | @SenSchumer @chuckschumer | Senator | D |
Cory Booker | @SenBooker @CoryBooker | Senator | D |
Elizabeth Warren | @ewarren @SenWarren | Senator | D |
Jim Jordan | @Jim_Jordan | Representative | R |
Joaquin Castro | @JoaquinCastrotx | Representative | D |
Joe Biden | @JoeBiden @POTUS | President | D |
John Cornyn | @JohnCornyn | Senator | R |
John Kennedy | @SenJohnKennedy | Senator | R |
Kamala Harris | @KamalaHarris @VP | Vice-President | D |
Kevin McCarthy | @GOPLeader | Representative | R |
Lee Zeldin | @RepLeeZeldin | Representative | R |
Marco Rubio | @SenRubioPress @marcorubio | Senator | R |
Marjorie Taylor Greene | @RepMTG | Representative | R |
Marsha Blackburn | @MarshaBlackburn | Senator | R |
Matt Gaetz | @RepMattGaetz | Representative | R |
Mitt Romney | @SenatorRomney @MittRomney | Senator | R |
Nancy Pelosi | @TeamPelosi @SpeakerPelosi | Representative | D |
Patty Murray | @PattyMurray | Senator | D |
Pramila Jayapal | @RepJayapal @PramilaJayapal | Representative | D |
Rand Paul | @RandPaul | Senator | R |
Rick Scott | @SenRickScott | Senator | R |
Steny Hoyer | @LeaderHoyer @StenyHoyer | Representative | D |
Ted Cruz | @SenTedCruz | Senator | R |
Stage | Algorithms / Parameters |
---|---|
Embeddings | SBERT, SpaCy, Word2Vec |
Dimensionality reduction | UMAP, PCA, Base dimensionality model |
Clustering | HDBSCAN, k-Means |
Vectorizer | ngram_range, min_df, max_features |
Topic representation | reduce_frequent_words, bm25_weighting |
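The grid of candidate components above can be sketched as a simple search space in plain Python. The stage names and option strings below are illustrative labels mirroring the table, not BERTopic API identifiers:

```python
from itertools import product

# Candidate options for each configurable BERTopic stage
# (labels are illustrative; they mirror the table above).
search_space = {
    "embeddings": ["SBERT", "SpaCy", "Word2Vec"],
    "dim_reduction": ["UMAP", "PCA", "BaseDimensionalityReduction"],
    "clustering": ["HDBSCAN", "k-Means"],
}

# Every combination of embedding, reduction, and clustering choices
# that a grid search over these stages would have to evaluate.
combinations = list(product(*search_space.values()))
print(len(combinations))  # 3 * 3 * 2 = 18 pipeline variants
```

The vectorizer and topic-representation parameters multiply this count further, which is why each stage is typically tuned while holding the others fixed.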
Stage | Selected Algorithms / Parameters |
---|---|
Embeddings | Word2Vec |
Dimensionality reduction | Base dimensionality model |
Clustering | HDBSCAN |
Vectorizer | `ngram_range`: (1, 1); `min_df`: 5; `max_features`: 170,000 |
Topic representation | `bm25_weighting`: False; `reduce_frequent_words`: True |
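A minimal sketch of how the selected configuration could be assembled, assuming the `bertopic`, `hdbscan`, and `scikit-learn` packages. The Word2Vec embeddings would be precomputed separately (e.g., with Gensim) and passed to `fit_transform`; `tweets` and `w2v_embeddings` are placeholder names, and the HDBSCAN settings are left at their defaults rather than taken from the paper:

```python
from bertopic import BERTopic
from bertopic.dimensionality import BaseDimensionalityReduction
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Vectorizer settings from the table: unigrams only, ignore terms in
# fewer than 5 documents, cap the vocabulary at 170,000 features.
vectorizer_model = CountVectorizer(
    ngram_range=(1, 1), min_df=5, max_features=170_000
)

# Topic representation: plain c-TF-IDF (no BM25 weighting), with
# frequent-word reduction enabled.
ctfidf_model = ClassTfidfTransformer(
    bm25_weighting=False, reduce_frequent_words=True
)

topic_model = BERTopic(
    umap_model=BaseDimensionalityReduction(),  # skip dimensionality reduction
    hdbscan_model=HDBSCAN(prediction_data=True),
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
)

# Precomputed Word2Vec document embeddings are supplied at fit time:
# topics, probs = topic_model.fit_transform(tweets, embeddings=w2v_embeddings)
```

Passing `BaseDimensionalityReduction()` as the `umap_model` is BERTopic's documented way of disabling the reduction step, so the clusterer operates on the raw embedding space.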
Topic | Label | Name |
---|---|---|
0 | 0_reduction_prescription_cap_abortion | Abortion |
1 | 1_cancel_student_debt_borrowers | StudentDebt |
2 | 2_ketanji_jackson_brown_judge | JudgeBrown |
3 | 3_poll_early_location_register | Voting |
4 | 4_putin_kyiv_ukraine_ukrainian | UkraineWar |
5 | 5_marijuana_cannabis_legalize_possession | Cannabis |
6 | 6_commitmenttoamerica_housegop_replamalfa_built | #CommitmentToAmerica |
7 | 7_defendourdemocracy_victory_congressman_congratulations | #DefendOurDemocracy |
8 | 8_firebrand_episode_matt_feat | Firebrand |
9 | 9_sacrifice_veteransday_memorialday_memorial | MemorialDay |
10 | 10_vaccine_19_vaccinate_booster | CovidVaccine |
11 | 11_utah_wildfires_infrastructure_mitigation | Utah |
12 | 12_birthday_247th_usmc_wishing | Congratulations |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Mendonça, M.; Figueira, Á. Topic Extraction: BERTopic’s Insight into the 117th Congress’s Twitterverse. Informatics 2024, 11, 8. https://doi.org/10.3390/informatics11010008