Article

A Longitudinal Analysis of Artificial Intelligence Coverage in Technology-Focused News Media Using Latent Dirichlet Allocation and Sentiment Analysis

1 D.W. Daniel High School, Central, SC 29630, USA
2 School of Mathematical and Statistical Sciences, College of Science, Clemson University, Clemson, SC 29634, USA
* Author to whom correspondence should be addressed.
Journal. Media 2025, 6(4), 176; https://doi.org/10.3390/journalmedia6040176
Submission received: 24 August 2025 / Revised: 12 September 2025 / Accepted: 28 September 2025 / Published: 14 October 2025

Abstract

Understanding media discussions on artificial intelligence (AI) is crucial for shaping policy and addressing public concerns. The purpose of this study was to understand sentiment regarding AI in the media and to discover how the discussion of topics changed over time in technology-related media outlets. The study involved three overall steps: data curation and cleaning to obtain a high-quality, timely dataset from a list of relevant technology-news-oriented websites; sentiment analysis to understand the emotion of the articles; and Latent Dirichlet Allocation (LDA) to uncover the topics of discussion. The study curated and analyzed 22,230 articles from technology-focused media outlets published between 2006 and July 2024, split into three time periods. We found that discussion on AI-related topics has increased significantly over time, with sentiment generally positive. However, since 2022, both negative and positive sentiment proportions within articles have risen, suggesting growing emotional polarization. The introduction of ChatGPT-3.5 in November 2022 notably influenced media narratives. Machine learning remained a dominant topic, while discussion on business and investment, as well as governance and regulation, has gained prominence in recent years. This study demonstrates the impact of technological advancements on media discourse and highlights increasing emotional polarization in AI coverage in recent years.

1. Introduction

The capabilities of Artificial Intelligence (AI) systems have grown significantly over the last few years (Babina et al., 2024). The emergence of big data, better algorithms, and increased investment have all contributed to the growth of AI. AI applications are already having a noticeable impact on scientific research, healthcare, government, and economics (Davenport & Kalakota, 2019; Rahmani et al., 2023; Xu et al., 2021; Zuiderwijk et al., 2021). OpenAI’s ChatGPT-3.5 (OpenAI, 2022), a large language model, was released publicly in November 2022; it made the extensive abilities of AI evident to the public and led to quick adoption for a range of use cases (Heaven, 2023). However, the fast-paced uptake of AI has also been met with criticism (Huang et al., 2023). AI-backed systems are cited as having inherent bias, poor data privacy, and a lack of transparency (Jobin et al., 2019; Wang & Siau, 2018). Regulation of AI has also gained traction and focus, as seen in the European Union’s (EU) attempt to institute regulation through the EU AI Act (European Union, 2024). In addition to the discussions about AI in academia, government, and regulatory authorities, conversations about AI are increasing in technology-focused outlets and in the mainstream media (Fast & Horvitz, 2017). These articles suggest that the public is not only growing more aware of the possibilities of AI but is also discussing its impacts on individuals, workplaces, and communities (Howard, 2019; Maslej et al., 2024). Understanding how AI is discussed in the media over time provides insights into how public understanding of the topic might be changing (Ouchchy et al., 2020). The public’s perception and knowledge of AI are shaped by mass media and the news; one study found that mass media plays a key role in determining the public’s perception of a topic (Liao, 2023). Technology-focused media outlets often lead the discussion around technology-related topics such as AI. As the media covers a developing topic such as AI more extensively, it can affect the public’s beliefs and perception of that topic (Liao, 2023; Y. Liu & Li, 2021). This matters for policymaking, education about ethical use, and addressing emerging concerns among the public.
Past attempts to understand the emotion of media discussions about AI have primarily focused on changes in sentiment over time. One study found that sentiment toward AI became increasingly polar over time: positive articles became increasingly positive and negative articles increasingly negative (Moriniello et al., 2024). Another study looked at long-term trends in the public perception of AI and found that articles have been more optimistic than pessimistic (Fast & Horvitz, 2017). Other attempts at understanding public perception of AI analyzed Reddit and YouTube comments (Mohanna & Basiouni, 2024; Qi et al., 2023). However, many of these studies rely on datasets that are outdated and do not contain articles from recent years (Fast & Horvitz, 2017; Garvey & Maskal, 2019; Yi et al., 2023; Zhai et al., 2020). Additionally, many of these papers extract articles from only one or a limited number of websites, which is problematic due to biases that may exist within a single publication (Baron, 2006; Elejalde et al., 2018). Therefore, a more diverse and timely dataset is needed to effectively understand the discussion of AI in the media.
Studies have also examined AI-related topics and their evolution in media publications. One study (Zhai et al., 2020) found that three topics were most important: imagination, a commercial product, and scientific research. Another study found a focus on employment and concerns about privacy related to AI (Yi et al., 2023). In recent years, studies have discussed the loss of control of AI and have identified ethical concerns, along with an increasing discussion about AI applications within healthcare (Fast & Horvitz, 2017). Like the sentiment analysis studies, many of these studies utilize outdated datasets and do not properly connect sentiment with topic trends. Understanding the sentiment with which topics are mentioned over time is pertinent to determining how the public’s concerns change over time.
Past studies have employed a variety of approaches to studying emotion in discussions related to AI. Several past papers (Yi et al., 2023; Zhai et al., 2020) implemented Linguistic Inquiry and Word Count (LIWC), a text analysis program that serves as a metric for different emotions by calculating the percentage of words in each article that fall under one of its dimensions (Boyd et al., 2022). However, LIWC relies on a word-count approach and fails to capture some of the finer semantic distinctions (Schwartz et al., 2013). Furthermore, LIWC sometimes mis-categorizes words, which can be problematic (Franklin, 2015). Other papers used sentiment analysis to determine whether text was negative, neutral, or positive (Garvey & Maskal, 2019; Moriniello et al., 2024).
Studies have employed various methods to determine the topics of discussion within AI articles in the media, with the most notable approach being topic modeling. Topic modeling is an unsupervised machine-learning algorithm that looks at a dataset (corpus) of documents and discovers underlying topics (Kherwa & Bansal, 2019). The two topic modeling methods used in studies were latent Dirichlet allocation (LDA), “a generative probabilistic model for collections of discrete data such as text corpora” (Blei et al., 2003) and BERTopic, a topic model that utilizes transformer architecture and implements a c-TF-IDF algorithm (Grootendorst, 2022). Another study used low-level annotations and asked for binary labels to determine if a specific topic was present in a piece of text (Fast & Horvitz, 2017). While the sentiment analysis and topic modeling approaches are effective within existing studies, most of these papers fail to properly address the connection between sentiment trends and discussion topics over time.
Technology-focused media outlets are leading the discussion around AI and play an important role in reflecting and shaping public understanding of AI. Furthermore, past studies are outdated because they fail to analyze the crucial period of AI development that has occurred over the last couple of years. There are also methodological limitations in past studies, such as reliance on keywords to determine whether an article is about AI, which may result in the inclusion of articles without a strong focus on AI. This study utilizes a timely, diverse dataset to understand sentiment regarding AI in the media and to discover how the discussion of topics in the media changes over time. This study specifically addresses changes in sentiment and topics in AI-focused articles in technology-related media outlets.
This study elucidated sentiment about AI and AI-related discussion topics in technology-focused media outlets using sentiment analysis and latent Dirichlet allocation (LDA). The following four key research questions (RQ) were addressed:
RQ1: How has media output on AI-related topics changed over time in technology-focused media outlets?
RQ2: How has the emotion or sentiment related to AI changed over time?
RQ3: How has the frequency and presence of discussion topics changed over time?
However, RQ2 and RQ3 are more powerful when analyzed in relation to each other. Therefore, a fourth research question addresses the relationship between topics and emotion.
RQ4: Have topics been discussed with different sentiments over time?

2. Materials and Methods

The study involved three overall steps: data curation and cleaning to obtain a high-quality, timely dataset from a list of relevant technology-news-oriented websites; sentiment analysis to understand the emotion of the articles; and LDA to uncover the topics of discussion. Finally, sentiments were analyzed for key topics over time. The curated dataset, original scraping algorithms, and topic-similarity function code can be accessed at a public GitHub repository; its location is provided in the Data Availability Statement at the end of this article.

2.1. Dataset Curation and Cleaning

To address the research questions, we curated a study dataset of media articles solely about AI. To find the most relevant, credible, and popular sources, a list of tech news websites was obtained from two different sources that had the most comprehensive lists (FeedSpot, 2024; Shah, 2023). Then, we assessed each website in the list to ensure that it met the inclusion criteria for this study. In order to be included, (a) the website had to have a dedicated artificial intelligence page for articles (this allowed us to include only AI-related articles in this study), (b) it had to be scrapable via the Beautiful Soup package within Python version 3.11.9 (Richardson, 2007), (c) all articles had to be accessible from the AI page within the website, and (d) the website had to have a minimum of 500,000 cumulative followers. The websites that passed the criteria were CNET, Computer World, Engadget, Gizmodo, Hackaday, IEEE Spectrum, Mashable, Tech Crunch, The Verge, and Wired.

2.1.1. URL Extraction and Metadata Scraping

Once viable websites were identified, we developed a pipeline for extracting relevant metadata for this study. Relevant metadata for an article includes the publishing date, article length (number of words), and the text content of the article. All this metadata was extracted from the Uniform Resource Locator (URL) for each article. In keeping with the study objectives, only articles about AI were eligible for inclusion in the dataset. Previous studies have employed keyword extraction; however, this approach can be inadequate, since some of the diverse keywords used to discuss AI may be missed. We addressed this challenge by utilizing the dedicated AI pages of the selected outlets, ensuring that each article (categorized as AI-related by the outlet) was eligible for inclusion. Due to the sheer predicted size of the dataset (n > 10,000), extracting the text of each article manually was not feasible. Therefore, we programmed two scrapers (programs that extract data from websites) for each news outlet. The first scraper automatically extracted the URLs of AI-related news articles. The second scraper took as input the URLs produced by the first scraper and then scraped the text content of each article. The scrapers relied on the Python package Beautiful Soup (Richardson, 2007), which can extract the HTML content of a webpage. This HTML content could then be manipulated and cleaned to retrieve the relevant data. The time frame for the extracted data extends from when each website initiated its dedicated “Artificial Intelligence” section to July 2024.
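As an illustration, a minimal sketch of this two-stage pipeline is shown below, assuming the requests library alongside Beautiful Soup. The CSS selectors and tag names are hypothetical placeholders, since each outlet required its own pair of scrapers; the actual scrapers are in the repository referenced in the Data Availability Statement.

```python
import requests
from bs4 import BeautifulSoup  # Beautiful Soup (Richardson, 2007)

def extract_article_urls(ai_page_url: str) -> list[str]:
    """Stage 1: collect article URLs from an outlet's dedicated AI page."""
    html = requests.get(ai_page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # "a.article-link" is a placeholder; the real selector is outlet-specific.
    return [a["href"] for a in soup.select("a.article-link") if a.get("href")]

def scrape_article(url: str) -> dict:
    """Stage 2: extract publication date, length, and text from one article URL."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    text = " ".join(p.get_text(" ", strip=True) for p in soup.select("article p"))
    date_tag = soup.find("time")
    return {
        "url": url,
        "date": date_tag.get("datetime") if date_tag else None,
        "word_count": len(text.split()),
        "text": text,
    }
```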

2.1.2. Data Validation and Final Cleaning

To ensure that this dataset included high-quality, complete data, a validation process was undertaken. First, all duplicates in the dataset were removed (n = 79). Due to the unpredictable way that articles were formatted, achieving 100% accuracy of scraped text against actual text across the whole dataset was challenging. However, if the scraped text retained most of an article’s contents, it could be considered high quality, since it would preserve most of the article’s semantic information. We took a random sample of 10 scraped articles for each website, then manually copied the text from the URL of each of those articles. We set the manually copied text to be the ground truth and compared the scraped content against it. If a website achieved an average accuracy of over 95% across the sample, then it passed this stage of the data validation. If a website did not pass this test, we enacted an iterative process of tweaking the scraper and then validating it again. We then validated the timestamps for each article and set the accuracy threshold to 90%. If the accuracy of the scraped timestamps did not reach this threshold, another random sample was tested, with the requirement that the website reach 100% on the retest.
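The sketch below illustrates one plausible form of this accuracy check, using Python’s standard difflib to score scraped text against the manually copied ground truth; the paper does not specify the exact similarity measure, so this choice is an assumption.

```python
import difflib

def text_accuracy(scraped: str, ground_truth: str) -> float:
    """Word-level similarity ratio in [0, 1] between scraped and manual text."""
    return difflib.SequenceMatcher(None, scraped.split(), ground_truth.split()).ratio()

def website_passes(samples: list[tuple[str, str]], threshold: float = 0.95) -> bool:
    """Pass if mean accuracy over a sample of (scraped, manual) pairs exceeds 95%."""
    scores = [text_accuracy(scraped, manual) for scraped, manual in samples]
    return sum(scores) / len(scores) > threshold
```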

2.2. Sentiment Analysis

Sentiment analysis, also commonly referred to as opinion mining or opinion analysis (Wankhade et al., 2022), is the computational study of people’s attitudes towards what they may be writing about (Medhat et al., 2014). Generally, it tries to quantify the sentiment or attitude that a piece of text may exhibit. However, it is important to note that methods for sentiment analysis vary and that there are three general approaches: machine-learning, lexicon-based, and hybrid approaches (Wankhade et al., 2022). This study employed a machine-learning approach to sentiment analysis because it is stronger than lexicon-based approaches at determining sentiment from the contextual meaning of words, which is needed when an overall understanding of sentiment is required. This study specifically utilized a transformer-based sentiment analysis model. A transformer model consists of several layers of encoders and decoders combined with attention, normalization, and feed-forward neural networks. For more details on the transformer model, refer to Vaswani et al. (2017). The Bidirectional Encoder Representations from Transformers (BERT) model is a transformer model that is “designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers” (Devlin et al., 2018). BERT is superior to lexicon-based approaches for sentiment analysis because it weighs words in their relative context, therefore capturing a higher degree of overall semantic content (Alaparthi & Mishra, 2021).
This study used Cardiff NLP’s twitter-roberta-base-sentiment model, which is a RoBERTa-base model that was trained on ~58 million tweets and then fine-tuned for sentiment analysis (Barbieri et al., 2020). RoBERTa or Robustly Optimized BERT Pretraining Approach is a refined form of BERT that has significant performance gains compared to its predecessor (Z. Liu et al., 2021). This model was chosen for the study due to RoBERTa’s superior performance when compared to BERT.
The text from each article that was scraped during the data extraction step was input directly into RoBERTa. The text was not preprocessed since previous studies found that preprocessing did not significantly impact the performance of BERT models (Kurniasih & Manik, 2022). Furthermore, common NLP techniques like stop word removal or lemmatization can potentially remove words that may be semantically important. Each article was analyzed by RoBERTa, which generates a probability distribution over the sentiment classes (negative, neutral, and positive) for each article. Due to RoBERTa’s 512 token limit, the documents had to be split into chunks. Sentiment probabilities for each class were calculated for each chunk, and then the average probabilities for each class were calculated.
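A minimal sketch of this chunk-and-average procedure is given below, assuming the Hugging Face transformers library; the chunk size and helper name are illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def article_sentiment(text: str, chunk_tokens: int = 510) -> torch.Tensor:
    """Return averaged [negative, neutral, positive] probabilities for one article."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    probs = []
    for start in range(0, len(ids), chunk_tokens):
        chunk = ids[start:start + chunk_tokens]
        # Wrap each chunk in <s> ... </s> so it stays within RoBERTa's 512-token limit.
        input_ids = torch.tensor(
            [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
        with torch.no_grad():
            logits = model(input_ids).logits
        probs.append(torch.softmax(logits, dim=-1).squeeze(0))
    # Average the class probabilities across chunks.
    return torch.stack(probs).mean(dim=0)
```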

2.3. Topic Modeling: Latent Dirichlet Allocation

Topic modeling is a statistical model used in NLP to understand underlying semantic patterns in a set of documents (Kherwa & Bansal, 2019; Vayansky & Kumar, 2020). Blei et al. (2003) introduced latent Dirichlet allocation (LDA), “a generative probabilistic model for collections of discrete data such as text corpora”. LDA assumes that each document is a mixture of latent topics, and each topic is represented as a probability distribution over words (Jelodar et al., 2019). LDA functions by estimating the latent structure of topics through Bayesian inference of topic distributions in a corpus (Blei et al., 2003). LDA relies on three hyperparameters: the number of topics K, the Dirichlet prior α, and the Dirichlet prior β. K gives the number of topics for which the LDA model will learn topic–word distributions. The Dirichlet prior α controls the distribution of topics per document, which influences how mixed the topic proportions are across documents. The Dirichlet prior β controls the distribution of words per topic, which impacts how spread-out word distributions are across topics. The LDA model generates a topic–word distribution φk for each of the K topics from the prior distribution Dir(β), and a document–topic distribution θm for each of the M documents in the corpus from the prior distribution Dir(α). Once the LDA model is trained, it can give the topics of articles in the form of probability distributions over those topics. Likewise, each topic is represented as a probability distribution over the words that constitute it. The generative process of LDA is graphically demonstrated in Figure 1 and summarized below. This study used Gensim’s LDA multi-core model (Řehůřek & Sojka, 2010), due to its “fast, memory-efficient, scalable algorithms” for LDA.
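For reference, the standard LDA generative process (Blei et al., 2003) underlying this description can be summarized as follows, where z_{m,n} denotes the latent topic assignment of the n-th word in document m:

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dir}(\beta), \quad k = 1,\dots,K  && \text{(topic--word distributions)} \\
\theta_m  &\sim \mathrm{Dir}(\alpha), \quad m = 1,\dots,M && \text{(document--topic distributions)} \\
z_{m,n}   &\sim \mathrm{Multinomial}(\theta_m)            && \text{(topic assignment per word)} \\
w_{m,n}   &\sim \mathrm{Multinomial}(\varphi_{z_{m,n}})   && \text{(observed word)}
\end{align*}
```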

2.3.1. Text Preprocessing

Since LDA typically uses a Bag of Words (BoW) approach to topic modeling, text preprocessing needs to occur. A BoW approach was used to analyze text based on word counts. During preprocessing, each article was converted to lowercase. Then, all non-alphabetic characters were removed, and the article was tokenized. Tokenization allowed us to convert to a BoW format. We removed stop words (commonly used words such as “a, an, the” that do not carry important semantic data) from the dataset. Finally, we lemmatized each word, reducing it to its root form. The Python package NLTK (Bird & Loper, 2004) was used for common stop words, and the package spaCy (Honnibal et al., 2020) was used to lemmatize each word. The performance of the LDA model relies on high-quality text data; thus, preprocessing is an integral preparatory step.
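A compact sketch of this preprocessing pipeline is shown below, assuming the NLTK stop-word list and a small English spaCy model are installed; the function name is illustrative.

```python
import re
import nltk
import spacy
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
# Tagger and lemmatizer are kept; parser and NER are not needed for BoW prep.
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(article: str) -> list[str]:
    """Lowercase, strip non-alphabetic characters, drop stop words, lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", article.lower())
    doc = nlp(text)
    return [tok.lemma_ for tok in doc
            if tok.is_alpha and tok.text not in STOP_WORDS]
```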

2.3.2. Time Period Identification

After this step was completed, the dataset was separated into three distinct time periods. The distribution of AI-related articles in this dataset (Figure 2) across the entire period (2006–July 2024) suggested three natural groupings. Within some websites in the dataset, articles started being categorized as AI in 2006, and there was then a spike in the number of AI-tagged articles in 2016. The next major shift appeared in November 2022, when OpenAI’s ChatGPT-3.5 (OpenAI, 2022) was released and quickly gained traction. ChatGPT attracted 1 million users within the first five days of release and 100 million within two months (Milmo, 2023). Thus, the introduction of ChatGPT marked the start of the third period. Based on the observed trends in the dataset, the three time periods in order are [2006, 2016], [2016, November 2022], and [November 2022, July 2024]. Each dataset for these time periods contained the respective articles for that period and any metadata collected during scraping. A separate LDA model was trained on each time period.

2.3.3. Model Fine-Tuning

The number of topics K to be learned by the LDA model plays a large part in the model’s performance and overall semantic distinction. At the end of training, the LDA model should be able to split the corpus into distinct topics that bear some information or meaning (Wallach et al., 2009). As stated previously, the hyperparameters α and β impact the document–topic distribution and the topic–word distribution, respectively. High-quality hyperparameters give a good understanding of the general shape of the corpus. For example, a lower α incentivizes the model to assign fewer topics per document, and a higher α encourages a more spread-out topic distribution within documents. On the other hand, a lower β incentivizes the model to use fewer words per topic, and a higher β means that each topic has a more spread-out distribution of words. This study utilized the CV coherence score as its metric for determining which parameters were optimal (Röder et al., 2015). A coherence score measures the semantic similarity between the top words within a given topic (Röder et al., 2015). Every period was separately optimized for hyperparameter values using the coherence score, as each period has a different dataset associated with it. To determine the number of topics K, an LDA model was trained on the dataset with the experimental number of topics and then assessed using the CV coherence score. These LDA models were not saved, as their sole purpose was to be assessed for hyperparameter tuning. Each K in the range [5, 15] was tested and then scored using the CV coherence score. The number of topics K for periods 1, 2, and 3, respectively, was 12, 14, and 14. Figure 3 shows the experimental results.
After this step, LDA models were generated and tested with the number of topics held constant at the values previously identified. This study tested every combination of α and β over the selected options (symmetric, 0.01, 0.31, 0.61, and 0.91). The combination of α and β that provided the highest coherence score supplied the hyperparameters used for the final model (Appendix A).
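A sketch of this two-stage tuning loop is given below, using Gensim’s LdaMulticore and CoherenceModel; corpus, dictionary, and texts are assumed to come from the preprocessing step, and eta is Gensim’s name for the β prior.

```python
from gensim.models import LdaMulticore
from gensim.models.coherencemodel import CoherenceModel

def coherence_for(corpus, dictionary, texts, num_topics,
                  alpha="symmetric", eta="symmetric"):
    """Train a throwaway LDA model and return its CV coherence score."""
    lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                       alpha=alpha, eta=eta, passes=10, random_state=42)
    return CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                          coherence="c_v").get_coherence()

# Stage 1: choose K by scanning the range [5, 15].
# best_k = max(range(5, 16),
#              key=lambda k: coherence_for(corpus, dictionary, texts, k))

# Stage 2: with K fixed, grid-search every combination of alpha and eta.
# grid = ["symmetric", 0.01, 0.31, 0.61, 0.91]
# best_alpha, best_eta = max(
#     ((a, b) for a in grid for b in grid),
#     key=lambda ab: coherence_for(corpus, dictionary, texts, best_k, *ab))
```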

2.3.4. Training Models

Once the optimal hyperparameters were determined, an LDA model was then trained for each period. The LDA model for each period underwent 3000 iterations of training and inference. After each model was trained, it was then used to determine a probability distribution for the topics of each article within its period.
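A sketch of the final training and per-article inference step follows; best_k, best_alpha, and best_eta stand in for the tuned values from the previous step.

```python
from gensim.models import LdaMulticore

# Final model for one period, trained with the tuned hyperparameters.
final_lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=best_k,
                         alpha=best_alpha, eta=best_eta,
                         iterations=3000, random_state=42)

# Per-article topic probability distribution; minimum_probability=0 keeps every topic.
doc_topics = [final_lda.get_document_topics(bow, minimum_probability=0.0)
              for bow in corpus]
```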

2.3.5. Qualitative Analysis of Topics and Visualization

After the topics were generated, a qualitative analysis of the topics was performed. For each topic, a word cloud consisting of the most common words and a list of the 50 most frequent words was produced. Each topic was given a label or title (e.g., “Investment” or “Autonomous vehicles”). The title was broad and intended to generally cover the words from the word cloud and word list. Appendix A shows the word clouds, word lists, and associated topics for three topics (one in each period).
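The labeling materials can be produced in a few lines, as sketched below, assuming the wordcloud package and the trained model from the previous step.

```python
from wordcloud import WordCloud

for topic_id in range(final_lda.num_topics):
    top_words = final_lda.show_topic(topic_id, topn=50)  # [(word, prob), ...]
    print(topic_id, [word for word, _ in top_words])     # 50-word list for labeling
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate_from_frequencies(dict(top_words))
    wc.to_file(f"topic_{topic_id}.png")
```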

2.3.6. Similar Topics Across Different Periods

Topics from different periods may appear very similar in meaning to each other. Therefore, this study used cosine similarity to remove potential biases in determining similarity between topics in different time periods. Cosine similarity is a metric for finding the similarity between two pieces of text (Rahutomo et al., 2012). Cosine similarity was calculated between the 50 most frequently used words of pairs of topics. However, a naive similarity measure would not account for words that have the same semantic meaning. Thus, BERT word embeddings were used to better capture the meaning of similar words. Two topics are considered the same across periods if their cosine similarity is above a threshold of 0.9.
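One plausible implementation of this embedding-based similarity is sketched below, using the sentence-transformers library as a stand-in for BERT word embeddings; the model name is an assumption, and the original function is in the repository referenced in the Data Availability Statement.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed BERT-style encoder

def topic_similarity(words_a: list[str], words_b: list[str]) -> float:
    """Cosine similarity between the mean embeddings of two topics' top-50 words."""
    vec_a = encoder.encode(words_a).mean(axis=0)
    vec_b = encoder.encode(words_b).mean(axis=0)
    return float(np.dot(vec_a, vec_b) /
                 (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

# Two topics are treated as the same across periods if topic_similarity(...) > 0.9.
```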

3. Results

The findings from this study are organized to address the four key RQs. Section 3.1 describes the general trends in the dataset of AI articles in tech-focused media outlets (RQ1). Section 3.2 describes sentiment in the dataset related to AI (RQ2). Section 3.3 describes the key AI-related topics in different time periods and across time periods (RQ3). Section 3.4 describes how sentiment changes for key topics across time periods (RQ4).

3.1. General Trends in the Dataset

The dataset of AI-related articles from technology-focused media outlets curated for this study included 22,230 articles. Period 1 [2006, 2016] had 824 articles, period 2 [2016, November 2022] had 12,851 articles, and period 3 [November 2022, July 2024] had 8555 articles. Empirically, there are two significant jumps in the number of articles published: January 2016 and November 2022 (Figure 2). In period 2, AI-focused publication output was at its lowest in 2020. AI article publishing fluctuated through most of 2023 and 2024 and reached the dataset’s maximum in May 2024. However, this spike does not seem to be a continued trend as the two subsequent months show a decline. Figure 4 and Figure 5 show the distribution of articles across years and average article word counts for each website, respectively. The largest number of articles published in this dataset were from TechCrunch, while articles published in IEEE Spectrum and Wired had the longest articles.

3.2. AI-Related Sentiment Trends

Sentiment analysis was used in this study to understand the emotional content (positive, negative, neutral) of the articles. Intuitively, it makes sense that articles on tech-news websites would generally be more positive in sentiment regarding AI due to the nature of their audience. The three associated classes for an article’s sentiment—negative, neutral, and positive—represent a probability distribution for the sentiment of an article. In this study, the average sentiment distribution was calculated by taking the average of each class for the period or website that was being analyzed. Figure 6 shows the average sentiment distribution within each media outlet. Neutral sentiment was the largest sentiment class because most words in a piece of text do not carry emotional intensity (e.g., stop words). On average, the news outlet Gizmodo had a larger percentage of negative rather than positive sentiment in its articles. TechCrunch and Hackaday both contained large proportions of positive sentiment within their articles.
Average sentiment distribution was then extended to the whole dataset to see how the sentiment distribution changed over time (Figure 7). To obtain a representative sentiment distribution, the aggregate was normalized by dividing by the number of articles, so that a website with a higher number of published articles would not skew the results.
There were fewer articles in period 1. However, trends in those years are still crucial, as they show a period of relative volatility in which AI reporting was inconsistent. In 2017, the average positive and negative sentiment proportions roughly equaled each other. This year was also the global minimum of positive sentiment, indicating a year of more negative AI news reporting in tech media outlets. In 2021, the average negative sentiment was proportionally high, and the positive sentiment was relatively low. Intuitively, as an article becomes more positive in sentiment, it should become less negative. A strong negative correlation (r = −0.803, p < 0.001) was found between positive and negative sentiment within articles in this dataset. However, the sentiment distribution appears to have become more polar in recent years: negative and positive sentiment proportions both increased in period 3, between 2022 and 2024 (Figure 7).
To understand the average sentiment distribution at an even more granular level, the dataset was analyzed by month for the period between August 2022 and July 2024 (Figure 8). In January 2023, the negative and positive sentiment proportions roughly equaled each other. After that month, the average positive and negative percentages trended in opposite directions, leading up to April 2023, when most of the average article’s content was positive in sentiment. Between August and December 2023, the sentiment distribution of articles changed minimally. In April 2024, on average, 67% of each article was positive, suggesting a period of AI optimism. However, just three months later, in July 2024, negative sentiment on average made up over half of an article.

3.3. AI-Related Topics in the Media

The goal of the LDA implemented in this study was to determine which AI-related topics were being discussed in the media and how they trended over time. Figure 9 shows the topics for each period ranked from highest to lowest in terms of the number of articles that include a focus on that topic. In this study, we used a topic threshold of 25%, meaning an article is said to be about a topic if at least 25% of that article is composed of words associated with that topic grouping. There is no established threshold value in the literature; one study (Bastani et al., 2019) used a 20% threshold, and the current study adopted a more conservative threshold of 25%.
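In code, this threshold rule reduces to a simple filter over each article’s LDA topic distribution, as in this sketch:

```python
THRESHOLD = 0.25  # an article is "about" a topic if its proportion is >= 25%

def topics_present(doc_topic_dist, threshold=THRESHOLD):
    """doc_topic_dist: [(topic_id, proportion), ...] from get_document_topics."""
    return [topic_id for topic_id, proportion in doc_topic_dist
            if proportion >= threshold]
```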
The topics of discussion have changed over time. Certain topics emerge in different time periods—for example, ‘government/regulation’ and ‘investment’ (see yellow highlights in Figure 9). Similarly, some topics disappear—‘future/philosophy’ and ‘intelligence testing’. ‘Business and investment’ was one of the most frequent topics discussed in articles in period 2, and was the most frequent topic in period 3 (Figure 10).
We were also interested in tracking topics over the entire timeline. A “topic path” is the path a topic takes across several periods. Cosine similarity was used to identify whether two topics in different periods were similar and could therefore be analyzed as a topic path. For example, machine learning is common to all three periods, with a cosine similarity score of above 0.9 between any two periods. Cosine similarity scores are indicated over the arrows in Figure 9. There were several viable topic paths calculated, but this study chose to focus on a select few that met the cosine similarity score threshold (Figure 9) and had high topic frequency in one of the periods.
The percentage of all articles where the topic was present (no threshold), and the percentage of articles where the topic was present above the 25% threshold in each of the three time periods, is shown in Figure 10.
The selected topic paths from Figure 9 were analyzed using a metric called topic popularity. Topic popularity is the frequency of a topic at a given time (Bastani et al., 2019). It is calculated by aggregating a topic’s proportions within all articles in a period and then dividing by the number of articles in that period. The selected topic paths were then plotted in Figure 11, which shows the topic popularity score for each topic over time. For topics that crossed periods 2 and 3, most exhibited a relative decline in topic popularity when entering period 3. However, topic popularity for ‘natural language’ increased by 225.74% between January 2022 and November 2022. We hypothesized that this is due to the public release of ChatGPT-3.5 in November 2022. To further investigate this hypothesis, we compared the percentage of articles containing specific ChatGPT-related keywords in any amount between period 2 and period 3 (Figure 12). The differences in the frequency of use for each keyword between period 2 and period 3 are noteworthy: ‘ChatGPT’ goes from essentially not being mentioned in articles to being mentioned in roughly 40% of all articles in period 3. Additionally, ‘OpenAI’, the company that created ChatGPT, also entered media discussion and was mentioned in about 45% of all articles in period 3. The topic popularity results combined with the keyword analysis demonstrate how a significant technological advancement can affect the discourse within a field.
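As a sketch, the topic popularity metric defined above amounts to a mean of per-article topic proportions over a time window:

```python
import numpy as np

def topic_popularity(doc_topic_matrix: np.ndarray, topic_id: int) -> float:
    """doc_topic_matrix: (n_articles, K) per-article topic proportions for one window."""
    return doc_topic_matrix[:, topic_id].sum() / doc_topic_matrix.shape[0]
```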

3.4. Change in Topic Sentiment over Time

Combining sentiment and topic trends can show how sentiment regarding certain topics changes over time. The graphs in Figure 13 show the sentiment distributions for articles that exhibited the topic of interest at a proportion of at least 25%. For the topic paths that crossed periods 2 and 3, there were several jumps in negative sentiment right at the beginning of period 3 (November 2022). The negative sentiment proportion of the ‘Business and Investment’ and ‘Machine Learning’ topics rose by nearly 10% in November 2022. ‘Governance and Regulation’ had the steepest rise in negative sentiment, which may indicate a negative response in the media related to this topic. As noted throughout this study, OpenAI’s ChatGPT-3.5 public release in November 2022 appears to have had significant repercussions throughout the whole AI–media landscape.

4. Discussion

This study analyzed discussion trends regarding AI in articles from technology-focused news outlets. This was achieved by looking at how media output, the sentiment of articles, and discussion topics changed in AI-related articles over time. There are two significant jumps in the number of articles published: January 2016 and November 2022. The increase in the number of articles published after November 2022 is particularly significant and can be attributed to the public release of OpenAI’s ChatGPT-3.5. This event also helps explain interesting trends in topics and sentiment in this dataset.
This paper identified fluctuations in the distribution of sentiment over time. For the majority of the time periods, negative and positive sentiment seem to be inversely related; however, in recent years, both negative and positive sentiment proportions have been steadily increasing in published articles. This not only indicates a change in the usual trends but also highlights potential polarization in the way AI is discussed. In general, the results from this study show that AI news articles in technology-focused media outlets were usually written with a more positive sentiment than a negative one. Garvey and Maskal (2019) coined the term “Terminator Syndrome” to describe claims that negative media coverage of AI drives public concern. However, both their study and this one demonstrate the contrary: AI is generally talked about in a positive light in the media. This is consistent with the findings from other studies that found AI articles to be more optimistic than pessimistic (Fast & Horvitz, 2017). The two previous studies both had smaller datasets and were published several years ago. This study was therefore able to characterize the sentiment of AI articles during a critical period of development using a larger dataset.
The topics of discussion have also changed over time. The emergence of new topics (Figure 10) in different periods may reflect real-world events and concerns. In 2023, the Stanford Institute for Human-Centered Artificial Intelligence (Maslej et al., 2024) found that global private AI investment in startups was nearly 18 times higher than it was in 2013. We theorize that the significant increase in investment in AI-based startups may be attributed to growing optimism surrounding AI that resulted from higher computing power and technological breakthroughs. The increasing capabilities of AI-backed machines may have also spurred more concern regarding government and regulation. Therefore, it makes logical sense that ‘governance and regulation’ would show up in period 2 but then become one of the most significant topics in period 3. The decline of topics such as ‘intelligence testing’ and ‘future/philosophy’ may be due to those topics becoming too niche in the grand scheme of AI reporting, or due to the sparsity of articles in the earlier period. The identified topic paths are valuable in their ability to show how a topic changed across different periods. The significant jump in topic frequency for ‘natural language’ from period 2 to period 3 was most likely the result of the release of ChatGPT-3.5. In addition, the relative increase in the topic frequency ranking of ‘governance’ topics between periods 2 and 3 could be because these topics have been generating growing concern. A previous study that used LDA to understand topics found a relative increase in a topic that resembled ‘business and investment’ (Zhai et al., 2020). Fast and Horvitz (2017) found that concern regarding the control of AI has been increasing. The current study may not have explicitly identified ‘control of AI’ as a topic, but the increased prevalence of ‘governance and regulation’ topics in this dataset may reflect the growing discussion regarding control and regulation of AI-backed systems.
Previously published studies are not only outdated at this point but are also limited in their datasets. To our knowledge, the curated dataset developed for this study is the largest scraped dataset of AI news articles used in a research study. Further, this study fills a critical gap in our understanding of the rapidly changing AI landscape that has developed over the past couple of years. This study is also novel in that it connects the findings from topic modeling with sentiment analysis to find trends over time. Understanding the sentiment of articles for a whole period is useful for drawing conclusions about that period. However, to truly understand how a topic develops over time, looking at the sentiment distribution for a topic is critical. Therefore, this study is particularly useful for looking at discussion trends in a holistic way.
However, this study has some limitations. It only includes articles through the end of July 2024, and it focused only on technology-focused media outlets. Future research should aim to use a more ongoing approach to data extraction, in which the scraping pipeline is more adaptive and can thus be easily adjusted. Inclusion of articles from the general media would allow comparisons of sentiments and topics between technology-focused and non-technology-focused outlets. This paper assumed that the average sentiment class probabilities across the chunks of one article would be indicative of sentiment within the entire article. Due to computing constraints, hyperparameter tuning could only be run for 500 iterative passes through the dataset. While this was the best feasible option for determining the hyperparameter values, there is a chance the LDA model could have been optimized even further.

5. Conclusions

This study showed that discussion about AI-related topics has significantly increased in technology-focused media outlets since November 2022, demonstrating that media output on AI-related topics has greatly changed over time (RQ1). The overall sentiment related to various topics is generally positive. Additionally, both negative and positive sentiment proportions within articles have risen, indicating possible emotional polarization; analyzing these proportions and their changes over time addressed RQ2. The study used latent Dirichlet allocation to understand the frequency and presence of various topics, answering RQ3. The topic of machine learning was discussed at high frequency in all three time periods, while business and investment as well as governance and regulation emerged as important topics in periods 2 and 3. RQ4 was examined using sentiment analysis for given topics of interest. Several topics, such as ‘Business and Investment’, ‘Machine Learning’, and ‘Governance and Regulation’, experienced increases in negative sentiment proportion in November 2022, when OpenAI’s ChatGPT-3.5 was publicly released. These findings help to show the effects of a single event on the sentiment of each respective topic. This study succeeded in curating a large dataset of AI-focused articles and utilized a novel approach combining sentiment analysis with topic modeling, allowing key topic trends to be understood over time. The dataset and methods could be utilized by future researchers to further the work started in this study.

Author Contributions

Conceptualization, A.J.; methodology, A.J. and S.R.; formal analysis, A.J. and S.R.; investigation, A.J. and S.R.; data curation, A.J.; writing—original draft preparation, A.J.; writing—review and editing, S.R.; visualization, A.J.; supervision, S.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data, scraping algorithms, and topic–similarity function used in this study can be accessed at the GitHub repository at this link (https://github.com/AJain271/Tech-news-AI-articles-project, accessed on 23 August 2025). Because the scraped websites change rapidly, the scraping programs may no longer function as they did during the study.

Acknowledgments

The authors would like to thank Anjali Joseph for her help in reviewing and copyediting the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
BoW: Bag of Words
BERT: Bidirectional Encoder Representations from Transformers
EU: European Union
LDA: Latent Dirichlet Allocation
LIWC: Linguistic Inquiry and Word Count
RoBERTa: Robustly Optimized BERT Pretraining Approach
URL: Uniform Resource Locator

Appendix A

Appendix A.1. Websites That Passed Inclusion Criteria and Those That Failed Inclusion Criteria

1. TechCrunch [Passed inclusion criteria]
2. Wired [Passed inclusion criteria]
3. Engadget [Passed inclusion criteria]
4. The Verge [Passed inclusion criteria]
5. TechNewsWorld [Failed]
6. GeekWire [Failed]
7. CNET [Passed inclusion criteria]
8. Digital Trends [Failed]
9. Android Authority [Failed]
10. PCWorld [Failed]
11. The Next Web [Failed]
12. Silicon Valley Journals [Failed]
13. Tech Hubs Media [Failed]
14. Soft2Share [Failed]
15. Newskart [Failed]
16. Ars Technica [Failed]
17. Techmeme [Failed]
18. Gadgets 360 [Failed]
19. Techradar [Failed]
20. ZDNET [Failed]
21. TechSpot [Failed]
22. TechRepublic [Failed]
23. VentureBeat [Failed]
24. AppleInsider [Failed]
25. MacWorld [Failed]
26. Tech In Asia [Failed]
27. KnowTechie [Failed]
28. MIT Tech Review [Failed]
29. TechHive [Failed]
30. Mashable [Passed inclusion criteria]
31. Gizmodo [Passed inclusion criteria]
32. Cord Cutter News [Failed]
33. Life Hacker [Failed]
34. ComputerWorld [Passed inclusion criteria]
35. MakeUseOf [Failed]
36. HowToGeek [Failed]
37. Pymnts.com [Failed]
38. Product Hunt [Failed]
39. Pocket Lint [Failed]
40. Tom’s Guide [Failed]
41. Slash Gear [Failed]
42. The Information [Failed]
43. Term Sheet [Failed]
44. Ubergizmo [Failed]
45. 9to5 Mac [Failed]
46. Tech2 [Failed]
47. Recode [Failed]
48. IEEE Spectrum [Passed inclusion criteria]
49. O’Reilly [Failed]
50. Hackaday [Passed inclusion criteria]

Appendix A.2. Sample of CV Coherence Scores Testing for Hyperparameter Tuning

Table A1. Period 1 CV coherence values with alpha = 0.31.

Alpha   Beta        CV Coherence
0.31    symmetric   0.3908774736
0.31    0.01        0.3922563959
0.31    0.31        0.4031903854
0.31    0.61        0.4028762874
0.31    0.91        0.4661256387

Table A2. Period 2 CV coherence values with alpha = 0.01.

Alpha   Beta        CV Coherence
0.01    symmetric   0.5044535816
0.01    0.01        0.5028748581
0.01    0.31        0.5037658353
0.01    0.61        0.5419765267
0.01    0.91        0.552961287

Table A3. Period 3 CV coherence values with alpha = 0.61.

Alpha   Beta        CV Coherence
0.61    symmetric   0.5215005753
0.61    0.01        0.5242769182
0.61    0.31        0.5246141571
0.61    0.61        0.5380900789
0.61    0.91        0.5463081086

Appendix A.3. Word Clouds, Word Lists, and Associated Topics for Three Topics—One per Period

  • [‘computer’, ‘machine’, ‘system’, ‘google’, ‘learn’, ‘brain’, ‘ai’, ‘datum’, ‘he’, ‘learning’, ‘deep’, ‘researcher’, ‘network’, ‘way’, ‘algorithm’, ‘neural’, ‘artificial’, ‘image’, ‘these’, ‘build’, ‘company’, ‘language’, ‘program’, ‘research’, ‘facebook’, ‘world’, ‘understand’, ‘now’, ‘call’, ‘software’, ‘university’, ‘thing’, ‘go’, ‘technology’, ‘good’, ‘science’, ‘his’, ‘would’, ‘people’, ‘word’, ‘year’, ‘problem’, ‘even’, ‘many’, ‘model’, ‘who’, ‘much’, ‘know’, ‘then’, ‘help’]
Figure A1. Period 1 topic: Machine learning word cloud and word list.
  • [‘startup’, ‘million’, ‘founder’, ‘market’, ‘investor’, ‘funding’, ‘round’, ‘raise’, ‘venture’, ‘investment’, ‘he’, ‘capital’, ‘tech’, ‘lead’, ‘build’, ‘business’, ‘product’, ‘partner’, ‘our’, ‘ceo’, ‘billion’, ‘focus’, ‘fund’, ‘team’, ‘co’, ‘industry’, ‘last’, ‘techcrunch’, ‘base’, ‘early’, ‘china’, ‘big’, ‘invest’, ‘firm’, ‘come’, ‘announce’, ‘grow’, ‘over’, ‘series’, ‘who’, ‘customer’, ‘plan’, ‘found’, ‘world’, ‘service’, ‘growth’, ‘month’, ‘first’, ‘opportunity’, ‘global’]
Figure A2. Period 2 topic: Investment word cloud and word list.
  • [‘chatgpt’, ‘chatbot’, ‘openai’, ‘google’, ‘language’, ‘gpt’, ‘code’, ‘text’, ‘answer’, ‘datum’, ‘gemini’, ‘prompt’, ‘llm’, ‘question’, ‘information’, ‘response’, ‘large’, ‘bard’, ‘developer’, ‘write’, ‘source’, ‘generate’, ‘train’, ‘ask’, ‘generative’, ‘meta’, ‘release’, ‘open’, ‘system’, ‘give’, ‘bot’, ‘version’, ‘test’, ‘claude’, ‘access’, ‘word’, ‘available’, ‘example’, ‘call’, ‘anthropic’, ‘bing’, ‘research’, ‘task’, ‘provide’, ‘base’, ‘assistant’, ‘training’, ‘own’, ‘capability’, ‘launch’]
Figure A3. Period 3 topic: Large language models word cloud and word list.

References

  1. Alaparthi, S., & Mishra, M. (2021). BERT: A sentiment analysis odyssey. Journal of Marketing Analytics, 9(2), 118–126. [Google Scholar] [CrossRef]
  2. Babina, T., Fedyk, A., He, A., & Hodson, J. (2024). Artificial intelligence, firm growth, and product innovation. Journal of Financial Economics, 151, 103745. [Google Scholar] [CrossRef]
  3. Barbieri, F., Camacho-Collados, J., Espinosa Anke, L., & Neves, L. (2020). TweetEval: Unified benchmark and comparative evaluation for tweet classification. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the association for computational linguistics: EMNLP 2020. Association for Computational Linguistics. [Google Scholar]
  4. Baron, D. P. (2006). Persistent media bias. Journal of Public Economics, 90(1), 1–36. [Google Scholar] [CrossRef]
  5. Bastani, K., Namavari, H., & Shaffer, J. (2019). Latent Dirichlet allocation (LDA) for topic modeling of the CFPB consumer complaints. Expert Systems with Applications, 127, 256–271. [Google Scholar] [CrossRef]
  6. Bird, S., & Loper, E. (2004). NLTK: The natural language toolkit. In Proceedings of the ACL interactive poster and demonstration sessions. Association for Computational Linguistics. [Google Scholar]
  7. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. [Google Scholar]
  8. Boyd, R., Ashokkumar, A., Seraj, S., & Pennebaker, J. (2022). The development and psychometric properties of LIWC-22. Available online: https://www.liwc.app (accessed on 15 November 2024).
  9. Davenport, T., & Kalakota, R. (2019). The potential for artificial intelligence in healthcare. Future Healthcare Journal, 6(2), 94–98. [Google Scholar] [CrossRef]
  10. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers). Association for Computational Linguistics. [Google Scholar]
  11. Elejalde, E., Ferres, L., & Herder, E. (2018). On the nature of real and perceived bias in the mainstream media. PLoS ONE, 13(3), e0193765. [Google Scholar] [CrossRef]
  12. European Union. (2024). Regulation (EU) 2024/1689 of the European parliament and of the council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L 1689. Available online: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng (accessed on 23 August 2025).
  13. Fast, E., & Horvitz, E. (2017, February 4–9). Long-term trends in the public perception of artificial intelligence. Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), San Francisco, CA, USA. [Google Scholar] [CrossRef]
  14. FeedSpot. (2024). Top 100 tech news websites in 2024. FeedSpot. Available online: https://news.feedspot.com/tech_news_websites/ (accessed on 23 August 2025).
  15. Franklin, E. (2015). Some theoretical considerations in off-the-shelf text analysis software. Available online: https://aclanthology.org/R15-2002.pdf (accessed on 20 November 2024).
  16. Garvey, C., & Maskal, C. (2019). Sentiment analysis of the news media on artificial intelligence does not support claims of negative bias against artificial intelligence. Omics, 24(5), 286–299. [Google Scholar] [CrossRef]
  17. Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv, arXiv:2203.05794. [Google Scholar] [CrossRef]
  18. Heaven, W. D. (2023). ChatGPT is everywhere. Here’s where it came from. MIT Technology Review, 10, 1–5. [Google Scholar]
  19. Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy (version 3.7.2): Industrial-strength natural language processing in Python [Computer software]. Zenodo. [Google Scholar] [CrossRef]
  20. Howard, J. (2019). Artificial intelligence: Implications for the future of work. American Journal of Industrial Medicine, 62(11), 917–926. [Google Scholar] [CrossRef]
  21. Huang, C., Zhang, Z., Mao, B., & Yao, X. (2023). An overview of artificial intelligence ethics. IEEE Transactions on Artificial Intelligence, 4(4), 799–819. [Google Scholar] [CrossRef]
  22. Jelodar, H., Wang, Y., Yuan, C., Feng, X., Jiang, X., Li, Y., & Zhao, L. (2019). Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimedia Tools and Applications, 78(11), 15169–15211. [Google Scholar] [CrossRef]
  23. Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389–399. [Google Scholar] [CrossRef]
  24. Kherwa, P., & Bansal, P. (2019). Topic modeling: A comprehensive review. EAI Endorsed Transactions on Scalable Information Systems, 7(24), e2. [Google Scholar] [CrossRef]
  25. Kurniasih, A., & Manik, L. P. (2022). On the role of text preprocessing in BERT embedding-based DNNs for classifying informal texts. International Journal of Advanced Computer Science and Applications (IJACSA), 1024(512), 256. [Google Scholar] [CrossRef]
  26. Liao, C.-H. (2023). Exploring the influence of public perception of mass media usage and attitudes towards mass media news on altruistic behavior. Behavioral Sciences, 13(8), 621. [Google Scholar] [CrossRef]
  27. Liu, Y., & Li, X. (2021). Pro-environmental behavior predicted by media exposure, SNS involvement, and cognitive and normative factors. Environmental Communication, 15(7), 954–968. [Google Scholar] [CrossRef]
  28. Liu, Z., Lin, W., Shi, Y., & Zhao, J. (2021, August 13–15). A robustly optimized BERT pre-training approach with post-training. Chinese Computational Linguistics: 20th China National Conference (CCL 2021), Hohhot, China. [Google Scholar] [CrossRef]
  29. Maslej, N., Fattorini, L., Perrault, R., Parli, V., Reuel, A., Brynjolfsson, E., Etchemendy, J., Ligett, K., Lyons, T., Manyika, J., Niebles, J. C., Shoham, Y., Wald, R., & Clark, J. (2024). The AI index 2024 annual report. Stanford University. Available online: https://aiindex.stanford.edu/wp-content/uploads/2024/05/HAI_AI-Index-Report-2024.pdf (accessed on 21 October 2024).
  30. Medhat, W., Hassan, A., & Korashy, H. (2014). Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4), 1093–1113. [Google Scholar] [CrossRef]
  31. Milmo, D. (2023). ChatGPT reaches 100 million users two months after launch. The Guardian. Available online: https://www.theguardian.com/technology/2023/feb/02/chatgpt-100-million-users-open-ai-fastest-growing-app (accessed on 23 August 2025).
  32. Mohanna, S., & Basiouni, A. (2024). Consumer’s cognitive and affective perceptions of artificial intelligence (AI) in social media: Topic modelling approach. Journal of Electrical Systems, 20(3), 1317–1326. [Google Scholar] [CrossRef]
  33. Moriniello, F., Martí-Testón, A., Muñoz, A., Silva Jasaui, D., Gracia, L., & Solanes, J. E. (2024). Exploring the relationship between the coverage of AI in WIRED magazine and public opinion using sentiment analysis. Applied Sciences, 14(5), 1994. [Google Scholar] [CrossRef]
  34. OpenAI. (2022). ChatGPT. OpenAI. Available online: https://chatgpt.com/overview?openaicom_referred=true (accessed on 30 November 2022).
  35. Ouchchy, L., Coin, A., & Dubljević, V. (2020). AI in the headlines: The portrayal of the ethical issues of artificial intelligence in the media. AI & Society, 35(4), 927–936. [Google Scholar] [CrossRef]
  36. Qi, W., Pan, J., Lyu, H., & Luo, J. (2023). Excitements and concerns in the post-ChatGPT era: Deciphering public perception of AI through social media analysis. Telematics and Informatics, 92, 102158. [Google Scholar] [CrossRef]
  37. Rahmani, A. M., Rezazadeh, B., Haghparast, M., Chang, W.-C., & Ting, S. G. (2023). Applications of artificial intelligence in the economy, including applications in stock trading, market analysis, and risk management. IEEE Access, 11, 80769–80793. [Google Scholar] [CrossRef]
  38. Rahutomo, F., Kitasuka, T., & Aritsugi, M. (2012, October 29–30). Semantic cosine similarity. The 7th International Student Conference on Advanced Science and Technology (ICAST 2012), Seoul, Republic of Korea. [Google Scholar]
  39. Richardson, L. (2007). Beautiful soup documentation. Available online: https://ucilnica.fri.uni-lj.si/pluginfile.php/217774/mod_resource/content/1/beautiful-soup-4-readthedocs-io-en-latest.pdf (accessed on 21 August 2024).
  40. Röder, M., Both, A., & Hinneburg, A. (2015, January 31–February 6). Exploring the space of topic coherence measures. WSDM 2015: Eighth ACM International Conference on Web Search and Data Mining, Shanghai, China. [Google Scholar]
  41. Řehůřek, R., & Sojka, P. (2010, May 22). Software framework for topic modelling with large corpora. LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta. [Google Scholar]
  42. Schwartz, H. A., Eichstaedt, J., Blanco, E., Dziurzynski, L., Kern, M. L., Ramones, S., Seligman, M., & Ungar, L. (2013). Choosing the right words: Characterizing and reducing error of the word count approach. In M. Diab, T. Baldwin, & M. Baroni (Eds.), Second joint conference on lexical and computational semantics (* SEM), volume 1: Proceedings of the main conference and the shared task: Semantic textual similarity. Association for Computational Linguistics. [Google Scholar]
  43. Shah, A. (2023). Top 40 tech news websites list to follow in 2023. SeekaHost. Available online: https://www.seekahost.com/best-tech-news-websites/ (accessed on 20 May 2024).
  44. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017, December 4–9). Attention is all you need. 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA. [Google Scholar]
  45. Vayansky, I., & Kumar, S. A. P. (2020). A review of topic modeling methods. Information Systems, 94, 101582. [Google Scholar] [CrossRef]
  46. Wallach, H. M., Mimno, D., & McCallum, A. (2009, December 7–10). Rethinking LDA: Why priors matter. 23rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada. [Google Scholar]
  47. Wang, W., & Siau, K. (2018, May 17–18). Artificial intelligence: A study on governance, policies, and regulations. Thirteenth Midwest Association for Information Systems Conference, Saint Louis, MO, USA. [Google Scholar]
  48. Wankhade, M., Rao, A. C. S., & Kulkarni, C. (2022). A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review, 55(7), 5731–5780. [Google Scholar] [CrossRef]
  49. Xu, Y., Liu, X., Cao, X., Huang, C., Liu, E., Qian, S., Liu, X., Wu, Y., Dong, F., Qiu, C.-W., Qiu, J., Hua, K., Su, W., Wu, J., Xu, H., Han, Y., Fu, C., Yin, Z., Liu, M., … Zhang, J. (2021). Artificial intelligence: A powerful paradigm for scientific research. The Innovation, 2(4), 100179. [Google Scholar] [CrossRef]
  50. Yi, A., Goenka, S., & Pandelaere, M. (2023). Partisan media sentiment toward artificial intelligence. Social Psychological and Personality Science, 15(6), 682–690. [Google Scholar] [CrossRef]
  51. Zhai, Y., Yan, J., Zhang, H., & Lu, W. (2020). Tracing the evolution of AI: Conceptualization of artificial intelligence in mass media discourse. Information Discovery and Delivery, 48(3), 137–149. [Google Scholar] [CrossRef]
  52. Zuiderwijk, A., Chen, Y.-C., & Salem, F. (2021). Implications of the use of artificial intelligence in public governance: A systematic literature review and a research agenda. Government Information Quarterly, 38(3), 101577. [Google Scholar] [CrossRef]
Figure 1. Graphical representation of the LDA model process.
Figure 2. Frequency of AI-related articles in the study dataset by year (2006–July 2024).
Figure 3. Identification of K topics for each time period using CV coherence score.
Figure 4. Number of articles published in each media outlet included in the dataset.
Figure 5. Average word count of articles published in different media outlets.
Figure 6. Average sentiment distribution for articles published within each website.
Figure 7. Average percentage of positive, negative and neutral sentiment in articles across years.
Figure 8. Average normalized percentage of positive, negative and neutral sentiment in articles between August 2022 and July 2024.
Figure 9. Topics in each period in decreasing order of frequency along with topic paths.
Figure 10. Topic distributions for time periods 1, 2 and 3.
Figure 11. Topic popularity over time for chosen topic paths.
Figure 12. Comparison of the percentage of articles mentioning ChatGPT-related keywords in period 2 and period 3.
Figure 13. Sentiment distributions for chosen topic paths where the topic is discussed over the 25% threshold within an article.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
