1. Introduction
In the era of big data, the financial industry is increasingly leveraging vast and complex datasets to enhance market predictions and investment strategies [
1]. A substantial portion of these data consist of unstructured textual information—including news articles, social media posts, and financial reports—which contains valuable insights into market sentiment [
2]. Effectively analyzing this textual content is essential for understanding how sentiment influences stock price movements [
3]. Sentiment analysis, a subfield of natural language processing (NLP), has become a critical tool for extracting and quantifying emotions, opinions, and attitudes from text, thereby offering insight into the psychological drivers of market behavior [
4].
Recent advances in machine learning and statistical modeling have significantly improved the predictive power of sentiment analysis in financial contexts. In particular, deep learning techniques have enhanced both the accuracy and scalability of sentiment analysis, enabling the efficient processing of large-scale financial datasets [
5]. However, conventional sentiment analysis methods often reduce sentiment to a single aggregate score, potentially overlooking subtle emotional nuances that may be crucial for financial decision-making [
6].
To address these limitations, latent profile analysis (LPA) has emerged as a promising technique for uncovering hidden patterns within sentiment data. LPA is a probabilistic clustering method that identifies unobserved subgroups within datasets, offering a more nuanced segmentation than traditional clustering approaches such as K-means [
7]. In the context of sentiment analysis, LPA enables the discovery of distinct sentiment profiles, revealing deeper insights into investor behavior and stock price dynamics [
8].
Integrating sentiment analysis with LPA creates a powerful framework for analyzing big data in financial markets [
9]. By segmenting sentiment data into latent emotional profiles, researchers can better understand how diverse emotional patterns shape market dynamics [
6]. Furthermore, this combined approach enhances the predictive capacity of financial models by identifying complex sentiment patterns often missed by traditional techniques [
10].
This study applies sentiment analysis and latent profile analysis to a large corpus of financial news articles spanning nearly two decades, using Teva Pharmaceutical Industries as a case study. The primary objective is to identify latent sentiment profiles and evaluate their impact on stock price fluctuations. To the best of our knowledge, this is the first study to integrate LPA with sentiment analysis for the purpose of stock price prediction, presenting a novel approach to understanding the relationship between market sentiment and financial outcomes.
Additionally, this study makes a significant theoretical contribution by integrating multiple conceptual frameworks, including the Efficient Market Hypothesis (EMH), Behavioral Finance Theory, Sentiment Theory, and Signal Theory. This interdisciplinary perspective provides a robust foundation for analyzing the psychological, informational, and rational dimensions of market behavior. By combining big data analytics with diverse theoretical lenses, this research advances methodologies in financial forecasting and offers actionable insights for investors, analysts, and policymakers navigating the complexities of modern financial markets [
1].
Table 1 summarizes key studies on sentiment analysis and stock price prediction, which are further explored in Section 2.
2. Literature Review
Natural language conveys significant information about individuals’ emotional and cognitive coping mechanisms [16]. Research has shown that the words people use can offer insights into their physical and mental health [17,18,19]. This has sparked growing interest in the automated identification and extraction of opinions, emotions, and sentiments from text [20]. Various studies have employed diverse textual analysis techniques to predict fluctuations in stock prices.
2.1. Effects of Language on Stock Return
The influence of language on stock prices is well established. Eckhaus [
14] applied Natural Language Processing (NLP) to demonstrate that qualitative indicators of transformational leadership can predict stock performance. In a subsequent study, Eckhaus, Taussig, and Ben-Hador [
15] used NLP to show that Top Management Team (TMT) communication conveys tacit knowledge and cues relevant to stock prices. Tetlock, Saar-Tsechansky, and Macskassy [
10] explored whether quantitative measures of language could predict accounting earnings and stock returns, finding that the predictive power of negative words is strongest in news stories focused on fundamentals. Feng [
21] found that the presence of words such as “risk” and “uncertain” in annual reports correlates with lower annual earnings and stock returns.
Antweiler and Frank [
11] used a Naïve Bayes classifier to categorize Wall Street Journal corporate news stories by topic and predict their effect on trading behavior. In follow-up studies, Antweiler and Frank [
12], along with Das and Chen [
13], developed algorithms to generate “bullish”, “neutral”, or “bearish” ratings from Internet chat room messages and news stories. Both studies found significant predictive value for individual stock returns. Das and Chen [
13] emphasized capturing the emotive tone of text and discovered a correlation with stock values. Their approach combined a voting mechanism with additional classifiers, including Support Vector Machines (SVMs), to enhance prediction accuracy. Schumaker and Chen [
6] investigated the impact of financial news articles using three different textual representations—Bag of Words, Noun Phrases, and Named Entities—and showed that their model could forecast future stock prices.
While previous research has applied sentiment analysis to forecast market trends [
22], this study is, to the best of our knowledge, the first to integrate latent profile analysis with sentiment analysis for stock price prediction. This novel approach holds promise for significantly improved predictive accuracy.
The role of language and sentiment in shaping stock prices has been well documented. Studies have shown that linguistic features—whether from leadership communication, news media, or online discourse—can offer predictive insights into stock behavior [
10]. These findings highlight the importance of both qualitative and quantitative textual indicators in financial decision-making. Nevertheless, fully understanding these effects requires a solid theoretical grounding. Theories such as the Efficient Market Hypothesis (EMH) and Behavioral Finance provide essential frameworks for interpreting how sentiment and language influence market dynamics.
To summarize, this study addresses several important gaps identified in the existing literature. Despite significant advances in sentiment-based financial forecasting, key limitations remain. First, most prior research relies on aggregated sentiment scores, which tend to obscure subtle emotional nuances that may significantly influence market behavior. Second, the use of advanced segmentation techniques—such as latent profile analysis (LPA)—remains limited, despite their potential to uncover hidden emotional structures within sentiment data. Third, to the best of our knowledge, no previous studies have integrated sentiment analysis with LPA specifically for stock price prediction, representing a critical methodological gap. Finally, earlier work often relies on a single theoretical framework, lacking the integrated perspective provided by combining the Efficient Market Hypothesis, Behavioral Finance, Sentiment Theory, and Signal Theory. These gaps highlight the need for a more nuanced, probabilistic, and theoretically grounded approach to understanding and predicting financial market dynamics—an approach this study seeks to advance.
2.2. Theoretical Framework
2.2.1. Efficient Market Hypothesis (EMH) and Behavioral Finance Theory
The Efficient Market Hypothesis (EMH), developed by Eugene Fama in the 1960s, posits that stock prices fully incorporate all available information, rendering it impossible to consistently achieve excess returns through the analysis of either public or private data [
23]. However, recent research suggests that market sentiment, shaped by sources such as social media and financial news, can influence stock prices in ways not fully accounted for by the EMH. For example, Bollen et al. [24] demonstrated that sentiment analysis of Twitter data could predict movements in stock market indices, implying that public sentiment contains valuable information beyond traditional financial metrics. Such findings challenge the EMH’s core assumptions by emphasizing the influence of non-rational and emotional factors in price formation and suggesting that markets may not always function with perfect efficiency.
In response to these limitations, Behavioral Finance Theory—advanced by scholars such as Richard Thaler and Robert Shiller—explores how psychological and emotional factors impact financial decision-making. Contrary to the EMH’s assumption of investor rationality, Behavioral Finance contends that cognitive biases and heuristics often drive behavior, giving rise to phenomena like herding behavior and overreaction [
25,
26]. Herding occurs when investors collectively follow market trends, thereby amplifying price volatility. Overreaction refers to exaggerated responses to news or events, causing temporary deviations from fundamental values. Public mood and sentiment—central constructs in Behavioral Finance—play a critical role in shaping these dynamics. For instance, overly negative sentiment during downturns may trigger panic selling, while excessive optimism can inflate speculative bubbles.
Together, these insights challenge the rational agent model proposed by the EMH and underscore the importance of sentiment as a driving force in market dynamics.
2.2.2. Sentiment Theory and Signal Theory
Complementing Behavioral Finance, Sentiment Theory examines how emotions and attitudes influence decision-making, particularly in financial contexts. Sentiment functions as an intermediary between available information and market reactions, often causing asset prices to diverge from intrinsic valuations. Excessive optimism can lead to asset bubbles, while persistent pessimism may result in undervaluation. Sentiment indicators—such as consumer confidence indices and media tone analyses—have been shown to predict market behavior, underscoring the psychological underpinnings of asset price movements [
3]. The theory is especially relevant to financial forecasting, as it promotes the quantification of sentiment to identify latent emotional patterns that drive investor behavior.
Signal Theory, developed by Spence [
27], offers a complementary lens by explaining how agents interpret signals in situations of asymmetric information. In financial markets, sentiment may act as a form of signaling, reflecting collective investor expectations and perceptions of market conditions. During times of uncertainty, for example, investors might rely on sentiment signals derived from media coverage or social media trends as proxies for fundamental information. The utility of these signals depends on their credibility, clarity, and the interpretive context. When paired with Sentiment Theory, Signal Theory helps explain how investors convert emotional cues into actionable strategies, particularly under conditions of ambiguity or incomplete information.
2.2.3. Integration and Relevance
Integrating EMH, Behavioral Finance, Sentiment Theory, and Signal Theory provides a comprehensive theoretical foundation for this study. The EMH establishes a baseline of rational market behavior, while Behavioral Finance accounts for deviations driven by cognitive and emotional factors. Sentiment Theory systematically quantifies these emotional influences, and Signal Theory situates sentiment within the broader context of communication under uncertainty. Together, these frameworks support the application of latent profile analysis (LPA) in this study to uncover hidden sentiment profiles that bridge traditional financial metrics with the emotional dimensions of investor behavior. By adopting this integrated perspective, the study offers a deeper understanding of the dynamic interplay between sentiment, decision-making, and stock price movements.
The integration of the EMH, Behavioral Finance, Sentiment Theory, and Signal Theory provides a multi-dimensional lens through which to interpret the complex dynamics of financial forecasting. While the EMH establishes a normative baseline of market rationality and information efficiency, Behavioral Finance challenges this view by highlighting systematic deviations driven by psychological biases and emotional responses. Sentiment Theory complements this by offering a framework for quantifying and analyzing those emotional patterns, thereby operationalizing key behavioral constructs. Signal Theory further enhances the explanatory framework by positioning sentiment as a communicative signal that operates in environments of uncertainty and information asymmetry. Together, these theories are not contradictory but rather complementary: the EMH frames the ideal, Behavioral Finance explains deviations, Sentiment Theory captures and measures the emotional component of those deviations, and Signal Theory describes the mechanism by which sentiment influences decision-making. Their integration thus offers a coherent and comprehensive foundation for modeling sentiment-driven stock price behavior.
2.3. Latent Profile Analysis in Sentiment Analysis
Latent profile analysis (LPA) is a categorical latent variable method used to identify unobserved subpopulations within a population based on a set of continuous indicators [
28]. In recent years, LPA has gained increasing attention in work and organizational sciences [
28]. Its ability to detect hidden subgroups makes it particularly valuable in sentiment analysis. In this domain, textual data are often converted into sentiment scores or categories—such as positive, neutral, or negative [
29]. Traditional methods typically rely on predefined sentiment categories or continuous sentiment scales, which may fail to capture the nuanced patterns embedded in textual data. LPA addresses this limitation by uncovering latent sentiment profiles, revealing underlying structures that are not immediately observable [
9].
LPA can be applied to examine sentiment distributions across subgroups—such as customer segments, demographic clusters, or product reviews—exposing distinct patterns in sentiment expression [
30]. By modeling the data as originating from a mixture of distributions, LPA detects subtle variations in sentiment, such as differences between strongly positive, moderately positive, or ambivalent expressions. This enables researchers to move beyond simplistic categorizations and to identify sentiment profiles influenced by contextual factors, linguistic style, or cultural background [
7].
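The mixture formulation described above can be written compactly. As a sketch in generic notation (the symbols K, π_k, μ_k, and Σ_k are introduced here for illustration and are not taken from the cited sources), an observation x made up of continuous indicators is modeled as:

```latex
f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1
```

Here K is the number of latent profiles, π_k is the share of observations belonging to profile k, and μ_k and Σ_k are the mean vector and covariance matrix of the indicators within that profile.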
Importantly, LPA assigns data points to latent profiles probabilistically, offering a measure of uncertainty in profile classification. This feature is especially useful in sentiment analysis, as it reflects the inherently subjective and ambiguous nature of emotional expression. For instance, in user-generated content like social media posts, LPA can detect distinct sentiment patterns based on tone, intensity, or the frequency of emotionally charged language [
31].
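The probabilistic assignment described above follows from Bayes' rule once the profile parameters are known. The following minimal Python sketch illustrates the idea for a single indicator. The two profiles, their weights, means, and standard deviations are hypothetical values chosen for illustration, and the univariate setting is a simplifying assumption.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_membership(x, weights, means, sds):
    """Posterior probability of each profile given x, via Bayes' rule."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-profile example: a "negative" and a "positive" profile.
weights = [0.5, 0.5]
means = [-0.5, 0.5]
sds = [0.3, 0.3]

# A score exactly midway between the two means is maximally ambiguous.
probs = posterior_membership(0.0, weights, means, sds)
```

A score lying midway between the two profile means receives equal membership probabilities, whereas a score close to one mean is assigned almost entirely to that profile. A hard-assignment method would discard this information.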
Integrating LPA into sentiment analysis workflows enhances researchers’ ability to capture the complexity of emotional expression in textual data. By identifying unobserved subgroups based on individuals’ sentiment patterns, LPA supports a more refined understanding of sentiment dynamics. As sentiment analysis continues to evolve, advanced methods like LPA offer a crucial bridge between quantitative rigor and qualitative insight into emotional content [
9].
2.4. Advantages of LPA
LPA offers several advantages over traditional clustering methods:
Clear Statistical Framework. LPA is grounded in an explicit statistical model that enables the assessment of model fit, unlike algorithms such as K-means, which lack a formal probabilistic structure [
31].
Probabilistic Classification. LPA provides posterior probabilities for profile membership, allowing for uncertainty in classification. This is especially valuable in contexts where group boundaries are not clearly defined, such as emotional or subjective data [
32].
Parameter Estimation by Group. LPA estimates unique parameters (e.g., means and variances) for each latent profile, offering greater flexibility than distance-based methods like K-means [
7].
Objective Model Selection. LPA supports model selection tools such as the Bayesian Information Criterion (BIC) to determine the optimal number of latent profiles, reducing reliance on arbitrary or subjective criteria common in other methods [
31]. Simulation-based comparisons have demonstrated LPA’s superiority over K-means and ensemble clustering methods in a variety of clustering tasks [
7,
33,
34].
LPA was selected over alternative clustering techniques—such as K-means, Hierarchical Clustering, or Gaussian Mixture Models (GMMs)—due to its unique advantages in modeling probabilistic membership and accounting for uncertainty in classification. Unlike K-means and Hierarchical Clustering, which rely on distance-based heuristics and assume hard cluster assignments, LPA uses a model-based approach that provides posterior probabilities for profile membership, allowing for a more nuanced understanding of sentiment distribution. Although GMM shares some conceptual similarity with LPA in modeling data as a mixture of distributions, LPA offers a more structured framework for modeling latent categorical variables and facilitates integration into regression and structural equation models. These features make LPA particularly suitable for analyzing sentiment data, which are inherently ambiguous, subjective, and often non-linearly distributed. As such, LPA aligns more closely with the conceptual and statistical needs of this study.
2.5. The Case Study of Teva Pharmaceutical Industries
Teva Pharmaceutical Industries Ltd. is a global pharmaceutical and biotechnology company headquartered in Israel, and one of the world’s leading producers of generic medications [
35]. Founded in 1901, Teva offers a broad portfolio of more than 3500 products, reaching nearly 200 million patients across six continents daily.
Teva serves as an ideal case study for applying latent profile analysis (LPA) to sentiment analysis in the context of stock price prediction. The company’s global leadership, combined with its historically volatile market performance, provides a rich context for examining the relationship between sentiment and stock behavior. Between 2000 and 2018, Teva experienced substantial fluctuations due to a variety of factors, including regulatory hurdles, mergers and acquisitions, patent expirations, and broader market forces. This period encompasses a diverse range of conditions—from rapid growth to significant downturns—offering a comprehensive dataset for sentiment analysis.
In addition, Teva’s prominence in both Israeli and international markets generates a high volume of sentiment-rich content, including news coverage, analyst commentary, and social media discourse. This abundance of data makes Teva particularly suitable for investigating how market sentiment correlates with stock performance.
Limiting the analysis to events up to 2018 ensures a focused dataset that captures organic market dynamics without the confounding influence of more recent disruptions. After 2018, Teva faced a series of extraordinary events, including legal exposure related to the opioid crisis and major organizational restructuring, which introduced atypical market behaviors driven by external shocks rather than sentiment-driven trends. Furthermore, the onset of the COVID-19 pandemic in early 2020 caused unprecedented volatility across global markets, including pharmaceuticals. These disruptions reflect irregular conditions that obscure typical sentiment-to-price relationships.
By ending the dataset in 2018, this study preserves a clear analytical environment for examining the predictive capabilities of sentiment analysis using LPA. This timeframe enables the identification of meaningful sentiment patterns and their relationship to stock performance, free from distortion by exceptional or exogenous crises.
3. Methods
3.1. Sample
To construct the article database, a systematic approach was employed. Two prominent financial news platforms—Bloomberg [
36] and Reuters [
37]—were selected as primary sources for financial articles. All articles containing the company name “Teva” were extracted from these platforms. Each article was then matched with Teva’s stock price on the next trading day following its publication.
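The matching step can be sketched as a lookup of the first trading day strictly after the publication date. This is an illustrative Python sketch only. The function name and the price values are hypothetical, and the sketch assumes a sorted list of trading days aligned with a list of prices.

```python
from bisect import bisect_right
from datetime import date

def next_trading_day_price(published, trading_days, prices):
    """Return (day, price) for the first trading day strictly after `published`.

    `trading_days` must be sorted ascending and aligned with `prices`.
    Returns None when no later trading day is available.
    """
    i = bisect_right(trading_days, published)
    if i == len(trading_days):
        return None
    return trading_days[i], prices[i]

# Hypothetical price series (values are illustrative, not real quotes).
trading_days = [date(2017, 8, 3), date(2017, 8, 4), date(2017, 8, 7)]
prices = [10.0, 11.0, 12.0]

# An article published on Friday 4 August 2017 maps to Monday 7 August 2017.
match = next_trading_day_price(date(2017, 8, 4), trading_days, prices)
```

Using a strict "after" comparison means that articles published on weekends or holidays are also mapped to the next available trading day.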
To ensure data quality and relevance, a thorough data cleaning process was performed. Articles were excluded if they met one or more of the following criteria: (1) empty content or missing text; (2) corrupted encoding resulting in unreadable characters or gibberish; (3) articles containing fewer than five meaningful words, which were typically metadata-only entries or placeholders; and (4) documents unrelated to Teva, based on a manual inspection of the headline and body text. This filtering process aimed to retain only complete and content-rich articles that were suitable for sentiment analysis.
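The exclusion criteria above could be approximated in code along the following lines. This is a rough Python sketch under stated assumptions, not the procedure actually used. The function name is hypothetical, the Unicode replacement character is used as a crude proxy for corrupted encoding, "meaningful words" are approximated as alphabetic tokens of at least two letters, and the manual relevance check in criterion (4) is replaced by a simple keyword test.

```python
import re

MIN_WORDS = 5  # threshold stated in criterion (3)

def is_usable(text, company="Teva"):
    """Return True if an article passes simplified versions of criteria (1)-(4)."""
    if not text or not text.strip():      # (1) empty or missing text
        return False
    if "\ufffd" in text:                  # (2) crude proxy for corrupted encoding
        return False
    words = re.findall(r"[A-Za-z]{2,}", text)
    if len(words) < MIN_WORDS:            # (3) fewer than five meaningful words
        return False
    # (4) automated stand-in for the manual relevance inspection
    return company.lower() in text.lower()
```

In practice, the automated checks would only pre-screen the corpus. Relevance to the company still requires the manual inspection described above.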
The collection comprised 1496 Bloomberg articles published between 26 June 2000 and 28 February 2018, and 2431 Reuters articles published between 1 January 2012 and 9 August 2017 (3927 articles in total). After the data cleaning process described above, 3843 articles remained. Each article was paired with its corresponding valence score and the stock price on the subsequent trading day, yielding the dataset used for analysis.
3.2. Sentiment Analysis
For sentiment analysis, this study employs the Valence Aware Dictionary and sEntiment Reasoner (VADER) [4,38], a widely used lexicon- and rule-based tool for analyzing textual data. VADER integrates lexical features with syntactic and grammatical conventions [39], making it particularly effective in detecting the emotional tone of written content. The tool generates a sentiment score ranging from −1 to +1, where −1 represents extremely negative sentiment, +1 indicates extremely positive sentiment, and values near 0 suggest neutral sentiment.
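To illustrate why the score stays within the −1 to +1 range, the sketch below reproduces the normalization that, to our understanding, VADER's reference implementation applies to the summed word valences, namely x / sqrt(x² + α) with α = 15. Everything else is a simplifying assumption: the toy lexicon and its values are invented for illustration, and VADER's rule-based adjustments (negation, intensifiers, punctuation, capitalization) are omitted.

```python
import math

# Toy lexicon with made-up valence values; VADER's real lexicon is far larger.
TOY_LEXICON = {"gain": 2.0, "strong": 1.5, "loss": -2.0, "lawsuit": -1.5}

def normalize(raw_sum, alpha=15.0):
    """Squash an unbounded valence sum into a value between -1 and 1."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

def compound_score(text):
    """Sum toy lexicon valences over whitespace tokens, then normalize."""
    tokens = text.lower().split()
    raw = sum(TOY_LEXICON.get(tok, 0.0) for tok in tokens)
    return normalize(raw)
```

Text containing no lexicon words scores exactly 0, and increasingly emotional text approaches, but does not exceed, the bounds of −1 and +1.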
3.3. Segmentation
From the set of available variables, valence and stock-related indicators—price, open, high, and low—were selected for analysis. Latent profile analysis (LPA) was performed using the mclust package in R [
40].
To assess model fit, the following criteria were considered: the Bayesian Information Criterion (BIC), the Bootstrapped Likelihood Ratio Test (BLRT), and the distinctiveness of the profiles [41,42]. The BIC compares models while penalizing complexity, with the lowest value identifying the best-fitting model. The BLRT compares nested models through resampling and provides p-values indicating whether an additional profile significantly improves model fit. These criteria were used together to balance statistical accuracy with model parsimony.
Models with an increasing number of latent profiles were fitted iteratively until the BLRT yielded a non-significant result, suggesting that adding further profiles did not improve model fit.
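The two criteria can be sketched in a few lines of Python. The BIC below uses the standard definition k·ln(n) − 2·ln(L), under which lower values are better. Some software reports the sign-flipped quantity, in which case higher values are better. The stopping rule mirrors the iterative procedure described above. The function names, the 0.05 significance level, and the p-values are assumptions for illustration and are not the study's results.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion in the 'lower is better' convention."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def last_significant_k(blrt_p_values, alpha=0.05):
    """Return the largest k reached before the BLRT turns non-significant.

    blrt_p_values[k] is the p-value for comparing k profiles against k - 1.
    """
    best_k = 1
    for k in sorted(blrt_p_values):
        if blrt_p_values[k] >= alpha:
            break
        best_k = k
    return best_k

# Hypothetical p-values for illustration only.
p_values = {2: 0.001, 3: 0.001, 4: 0.02, 5: 0.30}
chosen = last_significant_k(p_values)
```

With these illustrative values, the fifth profile is the first that fails to improve fit significantly, so the procedure stops at four profiles.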
The BIC favored a nine-profile solution, while the BLRT supported an eight-profile model. In the nine-profile model, however, the additional (ninth) profile included fewer than 1% of the observations and did not exhibit statistical distinctiveness from the preceding profile. Therefore, based on parsimony and profile clarity, the eight-profile model was selected as the optimal solution.
The decision to select the eight-profile solution over the nine-profile model was based on a combination of model fit indices and substantive interpretability. While the Bayesian Information Criterion (BIC) favored the nine-profile model due to a slightly lower value, the Bootstrapped Likelihood Ratio Test (BLRT) indicated that the improvement in model fit from the eighth to the ninth profile was not statistically significant (p = 0.15), suggesting diminishing returns from model complexity. Furthermore, the ninth profile contained fewer than 1% of the total observations and lacked distinctiveness in both sentiment and price variation when compared to the adjacent profiles. This raised concerns about overfitting and reduced practical interpretability. Sensitivity analysis confirmed that the inclusion of the ninth profile did not materially change the classification of data points in the other profiles or improve the model’s predictive accuracy. Therefore, the eight-profile solution was chosen as the most parsimonious and interpretable model that balances statistical rigor with conceptual clarity.
Model fit indices are presented in Table 2.
Figure 1 presents the correlation heatmap of the latent profiles. As shown, no strong correlations (i.e., above 0.7) were observed between the profiles. This suggests that the profiles are relatively independent from one another in terms of linear associations and do not exhibit high similarity. These results support the conclusion that the latent profiles reflect distinct patterns within the dataset.
This low correlation structure further justifies the segmentation, indicating that each profile captures a unique emotional-sentiment structure with potentially different implications for stock price behavior. It also supports the use of all profiles in parallel regression modeling without excessive concern for multicollinearity.
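A check of this kind can be sketched as a pairwise Pearson correlation screen against the 0.7 threshold. The Python sketch below assumes that each profile is represented by a numeric column. That representation, the column names, and the values are hypothetical and serve only to illustrate the screening logic.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def strongly_correlated_pairs(columns, threshold=0.7):
    """Return name pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        if abs(pearson(a, b)) > threshold:
            flagged.append((name_a, name_b))
    return flagged

# Hypothetical columns: p1 and p2 move together, p3 is unrelated to both.
columns = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [2.0, 4.0, 6.0, 8.0],
    "p3": [1.0, -1.0, -1.0, 1.0],
}
flagged = strongly_correlated_pairs(columns)
```

An empty result from such a screen corresponds to the situation reported above, in which no pair of profiles exceeds the threshold.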
Figure 2 presents the distribution of sentiment scores (VALENCE) across the eight latent profiles.
Figure 2 illustrates the emotional structure of each profile using a violin plot, highlighting differences in both central tendency and variance. For example, Profile 1 exhibits a consistently high positive sentiment, whereas Profile 2 is characterized by strongly negative scores. These distributional patterns are consistent with the regression results and further support the presence of distinct emotional dynamics within each latent profile.
The assignment of news articles to the profiles is described in Table 3.
3.4. Analysis
The analysis began with a simple linear regression model, in which price was the dependent variable and valence served as the independent variable. Next, eight separate regressions were conducted, each using one of the latent profiles as a single independent variable, again predicting price.
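The first step, a simple regression of price on a single predictor, can be sketched with the closed-form least-squares solution. The Python sketch below is illustrative only. The function name is hypothetical, and the toy data are constructed to lie exactly on a line so that the result is easy to verify by hand.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

# Hypothetical valence scores and next-day prices (not the study's data).
valence = [-0.5, 0.0, 0.5, 1.0]
price = [9.0, 10.0, 11.0, 12.0]
intercept, slope, r2 = fit_simple_ols(valence, price)
```

The same function applies to the profile-level regressions when the predictor is replaced by a single profile variable. Real data would not fit perfectly, and the coefficient of determination would fall below 1.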
In the following step, all eight profiles were included simultaneously as predictors in a single linear regression model aimed at predicting price. This allowed for a direct comparison between models in terms of their predictive power.
To control for potential confounding effects, the number of words in each article was added as a control variable across all profile-based models, under the assumption that article length may influence the valence score.
To evaluate the overall goodness-of-fit of the models, Structural Equation Modeling (SEM) was employed [
43,
44]. Model fit was assessed using several common indices: Comparative Fit Index (CFI), Normed Fit Index (NFI), Tucker–Lewis Index (TLI), and Root Mean Square Error of Approximation (RMSEA). According to conventional thresholds, values of CFI, NFI, and TLI ≥ 0.95, and RMSEA < 0.06 indicate a good model fit [
45].
To ensure methodological transparency and facilitate future replication, we summarize here the key components of the data collection, preprocessing, and modeling pipeline. The full process was conducted using widely accessible tools. Articles were retrieved from Bloomberg and Reuters via keyword-based queries, followed by a combination of manual and automated filtering procedures, as described in Section 3.1. Sentiment scoring was performed using the VADER tool, ensuring consistent and replicable sentiment extraction. Latent profile analysis was conducted in R using the mclust package, and regression models were implemented using standard statistical libraries. While proprietary access to Bloomberg and Reuters may limit full data replication, the methodological pipeline (data preparation, sentiment analysis, and model specification) can be reproduced with similar datasets. We encourage future researchers to adapt and extend this framework, and we have provided detailed descriptions of each analytical stage to support transparency and reproducibility.
To clarify, the novelty of the proposed work does not lie in the use of latent profile analysis (LPA) or sentiment analysis independently—both of which have been applied in various domains—but rather in their integration for the purpose of stock price forecasting. To the best of our knowledge, this specific combination has not been explored in previous research. This methodological innovation enables the identification of latent sentiment structures within financial news and their direct linkage to stock market behavior, offering a new and powerful approach to sentiment-based financial prediction.
4. Results
To evaluate the predictive power of the profile-based approach compared to the direct use of valence, we first conducted a linear regression in which price served as the dependent variable and valence as the independent variable. Although the regression was statistically significant (p < 0.001), the model exhibited a relatively low coefficient of determination (R² = 0.10), indicating limited explanatory power. Valence itself was a significant predictor (B = 2.56, p < 0.001).
Subsequently, eight separate linear regressions were conducted, each using one of the latent profiles as a single independent variable predicting price. The results of these regressions are presented in Table 4.
As shown in Table 4, all of the latent profiles were statistically significant predictors of stock price, although two profiles demonstrated equal or lower predictive power compared to the direct valence-based model.
Next, a multiple linear regression was conducted, in which all eight profiles were included simultaneously as independent variables, with stock price as the dependent variable. This model yielded a substantially higher coefficient of determination (R² = 0.47), indicating a marked improvement in explanatory power compared to the valence-only model. The results of this regression are summarized in Table 5.
As presented in Table 5, nearly all profile variables were statistically significant predictors of stock price. The only exception was Profile 2, which did not reach statistical significance.
These findings suggest that incorporating the full set of latent sentiment profiles substantially enhances the model’s predictive power compared to relying solely on valence as a single sentiment indicator.
Figure 3 displays the relationship between sentiment scores (VALENCE) and stock price changes across profiles.
Figure 3 displays scatter plots that separate data points by latent profile, highlighting distinct patterns in the sentiment–stock price relationship for each group.
Profile 1 is characterized by a predominantly positive sentiment (‘VALENCE’ variable), with articles reflecting generally optimistic tones. The corresponding stock price changes (‘Change’ variable) show a clear positive trend, indicating a strong association between positive sentiment and stock price increases. Profile 1 yielded a high positive coefficient in the regression analysis, underscoring its importance as a key driver in sentiment-based stock price forecasting.
Profile 2 includes mixed sentiment scores, leaning toward neutral or slightly negative tones. The associated stock price changes exhibit a weak and inconsistent pattern, with both increases and decreases observed. This profile demonstrates minimal predictive power and was the only one not found to be statistically significant in the regression model. These results suggest that Profile 2 represents market scenarios with little to no sentiment effect.
Profile 3 contains a balanced distribution of positive and neutral sentiment. Stock price changes associated with this profile reveal a moderate positive trend, suggesting a noticeable—though not strong—influence of sentiment on market movements. Profile 3 contributes moderately to the predictive model.
Profile 4 exhibits a predominantly neutral sentiment, with sentiment scores clustering around the midpoint or slightly negative. Stock price changes show minimal variation and lack a clear directional trend. Accordingly, Profile 4 contributes the least to the predictive power of the model.
Profile 5 is marked by a broadly positive sentiment and is associated with a moderate positive trend in stock price changes. Regression results highlight Profile 5 as a significant and consistent predictor, reinforcing its role in enhancing model performance.
Profile 6 reflects a mix of neutral to slightly negative sentiment, but the associated stock price changes show a strong negative trend. Despite relatively mild sentiment scores, the resulting sharp declines in stock prices make Profile 6 one of the strongest predictors in the model, with a prominent negative regression coefficient.
Profile 7 also displays largely neutral sentiment. Like Profile 4, stock price movements here are inconsistent and minimal. This profile demonstrates weak predictive value and contributes little to the regression model.
Profile 8 presents an intriguing case: sentiment scores are overwhelmingly positive, yet stock price changes show a strong negative trend. This contradiction highlights the complexity of sentiment-driven forecasting, where even highly positive sentiment may coincide with adverse market reactions. Profile 8 had one of the strongest negative coefficients in the regression analysis.
Finally, a Structural Equation Model (SEM) was constructed, controlling for the number of words across all profiles. Profile 2 was excluded from this model due to its lack of statistical significance. Correlations were specified among all remaining profiles, based on the assumption that each reflects a portion of the broader valence construct.
The hypothesized model demonstrated excellent fit: p > 0.05, CFI = 1, NFI = 1, TLI = 0.99, RMSEA = 0.02.
All relationships were significant at p < 0.001, except for the effect of NumWords on Profile 3, which was significant at p < 0.01.
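For readers less familiar with these indices, the following sketch shows how CFI, NFI, TLI, and RMSEA are conventionally computed from the chi-square statistics of the fitted model and of a baseline (independence) model. The function name and the input values are hypothetical and are not the statistics of this study; note also that some software uses N rather than N − 1 in the RMSEA denominator.

```python
import math

def fit_indices(chi2_model, df_model, chi2_base, df_base, n_obs):
    """Conventional SEM fit indices from model and baseline chi-square statistics."""
    # Excess of chi-square over its degrees of freedom (non-centrality estimate).
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model, 0.0)

    # CFI: relative reduction in non-centrality versus the baseline model.
    cfi = 1.0 - d_model / d_base if d_base > 0 else 1.0
    # NFI: relative reduction in raw chi-square versus the baseline model.
    nfi = (chi2_base - chi2_model) / chi2_base
    # TLI: like NFI but based on chi-square/df ratios, penalizing model complexity.
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    tli = (ratio_base - ratio_model) / (ratio_base - 1.0)
    # RMSEA: non-centrality per degree of freedom and per observation.
    rmsea = math.sqrt(d_model / (df_model * (n_obs - 1)))
    return {"cfi": cfi, "nfi": nfi, "tli": tli, "rmsea": rmsea}

# Hypothetical inputs, chosen only to exercise the formulas.
idx = fit_indices(chi2_model=15.0, df_model=10, chi2_base=1000.0, df_base=20, n_obs=501)
for name, value in idx.items():
    print(f"{name.upper()}: {value:.3f}")
```

Values of CFI, NFI, and TLI close to 1 and an RMSEA close to 0 are what is usually described as good fit, which is the pattern reported above.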
In summary, the SEM was employed to evaluate the overall model fit and the interrelations among the latent sentiment profiles, under the assumption that they collectively represent dimensions of the broader valence construct. Profile 2 was excluded due to its lack of statistical significance in prior analyses. The model was specified using standard SEM libraries, and model fit was assessed using commonly accepted indices (CFI, NFI, TLI, RMSEA), as reported in Section 4.
Figure 4 presents the full structural equation model with standardized coefficients.
Table 6 provides a comparative summary of the sentiment analysis methods, prediction targets, and explained variance (R²) reported in prior studies, alongside the results of the current study.
5. Discussion
The findings of this study underscore the enhanced predictive capabilities of latent profile analysis (LPA) when integrated with sentiment analysis for stock price forecasting. By segmenting sentiment data into eight distinct latent profiles, this study demonstrates a significant improvement in predictive performance compared to traditional sentiment analysis approaches. Specifically, the segmented regression model achieved a substantially higher coefficient of determination (R² = 0.47) than the direct sentiment model, highlighting the value of uncovering hidden sentiment structures that are otherwise undetectable.
Importantly, not all sentiment profiles contributed equally to stock price prediction. Profiles 6 and 8 exhibited the strongest negative coefficients, indicating that these sentiment patterns are associated with pronounced stock price declines. In contrast, Profile 1 showed a strong positive association, reflecting sentiment linked to market gains. These findings suggest that market reactions to sentiment are non-uniform, shaped by nuanced emotional configurations identified through LPA. Profiles 6 and 8, for example, may capture coverage connected to financial losses, regulatory actions, or broader uncertainty, factors that can amplify pessimistic market behavior even when measured valence is only mildly negative (Profile 6) or strongly positive (Profile 8).
These insights offer practical applications. For investors, a rise in the share of articles matching patterns such as Profiles 6 and 8 can serve as an early warning, prompting risk-averse strategies such as reducing exposure or hedging. For companies, recognizing the market’s reaction to sentiment clusters can inform communication strategies aimed at mitigating adverse reactions.
Recent studies have emphasized the importance of advanced sentiment analysis techniques in improving the predictive accuracy of financial forecasting models. For instance, Araci [
46] introduced FinBERT, a BERT-based language model fine-tuned for financial sentiment analysis, which demonstrated superior classification performance on financial texts. Similarly, Zou, Zhao [
47] reviewed a wide range of deep learning techniques applied to stock market prediction, highlighting their improved accuracy over traditional models. Furthermore, Jiang [
48] provided a comprehensive review of recent advances in applying deep learning to stock forecasting, identifying both opportunities and challenges in the field. These studies underscore the relevance of our approach, which combines sentiment analysis with latent profile modeling to better capture the complex emotional patterns embedded in financial news.
One particularly noteworthy and counterintuitive finding is the strong negative relationship observed in Profile 8, despite the profile being characterized by consistently high positive sentiment scores. Several potential explanations may account for this anomaly. First, it is possible that a highly optimistic sentiment in financial news reflects overcompensation or public relations efforts during periods of internal crisis or negative fundamentals, which investors may interpret with skepticism. In such cases, positive language may be perceived as lacking credibility, thus triggering negative market reactions. Second, Profile 8 may correspond to instances of “irrational exuberance” or sentiment inflation, where excessive optimism signals a market peak, prompting corrections or profit-taking behaviors. Additionally, the source and tone of the articles in this profile—such as overly promotional content or repetitive positivity—might have contributed to a divergence between textual sentiment and market perception. These findings underscore the importance of considering contextual and psychological factors when interpreting sentiment data and highlight the added value of LPA in detecting such nuanced and complex sentiment–price dynamics.
The study also highlights a key limitation of conventional sentiment analysis approaches, which typically rely on aggregated sentiment scores and fail to account for the complexity and variability inherent in textual sentiment. By applying LPA, this research introduces a probabilistic and profile-based framework that captures uncertainty and supports more robust and context-sensitive predictions. Moreover, the inclusion of control variables—such as the number of words per article—further improves model precision and demonstrates its flexibility for application across varied textual datasets.
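The probabilistic profile membership mentioned above can be sketched as follows, assuming the common formulation of LPA as a Gaussian mixture whose indicators are independent within each profile. The two indicators, the three profiles, and all parameter values are hypothetical; in practice the parameters are estimated from the data (for example, by expectation-maximization), a step omitted here.

```python
import math

def posterior_membership(x, weights, means, variances):
    """Posterior probability of each profile for observation x.

    Assumes a Gaussian mixture with independent indicators within each profile.
    """
    log_joint = []
    for w, mu, var in zip(weights, means, variances):
        # log(prior weight) + sum of univariate Gaussian log-densities.
        ll = math.log(w)
        for xi, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        log_joint.append(ll)
    # Normalize in a numerically stable way (subtract the maximum before exponentiating).
    top = max(log_joint)
    unnorm = [math.exp(l - top) for l in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical parameters: three profiles described by two sentiment indicators.
weights = [0.5, 0.3, 0.2]                         # profile proportions
means = [[0.6, 0.1], [0.0, 0.0], [-0.5, 0.4]]     # profile-specific indicator means
variances = [[0.04, 0.04], [0.04, 0.04], [0.04, 0.04]]

probs = posterior_membership([0.55, 0.12], weights, means, variances)
for k, p in enumerate(probs, start=1):
    print(f"profile {k}: {p:.4f}")
```

Unlike a hard assignment such as K-means, each article receives a probability for every profile, so uncertain cases can be carried into downstream models as weights rather than forced into a single group.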
These findings hold meaningful implications for financial forecasting. Traders, analysts, and institutional investors can benefit from LPA-based sentiment segmentation, gaining deeper insights into the diverse ways market sentiment influences stock prices. This level of granularity enables the design of more targeted trading strategies that account for heterogeneity in market responses.
These insights not only contribute to the academic understanding of sentiment-driven market behavior but also offer a practical framework that financial professionals can apply to improve forecasting accuracy, risk assessment, and communication strategies in real-world investment contexts.
Contribution and Novelty of the Study
This study offers several novel contributions, both methodological and theoretical, that extend the scope of existing research:
Novel Intersection of Domains:
The study introduces a new intersection between advanced clustering techniques and financial forecasting. While LPA has been used in psychological and social science contexts, its application to stock market prediction remains underexplored. This research advances the field by moving beyond aggregated sentiment scores and demonstrating how latent sentiment profiles can uncover previously hidden patterns that are critical for understanding stock price movements.
Probabilistic Forecasting Framework:
The integration of LPA into stock price prediction represents a methodological advancement. Traditional models often assume linear relationships between sentiment and market behavior, but LPA accounts for the non-linear, uncertain, and context-dependent nature of sentiment data. By identifying latent structures, this approach offers a richer and more actionable interpretation of how sentiment drives financial outcomes.
Capturing Multi-Dimensional Sentiment Dynamics:
This study introduces the ability to detect multi-dimensional sentiment patterns that reflect diverse market conditions—such as periods of high volatility, crisis, or stability. This nuanced segmentation enhances the explanatory and predictive power of sentiment-based models, enabling differentiation between sentiment responses to varying economic contexts.
Cross-Disciplinary Methodological Innovation:
This research bridges a methodological gap by applying LPA—traditionally confined to psychology and sociology—to a financial context. This cross-disciplinary application underscores LPA’s versatility and opens the door for incorporating other advanced clustering or machine learning techniques into financial forecasting, setting a precedent for further innovation.
Integration of Multiple Theoretical Frameworks:
This study offers a comprehensive conceptual foundation by integrating four key theoretical perspectives: Efficient Market Hypothesis (EMH), Behavioral Finance Theory, Sentiment Theory, and Signal Theory.
By combining these lenses, this research bridges rationalist views of market efficiency, psychological insights into investor behavior, and communicative interpretations of sentiment. This integrated framework enhances the theoretical depth of the analysis and improves its explanatory power, contributing both to theoretical advancement and practical financial analytics.
6. Limitations and Future Studies
While this study demonstrates the potential of latent profile analysis (LPA) in enhancing sentiment-based stock price prediction, several limitations should be acknowledged.
First, this analysis is based solely on articles related to Teva Pharmaceutical Industries, which served as a case study. This narrow focus limits the generalizability of the findings to other companies, sectors, or markets. Future research could apply the proposed methodology to a wider range of firms, industries, or even market indices to evaluate its robustness and relevance in diverse financial contexts.
Second, the dataset is limited to the period 2000–2018. This timeframe was deliberately selected to avoid contamination from extraordinary events such as the COVID-19 pandemic, large-scale legal proceedings, or atypical market interventions. However, we acknowledge that external market events—such as earnings announcements, regulatory decisions, and global financial crises—may act as confounding factors that influence both sentiment expression and stock price behavior. While this study sought to minimize such effects by focusing on a relatively stable historical window, it is not possible to fully disentangle all external influences. Future research could incorporate event-based controls, filtering, or longitudinal designs to more explicitly account for such exogenous shocks and to examine how investor sentiment evolves in periods of heightened uncertainty.
Third, a potential source of bias lies in the selection of news sources. This study relied exclusively on articles from Bloomberg and Reuters, two highly reputable financial news outlets. While these sources provide reliable and high-quality reporting, their editorial tone, framing, and audience orientation may systematically differ from other financial or popular media platforms. As a result, the sentiment patterns captured in this dataset may not fully represent the diversity of perspectives found in alternative outlets such as financial blogs, social media, or regional news sources. Additionally, restricting the analysis to the 2000–2018 period introduces temporal bias, as it excludes more recent developments and structural shifts in media consumption and market behavior—such as the growing influence of real-time sentiment from platforms like Twitter or Reddit. These factors may influence both the nature of sentiment expression and its market impact and should be taken into account when interpreting the generalizability of the findings. Future research could address these limitations by incorporating a broader range of sources and expanding the time horizon to capture evolving sentiment mechanisms in financial markets.
Fourth, the latent profiles identified in this study are shaped by the specific modeling parameters used. This may limit the adaptability of the profiles to other datasets or contexts. Exploring alternative clustering techniques or validating the profiles across different sectors, time periods, or economic environments could improve both the robustness and generalizability of the approach.
Another promising direction is the incorporation of multimodal data sources. Integrating sentiment signals from social media, news videos, or trading volume data could enrich the analysis and enhance predictive accuracy. Such integration would enable a more holistic understanding of market sentiment and its influence on financial outcomes.
Moreover, the weaker predictive performance observed for certain profiles suggests that additional contextual features—such as tone, source credibility, or media outlet—may moderate the sentiment–price relationship. Future studies could embed these factors into the LPA framework to further refine model performance and improve explanatory power.
Teva was selected as the case study due to its global market presence, historically volatile stock behavior, and the availability of a large volume of sentiment-laden media coverage. Because this single-company scope limits generalizability, future research should apply the proposed methodology to other companies across different industries and market contexts. Validating the latent sentiment profiles and their predictive power in sectors such as technology, energy, or finance, and across broader indices, would help assess the robustness, adaptability, and scalability of the approach. Such extensions could confirm the model’s applicability in diverse market environments and enhance its utility for broader financial forecasting applications.
While VADER was originally developed for sentiment analysis in social media contexts, it has also been applied in multiple studies analyzing financial news and formal textual data (e.g., [49]) due to its simplicity, transparency, and strong baseline performance. In this study, VADER was chosen for its valence-based scoring, its rule-based heuristics (e.g., punctuation, capitalization, degree modifiers), and its interpretability when applied across thousands of articles. However, we acknowledge that VADER may have limitations in capturing domain-specific financial language, such as technical jargon, nuanced tone, or forward-looking statements often found in Bloomberg and Reuters articles. Future studies could explore complementing or replacing VADER with domain-adapted sentiment tools. Such enhancements may improve sentiment precision and further strengthen predictive modeling in this domain.
The primary objective of this study was not to compare clustering techniques, but rather to explore the application of latent profile analysis (LPA) as a novel framework for modeling sentiment structures and enhancing stock price prediction. LPA was chosen specifically for its probabilistic foundation, ability to handle uncertainty in profile membership, and compatibility with regression and SEM-based modeling. While a comparative analysis of clustering algorithms is beyond the scope of this study, we acknowledge its potential value and suggest it as an avenue for future research.
Finally, while this study focuses on stock price prediction, the methodology could be extended to other financial indicators, such as market volatility, trading volume, or credit risk. Further research may explore whether combining LPA with sentiment analysis can uncover patterns relevant to macroeconomic forecasting or cross-market dynamics.
By addressing these avenues, future studies can build on the current research to advance the integration of sentiment analytics, latent profiling, and financial forecasting.