1. Introduction
In the era of real-time information flows, financial news plays a pivotal role in shaping investor sentiment and influencing short-term fluctuations in stock prices. Early studies have empirically shown that the tone and amount of media coverage can influence market trends and investor behavior [
1]. Compared with traditional quantitative indicators such as historical prices, trading volume, or technical indices, financial texts frequently provide early warning signals that anticipate market reactions before they are reflected in numerical data. This phenomenon is particularly pronounced in emerging markets, where timely news reports can trigger rapid investor responses and directly shape short-term investment behavior. Consequently, text-based approaches to financial forecasting have become a growing focus in both academic research and practical financial applications.
Although natural language processing (NLP) techniques have been widely applied to financial forecasting, most prior studies exhibit three critical limitations. First, the majority of research has concentrated on English-language corpora, especially those from U.S. and European markets [
2,
3]. Models trained under such conditions often fail to generalize effectively to non-English contexts, where linguistic structures, segmentation issues, multiple meanings, and specialized financial terms differ substantially. In the case of Chinese financial texts, these challenges are further compounded by the lack of large-scale annotated data, limiting the ability of existing approaches to achieve stable performance.
While limited data remain a challenge, the connection between textual sentiment and market behavior is widely recognized. To illustrate this relationship,
Figure 1 shows a simple example: three short news headlines are marked as positive, neutral, or negative, corresponding respectively to rising, stable, and falling stock trends. This example demonstrates how the tone of financial news reflects investor expectations and affects short-term market reactions.
Second, previous works have typically treated sentiment classification and stock price prediction as separate tasks [
4,
5,
6,
7,
8]. While sentiment analysis is commonly used to assess market emotions and regression models are employed for numerical prediction, the lack of integration between these two tasks has hindered a deeper understanding of how signals derived from news actually translate into predictive value for financial markets. Most studies adopt a two-step process, where sentiment scores or indices are first generated from financial texts and then used as inputs for later forecasting models, such as the BERT–LSTM systems developed by Hiew et al. [
9] and Gu et al. [
10]. Although these methods demonstrate that sentiment information is beneficial for predicting market trends, they generally stop at feature-level fusion rather than performing end-to-end learning that directly connects qualitative sentiment with quantitative price changes.
Recently, some studies have tried to close this gap by modeling textual sentiment and market data together within unified or multimodal systems. For instance, Ho et al. [
11] built a cooperative network that merged social sentiment (via CNN) and price charts (via image-based CNN) to predict stock movement. Nguyen et al. [
12] proposed DASF-Net, a graph-based framework that adapts sentiment signals to stock relationships. Koval et al. [
13] introduced a multimodal model with separate experts for text and time-series data to improve forecasting. Third, despite these advances, few works have provided a broad evaluation of different text representations and model types under the same setting, particularly for non-English financial data. This motivates the present study to design a unified framework that examines how sentiment features, multimodal strategies, and prediction architectures jointly affect forecasting performance.
In this study, sentiment analysis refers to examining opinions and emotions in text, while sentiment classification means grouping texts into positive or negative categories. Based on these ideas, this work investigates three questions: (i) whether contextual embeddings offer clear advantages over traditional features in sentiment classification, (ii) to what extent sentiment from news helps short-term stock prediction, and (iii) how different levels of market volatility influence this relationship. These questions guide the design of our experiments.
To address these challenges, this study proposes a comprehensive dual-task framework that integrates binary sentiment classification and short-term stock price regression using Chinese financial news. The dataset comprises 38,918 news articles covering Taiwan’s top 50 listed companies (the Taiwan 0050 index) between 2021 and 2023, along with corresponding stock market data. This corpus not only reflects the real-world dynamics of Taiwan’s capital market but also provides a valuable benchmark for testing language-based models in non-English environments. To ensure annotation reliability, sentiment labels were manually assigned by annotators with finance backgrounds and validated through a double-checking process.
Within this framework, five representative types of word embeddings were systematically investigated: one-hot encoding, TF-IDF, Word2Vec (continuous bag-of-words and skip-gram) [
14], and contextual embeddings derived from BERT [
15]. These were combined with 17 prediction models across four categories: (1) traditional machine-learning classifiers such as Naïve Bayes (NB), logistic regression (LR), and random forest (RF); (2) deep neural models such as convolutional neural networks (CNNs) [
16] and long short-term memory networks (LSTMs) [
17]; (3) transformer-based encoders such as BERT [
15], RoBERTa [
18], and BART [
19]; and (4) a large language model (LLaMA3) [
20], fine-tuned on Chinese financial data. By adopting a unified experimental setup, the study provides a detailed comparison of model performance across both sentiment classification and regression tasks.
This work further incorporates multiple forecasting horizons and a volatility-aware analysis. In addition to five-day prediction windows, fifteen-day horizons are examined to evaluate the stability of models across varying time scales. Moreover, the constituent stocks of the Taiwan 0050 index are grouped by volatility, using the coefficient of variation (CV) of closing prices as a normalized metric. This design enables a detailed assessment of how predictive accuracy varies between low-volatility (stable) and high-volatility (uncertain) stocks—an aspect of particular importance for real-world investment strategies.
In summary, this paper makes three contributions:
It introduces an integrated dual-task framework that combines financial sentiment classification and short-term stock price forecasting, representing one of the most extensive evaluations of Chinese financial texts from Taiwan’s top 50 listed companies.
It provides comprehensive insights into the relative strengths of different paradigms, including traditional machine learning, deep neural networks, transformers, and large language models, by systematically comparing five embedding methods and seventeen model architectures under a unified protocol.
It explores how sentiment-based predictions relate to actual market behavior by analyzing model performance across time horizons and volatility groups, supported by case studies of firms such as TSMC and Alchip.
The remainder of this article is structured as follows:
Section 2 reviews related work;
Section 3 details the proposed framework and methodology;
Section 4 presents the experimental setup and results;
Section 5 discusses and interprets the results; and
Section 6 concludes with key findings and future directions.
3. Methodology
3.1. Problem Formulation
This study addresses two interconnected predictive tasks: binary sentiment classification of financial news and short-term stock price forecasting. Together, these tasks offer a comprehensive understanding of how textual information influences financial decision making. For the sentiment classification task, each financial news article $d_i$ is treated as an input document. Formally, we assume $d_i \in \mathcal{D}$, where $\mathcal{D}$ denotes the space of tokenized and preprocessed news texts represented as sequences of words or embeddings. In practice, $\mathcal{D}$ can be instantiated as vectorized representations, such as TF-IDF features or pre-trained embeddings, depending on the model used. The objective is to assign a binary sentiment label $y_i \in \{0, 1\}$, where 1 denotes positive sentiment (optimistic outlook or favorable signals) and 0 represents negative sentiment (pessimistic outlook or unfavorable signals). Formally, the classifier learns a mapping function:

$$f_{\mathrm{cls}}: \mathcal{D} \rightarrow \{0, 1\}, \qquad \hat{y}_i = f_{\mathrm{cls}}(d_i)$$
For the regression task, the aim is to predict the future closing price of a stock. Let $p_t \in \mathbb{R}_{\geq 0}$ denote the closing price on day $t$, where $\mathbb{R}_{\geq 0}$ is the set of non-negative real numbers. Given financial news and stock data up to day $t$, denoted as $\mathcal{X}_t$, the regression model forecasts the closing price at day $t + h$, where $h \in \{5, 15\}$ represents short-term (five-day) and mid-term (fifteen-day) horizons:

$$f_{\mathrm{reg}}: \mathcal{X}_t \rightarrow \mathbb{R}_{\geq 0}, \qquad \hat{p}_{t+h} = f_{\mathrm{reg}}(\mathcal{X}_t)$$
The choice of closing price is motivated by its widespread use as a benchmark indicator of market performance, as it reflects the outcome of intraday trading and represents the market’s consensus valuation at the end of the day. Prior studies have also adopted the closing price as the primary target variable in stock forecasting tasks due to its stability and interpretability [
42,
51]. Compared to opening or intraday prices, the closing price reduces the influence of short-lived fluctuations and aligns naturally with the daily release of financial news. To capture short- and near-term market reactions, two prediction horizons were set: five days and fifteen days. The 5-day horizon reflects immediate sentiment-driven movements, while the 15-day horizon examines the persistence of short-term trends. Such settings are consistent with prior stock forecasting studies that employ short- to medium-term horizons within 5–20 trading days [
52,
53,
54]. This horizon design enables the evaluation of how predictive performance varies with time scale—shorter horizons align more closely with sentiment signals, and longer ones reflect broader market dynamics.
Building on prior findings that textual sentiment conveys investors’ expectations beyond numerical indicators, previous studies [
55,
56] have demonstrated that integrating textual and quantitative features can enhance the predictive performance of stock forecasting models. Accordingly, we hypothesize that combining sentiment-derived textual representations with numerical stock variables yields superior short-term forecasting accuracy compared to using either modality in isolation.
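To make this hypothesis concrete, the sketch below concatenates a sentence-level text vector with numerical stock features before fitting a simple regressor. The concatenation strategy, feature layout, and toy values are illustrative assumptions, not the exact fusion design used in the experiments.

```python
# Minimal sketch of the feature-fusion hypothesis: text vector + numerical stock
# features, concatenated and fed to a simple regressor (all values are toy data).
import numpy as np
from sklearn.linear_model import BayesianRidge

def fuse(text_vec: np.ndarray, stock_vec: np.ndarray) -> np.ndarray:
    """Concatenate the textual and numerical modalities into one feature vector."""
    return np.concatenate([text_vec, stock_vec])

rng = np.random.default_rng(0)
text_vecs = rng.normal(size=(4, 768))                    # e.g., BERT sentence vectors
stock_vecs = np.array([[600, 610, 595, 605, 2.1e7],      # open, high, low, close, volume
                       [605, 612, 600, 610, 1.8e7],
                       [610, 618, 606, 615, 1.5e7],
                       [615, 620, 608, 612, 1.7e7]], dtype=float)
X = np.stack([fuse(t, s) for t, s in zip(text_vecs, stock_vecs)])
y = np.array([612.0, 618.0, 620.0, 616.0])               # toy closing prices at t+5

BayesianRidge().fit(X, y)                                 # placeholder regressor
```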
3.2. System Architecture
The proposed framework consists of five sequential stages—data collection, preprocessing, word embedding representation, model training and validation, and prediction and evaluation—integrated into an end-to-end workflow for financial news analysis and stock price forecasting (
Figure 2). During the data collection stage, financial news was obtained from the CMoney platform (
https://www.cmoney.tw/, accessed on 15 August 2024), and daily stock data for Taiwan’s top 50 listed companies were retrieved from Yahoo Finance (
https://finance.yahoo.com/, accessed on 15 August 2024). Each article was paired with the corresponding company and trading date to ensure consistency between textual and numerical data. Preprocessing then reduced news texts to concise representations, removed extraneous noise, and applied segmentation to preserve the most informative content for subsequent analysis. For representation, five embedding approaches were employed to transform text into numerical vectors, thereby capturing lexical, semantic, and contextual information. During model training and validation, a diverse set of machine learning, deep learning, and transformer-based encoders was evaluated under a unified protocol, together with a large language model, LLaMA3 [
20], which was fully fine-tuned on the Chinese financial news dataset. Finally, model outputs were assessed using classification and regression metrics, with further analyses conducted across forecasting horizons and volatility groups to evaluate robustness.
3.3. Data Collection, Labeling, and Preprocessing
The dataset integrates Chinese financial news articles with corresponding stock market data, forming parallel resources for both classification and regression tasks. Financial news was obtained from CMoney, a widely used financial information platform in Taiwan, while stock market data—including open, high, low, close, and adjusted close prices—were retrieved from Yahoo Finance. The corpus covers reports related to Taiwan’s top 50 listed companies (the Taiwan 0050 index) from January 2021 to December 2023. Each news article was aligned with the corresponding company and trading date to ensure consistency between textual and numerical data. In total, the labeled corpus comprised 38,918 financial news articles, of which 18,453 were annotated as negative, reflecting pessimistic or downward-oriented sentiment, and 20,465 as positive, reflecting optimistic or upward-oriented sentiment. This balanced distribution between positive and negative samples provides a robust foundation for training and evaluating sentiment classification models.
Two task-specific datasets were constructed: a classification dataset (C) that pairs each article with a sentiment label, and a regression dataset (R) that links articles with their future closing prices. To avoid bias from duplicated reports within the same trading day, only the final article per company per day was retained in the regression dataset. The regression target corresponds to the closing price on the 5th and 15th trading day after the news release. Sentiment annotation was conducted by undergraduate students with finance backgrounds. Each article was assigned a binary label (1 for positive sentiment, such as favorable performance, expansion, or investment; 0 for negative sentiment, such as financial loss, risk exposure, or policy restrictions). A double-checking mechanism was implemented, whereby disagreements were resolved through consensus, resulting in a reliable corpus for supervised learning.
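A minimal sketch of how the regression dataset (R) can be assembled is shown below. The column names and toy rows are illustrative assumptions, but the two documented rules are applied: only the final article per company per day is kept, and the targets are the closing prices on the 5th and 15th trading day after the news release.

```python
# Sketch of Dataset R construction from aligned news and price tables (toy data).
import pandas as pd

news = pd.DataFrame({
    "ticker": ["2330", "2330"],
    "date": pd.to_datetime(["2021-03-01", "2021-03-01"]),
    "text": ["台積電早盤消息……", "台積電尾盤消息……"],
})
prices = pd.DataFrame({
    "ticker": ["2330"] * 20,
    "date": pd.bdate_range("2021-03-01", periods=20),
    "close": [600.0 + i for i in range(20)],
})

# Keep only the final article per company per trading day (avoids duplicate-day bias).
news = news.groupby(["ticker", "date"]).tail(1)

# Targets: closing price on the 5th and 15th trading day after the news release.
prices = prices.sort_values(["ticker", "date"])
prices["close_t5"] = prices.groupby("ticker")["close"].shift(-5)
prices["close_t15"] = prices.groupby("ticker")["close"].shift(-15)

dataset_r = news.merge(prices, on=["ticker", "date"], how="inner")
```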
To prepare the texts for modeling, a three-step preprocessing procedure was adopted: reduction, cleaning, and segmentation. First, only the opening sentence of each article was retained, using the first occurrence of the Chinese full stop as the segmentation boundary. Prior studies in journalism and natural language processing show that the opening sentence or headline of financial news typically conveys the article’s main message [
57,
58]. Moreover, the Chinese full stop denotes the completion of an independent idea, while commas connect related clauses [
59], supporting this strategy as both efficient and semantically coherent. This reduction step shortened articles from an average of approximately 640 characters to 96 characters while maintaining the main summary of each news item. Second, cleaning operations were applied to remove extraneous content such as punctuation, numbers, and special symbols. Third, word segmentation was performed using the Jieba toolkit, which was supplemented with an extended, domain-specific lexicon to improve the recognition of company names and financial terminology. The resulting preprocessed corpus was then transformed into numerical representations in the embedding stage.
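The three preprocessing steps can be sketched as follows; the user-dictionary filename and the example article are illustrative assumptions.

```python
# Sketch of the reduction-cleaning-segmentation pipeline described above.
import re
import jieba

# jieba.load_userdict("finance_lexicon.txt")  # hypothetical extended lexicon of
#                                             # company names and financial terms

def preprocess(article: str) -> list[str]:
    # 1) Reduction: keep only the text before the first Chinese full stop.
    first_sentence = article.split("。", 1)[0]
    # 2) Cleaning: drop punctuation, numbers, and special symbols.
    cleaned = re.sub(r"[^\u4e00-\u9fffA-Za-z]", "", first_sentence)
    # 3) Segmentation with Jieba.
    return jieba.lcut(cleaned)

tokens = preprocess("台積電第一季營收優於預期，法人看好後市。分析師指出其展望仍佳。")
```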
3.4. Word Embedding
A critical step in natural language processing is transforming unstructured text into numerical representations that can be processed by machine learning and deep learning models. In this study, five representative embedding approaches were employed to capture different levels of lexical and semantic information from Chinese financial texts. These methods were selected not only for their prevalence in prior financial sentiment research but also for their ability to illustrate the trade-offs among sparse, dense, and contextualized representations.
One-hot encoding, the most basic representation, assigns each unique token a binary vector with a single dimension set to one and all others set to zero. While straightforward and easy to interpret, this method suffers from extreme sparsity and captures no semantic relationships between words; it was therefore included primarily as a baseline for comparison. TF-IDF improves upon this by weighting word frequency against document distribution, producing sparse yet informative vectors that suit traditional classifiers such as logistic regression or support vector machines. However, TF-IDF still cannot capture word order or contextual meaning.
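For illustration, both sparse representations can be produced with scikit-learn as sketched below; the toy documents are assumed to be pre-segmented and space-joined, and the relaxed token pattern is an assumption made to keep single-character Chinese tokens.

```python
# Sketch of the two sparse representations (one-hot and TF-IDF) on segmented text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["台積電 營收 創 新高", "面板 廠 虧損 擴大"]          # toy segmented examples
token_pattern = r"(?u)\b\w+\b"                               # keep single-character tokens

onehot = CountVectorizer(binary=True, token_pattern=token_pattern).fit_transform(docs)
tfidf = TfidfVectorizer(token_pattern=token_pattern).fit_transform(docs)
```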
To address these limitations, dense embeddings were generated using Word2Vec [
14]. Both the CBOW and skip-gram architectures were trained on the financial corpus. CBOW predicts a target word based on its surrounding context and is particularly efficient for frequent terms. In contrast, skip-gram predicts surrounding words given a target, capturing rare and domain-specific vocabulary more effectively. These dense embeddings yield compact vector spaces in which semantically related words are positioned closer together, providing richer input for deep neural models such as CNNs and LSTMs.
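A minimal gensim sketch of the two Word2Vec variants is given below; the vector dimension, window size, and toy corpus are illustrative choices rather than the reported settings.

```python
# Sketch of training CBOW and skip-gram embeddings on a segmented corpus.
from gensim.models import Word2Vec

corpus = [["台積電", "營收", "創", "新高"], ["面板", "廠", "虧損", "擴大"]]

cbow = Word2Vec(corpus, vector_size=300, window=5, sg=0, min_count=1)      # CBOW
skipgram = Word2Vec(corpus, vector_size=300, window=5, sg=1, min_count=1)  # skip-gram

vec = skipgram.wv["營收"]   # dense 300-d vector for a token
```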
Finally, contextualized embeddings were obtained using BERT (Bidirectional Encoder Representations from Transformers) [
15]. Unlike static embeddings, BERT produces word representations that adapt dynamically to surrounding tokens. This property is fundamental in financial texts, where the meaning of terms often shifts depending on context (e.g., “margin” in accounting versus trading). In this study, a Chinese BERT model was employed to capture nuanced semantics and long-range dependencies, thereby achieving state-of-the-art performance in sentiment classification and related tasks.
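The sketch below extracts contextualized sentence vectors from the Chinese BERT checkpoint named in Section 3.5 (ckiplab/bert-base-chinese); mean pooling over the final hidden states is an illustrative pooling choice, not necessarily the one used in the experiments.

```python
# Sketch of contextualized sentence embeddings with a Chinese BERT encoder.
import torch
from transformers import AutoModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # tokenizer suggested for ckiplab checkpoints
model = AutoModel.from_pretrained("ckiplab/bert-base-chinese")

texts = ["台積電第一季營收優於預期", "面板廠虧損擴大"]
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512,
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state              # (batch, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_vecs = (hidden * mask).sum(1) / mask.sum(1)         # mean-pooled 768-d vectors
```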
The inclusion of these five embedding methods was not intended merely as an independent application of each but rather as a systematic evaluation of their compatibility with different model paradigms. Sparse representations (one-hot, TF-IDF) were expected to align better with traditional machine learning algorithms due to their interpretability and linear separability. Dense embeddings (Word2Vec) were better suited for deep learning architectures, which can leverage continuous vector spaces to capture compositional semantics. Contextualized embeddings (BERT) were anticipated to yield advantages in transformer-based models, where contextual cues are critical for classification and forecasting. By comparing all embedding–model combinations under a unified evaluation pipeline, this study aims to identify the most effective representation strategies for Chinese financial sentiment classification and stock price forecasting. It is worth noting that LLaMA3 [
20], while inherently generating contextualized embeddings through its transformer layers, was not paired with external embedding methods. Instead, it was fine-tuned as an end-to-end predictive model, and its performance was evaluated alongside the other embedding-model combinations.
3.5. Model Training
To systematically evaluate the performance of different learning paradigms in financial sentiment classification and stock price forecasting, we implemented a comprehensive suite of models spanning traditional machine learning algorithms, deep neural architectures, transformer-based encoders, and a large language model. Each model was paired with the appropriate embedding technique and trained separately for both classification and regression tasks.
The traditional machine learning models included nine widely used algorithms: AdaBoost [
60], decision trees [
61], Naïve Bayes, gradient boosting [
62], k-nearest neighbors [
63], logistic regression [
64], multilayer perceptrons [
65], random forests [
66], and support vector machines [
67]. These models were trained on static word representations such as one-hot, TF-IDF, and Word2Vec embeddings, and were selected for their interpretability and computational efficiency, particularly for smaller-scale datasets.
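A condensed sketch of the evaluation loop for these classical learners is shown below, using the five-fold cross-validation and macro-averaged F1 described in Section 4.1; the toy corpus stands in for Dataset C.

```python
# Sketch of classical learners on sparse features under 5-fold CV with macro-F1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["營收 創 新高", "獲利 優於 預期", "訂單 強勁 成長", "展望 樂觀", "擴大 投資",
        "虧損 擴大", "營收 下滑", "需求 疲弱", "裁員 風險", "獲利 衰退"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]        # 1 = positive, 0 = negative

models = {"NB": MultinomialNB(),
          "LR": LogisticRegression(max_iter=1000),
          "RF": RandomForestClassifier(n_estimators=200)}

for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), clf)
    scores = cross_val_score(pipe, docs, labels, cv=5, scoring="f1_macro")
    print(name, round(scores.mean(), 3))
```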
To capture sequential and semantic information in financial texts, we further examined four deep neural networks: convolutional neural networks [
16], recurrent neural networks [
68], gated recurrent units [
69], and long short-term memory networks [
17]. Convolutional layers were used to detect local n-gram patterns, whereas recurrent structures were designed to capture temporal dependencies across token sequences. These models were primarily trained on dense Word2Vec embeddings, with dropout regularization and standard optimization strategies applied to mitigate overfitting.
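As an illustration, a minimal LSTM classifier over dense Word2Vec inputs is sketched below; the embedding and hidden dimensions are assumptions, while the batch size (32), sequence length (256), and dropout rate (0.2) follow Table 1.

```python
# Sketch of an LSTM sentiment classifier over pre-computed dense word embeddings.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, emb_dim=300, hidden=128, num_classes=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.dropout = nn.Dropout(dropout)      # dropout of 0.2, as in Table 1
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                       # x: (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(x)              # final hidden state
        return self.fc(self.dropout(h_n[-1]))

model = LSTMClassifier()
logits = model(torch.randn(32, 256, 300))       # batch 32, sequence length 256
```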
Transformer-based models were also incorporated, specifically BERT [
15], BART [
19], and RoBERTa [
18], which were fine-tuned on the labeled Chinese financial corpus. Pretrained models were obtained from Hugging Face, namely BERT (ckiplab/bert-base-chinese), RoBERTa (xlm-roberta-base), and BART (fnlp/bart-base-chinese). These models generated contextualized embeddings at the sentence- or document-level and passed them through task-specific heads for either classification or regression. Fine-tuning was conducted under controlled optimization settings, with early stopping applied to mitigate the risks of overfitting that arise from limited training data.
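The fine-tuning procedure can be sketched with the Hugging Face Trainer as follows; the learning rate and the toy data are assumptions, the batch size (16) and epoch count (5) follow Table 1, and early stopping is omitted for brevity.

```python
# Sketch of fine-tuning a Chinese BERT checkpoint for binary sentiment classification.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "ckiplab/bert-base-chinese", num_labels=2)

# Toy stand-in for Dataset C (text, 0 = negative, 1 = positive).
raw = Dataset.from_dict({"text": ["營收創新高", "虧損擴大"], "label": [1, 0]})
ds = raw.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                  padding="max_length", max_length=512), batched=True)

args = TrainingArguments(output_dir="bert_sentiment",
                         per_device_train_batch_size=16,
                         num_train_epochs=5,
                         learning_rate=2e-5,        # assumed value; not reported in the text
                         logging_steps=10)
Trainer(model=model, args=args, train_dataset=ds).train()
```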
Finally, we extended the comparison to a large language model by including LLaMA3 [
20], specifically the Llama-3.2-1B implementation available on Hugging Face (
https://huggingface.co/meta-llama/Llama-3.2-1B, accessed on 1 March 2025). Unlike the other models, which relied on external embeddings, LLaMA3 was fully fine-tuned on the Chinese financial news dataset, allowing all parameters to adapt end-to-end for both classification and regression. This design highlights its distinct role as an integrated predictive framework, evaluated alongside the embedding–model combinations.
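A minimal sketch of full-parameter fine-tuning of Llama-3.2-1B with a sequence-level head is given below; the pad-token handling, learning rate, and the regression variant shown in the comment are illustrative assumptions rather than the authors' exact recipe.

```python
# Sketch of full-parameter fine-tuning of Llama-3.2-1B with a classification head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token            # Llama defines no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Regression variant (closing-price target): a single output trained with MSE loss.
# model = AutoModelForSequenceClassification.from_pretrained(
#     name, num_labels=1, problem_type="regression")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # all parameters trainable
```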
To ensure fairness and reproducibility, all models were trained under standardized procedures in the same computational environment. Hyperparameters, including batch size, learning rate, sequence length, and training epochs, are summarized in
Table 1.
To further clarify the hyperparameter settings, several important considerations are noted. For the CNN, RNN, GRU, and LSTM models, the batch size was set to 32, balancing training stability with computational efficiency. In contrast, transformer-based models (BERT and LLaMA3) adopted smaller batch sizes (16 and 8, respectively) due to their larger parameter space and higher memory requirements. The number of epochs was standardized across tasks to ensure fairness in comparison: a smaller number for the classification task (5) to prevent overfitting on binary labels, and a larger number for the regression task (20) to capture long-term temporal dependencies. The maximum sequence length was fixed at 256 for recurrent models and LLaMA3, but extended to 512 for BERT to leverage its ability to process longer contexts. Learning rates were tuned according to best practices for each model family, with lower rates preventing catastrophic forgetting during training. Dropout regularization was applied only to deep neural architectures (set to 0.2), while transformer-based models relied on their intrinsic regularization mechanisms. Collectively, these standardized settings ensured reproducibility and fair cross-model comparisons.
4. Experiments
4.1. Experimental Setup and Evaluation Metrics
The experiments were designed to address four main objectives. First, we compared five types of word embeddings in combination with a broad set of machine learning, deep learning, and transformer-based models to evaluate their effectiveness in binary sentiment classification of Chinese financial news. Second, we examined the same model–embedding combinations in regression tasks, forecasting stock prices at both five-day and fifteen-day horizons. Third, we investigated the impact of prediction horizon length by contrasting the relative difficulty of short-term and mid-term predictions. Finally, we analyzed the prediction performance across volatility groups by dividing the constituent stocks of the Taiwan 0050 index into three categories—high, medium, and low volatility—based on the coefficient of variation (CV) of closing prices.
The CV was adopted as a normalized volatility measure to facilitate fair comparisons across companies with varying price levels. It is defined as the ratio of the standard deviation ($\sigma$) to the mean closing price ($\mu$), expressed as a percentage:

$$\mathrm{CV} = \frac{\sigma}{\mu} \times 100\%$$

Based on this metric, companies were stratified into three groups: high volatility (CV > 20%), medium volatility (10% ≤ CV ≤ 20%), and low volatility (CV < 10%).
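The stratification rule can be expressed directly as a short sketch; the toy price frame is illustrative.

```python
# Sketch of the CV-based volatility stratification using the stated thresholds.
import pandas as pd

prices = pd.DataFrame({
    "ticker": ["2330"] * 3 + ["3661"] * 3,
    "close":  [560.0, 575.0, 590.0, 700.0, 1100.0, 1500.0],   # toy closing prices
})

def volatility_group(closes: pd.Series) -> str:
    cv = closes.std() / closes.mean() * 100        # coefficient of variation, in %
    if cv > 20:
        return "high"
    if cv >= 10:
        return "medium"
    return "low"

groups = prices.groupby("ticker")["close"].apply(volatility_group)
```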
All experiments were conducted in a controlled computational environment equipped with an NVIDIA RTX 4090 GPU (24 GB), an Intel Core i7-14700F CPU, and 128 GB of RAM. The software stack included PyTorch 2.4, Hugging Face Transformers 4.46, Jieba for Chinese word segmentation, and scikit-learn for machine learning implementations. This environment ensured consistency and reproducibility across all experiments.
Two datasets were constructed for evaluation. The classification dataset (Dataset C) paired financial news articles with manually assigned sentiment labels. In contrast, the regression dataset (Dataset R) aligned articles with the corresponding stock closing prices at five-day and fifteen-day intervals. Both datasets were derived from the same corpus of 38,918 Chinese financial news articles covering Taiwan’s top 50 listed companies between 2021 and 2023, as described in
Section 3.3. To ensure robust and unbiased evaluation, all experiments were conducted using five-fold cross-validation, with results averaged across folds.
Model performance was assessed using standard evaluation metrics, with definitions provided in Equations (2)–(7). For the classification task, precision, recall, and F1-score (Equations (2)–(4)) were reported to capture complementary aspects of predictive quality. Precision reflects the reliability of positive predictions, recall measures the coverage of actual positives, and F1-score balances the trade-off between the two. For the regression task, mean squared error (MSE), mean absolute error (MAE), and the coefficient of determination ($R^2$) (Equations (5)–(7)) were employed. MSE penalizes larger errors more heavily, MAE provides an interpretable measure of average error magnitude, and $R^2$ quantifies the proportion of variance in stock prices explained by the model. Although the sentiment classification task is binary, macro-averaged metrics were adopted to ensure balanced evaluation across both positive and negative classes, thereby mitigating potential bias from class imbalance.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2) \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \quad (3) \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (4)$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \quad (5) \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert \quad (6) \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \quad (7)$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively; $y_i$ is the true value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of true values, and $n$ is the number of samples.
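For reference, the reported metrics can be computed with scikit-learn as sketched below, using macro averaging for the classification task.

```python
# Sketch of the classification and regression metrics defined in Equations (2)-(7).
from sklearn.metrics import (f1_score, mean_absolute_error, mean_squared_error,
                             precision_score, r2_score, recall_score)

def classification_metrics(y_true, y_pred):
    return {"precision": precision_score(y_true, y_pred, average="macro"),
            "recall": recall_score(y_true, y_pred, average="macro"),
            "f1": f1_score(y_true, y_pred, average="macro")}

def regression_metrics(y_true, y_pred):
    return {"mse": mean_squared_error(y_true, y_pred),
            "mae": mean_absolute_error(y_true, y_pred),
            "r2": r2_score(y_true, y_pred)}
```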
4.2. Experimental Results and Comparative Analysis: Sentiment Classification
The objective is to evaluate the classification performance of various modeling paradigms under different embedding strategies, thereby identifying the most effective embedding–model combinations for financial sentiment analysis. Experiments were conducted by systematically varying both the embedding methods (one-hot, TF-IDF, CBOW, skip-gram, and BERT) and the model categories (machine learning, deep learning, transformer-based models, and large language models). Evaluation was based on macro-averaged precision, recall, and F1-score across all companies in the Taiwan 0050 index.
As summarized in
Table 2, which reports the performance of traditional machine learning baselines, sparse bag-of-words representations paired with generative or linear classifiers delivered the strongest results. Specifically, the combination of one-hot encoding with Naïve Bayes achieved the highest F1-score (0.606), closely followed by TF-IDF with Naïve Bayes (0.598). Among dense embeddings, the pairings of skip-gram with logistic regression (F1 = 0.582) and CBOW with gradient boosting (F1 = 0.569) represented the best-performing combinations. When shallow learners were combined with contextualized BERT embeddings, performance was only modest (best: gradient boosting, F1 = 0.565), highlighting a mismatch between highly contextual representations and linear or tree-based classifiers. These findings suggest that sparse representations are more compatible with classical classifiers, whereas dense or contextual embeddings require more expressive architectures to fully realize their potential.
The performance of deep learning architectures is presented in
Table 3, which combines Word2Vec (CBOW and skip-gram) and BERT embeddings with convolutional and recurrent models. Among these, skip-gram with CNN achieved the highest F1-score (0.660), establishing CNNs as the most effective architecture for sentiment classification in this setting. CBOW embeddings showed relatively stronger synergy with recurrent models such as RNN (F1 = 0.651), while contextual BERT embeddings consistently improved performance across CNN, GRU, and RNN, with CNN again yielding the strongest result (F1 = 0.646). These observations indicate that CNNs are particularly effective when paired with dense or contextual embeddings, whereas recurrent models are more responsive to CBOW representations.
In contrast, transformer-based encoders and LLaMA3 further advanced classification performance, as shown in
Table 4. RoBERTa (F1 = 0.734) and BERT (F1 = 0.730) achieved competitive results, while BART performed slightly worse (F1 = 0.719). LLaMA3 outperformed all other models, earning an F1-score of 0.746, which demonstrates the advantages of large-scale pre-training and full fine-tuning in this domain.
Beyond aggregate metrics,
Figure 3 visualizes classification results across the 50 constituent companies of the Taiwan 0050 index. Considerable variation was observed: some companies consistently achieved high F1-scores across models, whereas others remained challenging regardless of method. This heterogeneity suggests that while LLaMA3 provides the strongest overall performance, model selection may still need to be tailored to company-specific characteristics.
In summary, three key observations emerge. First, sparse embeddings remain competitive when paired with simple classifiers; however, their effectiveness is limited compared to more advanced representations. Second, CNNs show strong adaptability across embedding types, making them reliable architectures for sentiment classification. Third, large-scale pre-trained models, particularly LLaMA3, deliver superior performance, highlighting the value of contextualization and full-parameter fine-tuning in financial sentiment analysis.
4.3. Regression Task Analysis for Volatility-Based Grouping
The regression experiments evaluate predictive performance across volatility-based groups using multiple model families. The results are reported in
Table 5,
Table 6 and
Table 7, and visualized in
Figure 4,
Figure 5 and
Figure 6, allowing for a systematic comparison across volatility tiers, model categories, and forecast horizons.
Figure 4 illustrates the volatility distribution of the Taiwan 0050 constituent stocks as measured by CV. Based on the thresholds defined in
Section 4.1, eight companies were classified as high volatility (CV > 20%), 31 companies as medium volatility (10% ≤ CV ≤ 20%), and 11 companies as low volatility (CV < 10%). The figure highlights substantial heterogeneity: companies such as 3231 and 3661 exhibit extreme volatility (approximately 60% and 59%, respectively), whereas firms such as 2912 and 1216 remain relatively stable with CV values close to 3%. This stratification provides a consistent basis for evaluating model performance across varying levels of market uncertainty.
Table 5 reports the regression outcomes for nine traditional machine learning models across the three volatility groups. Bayesian Linear Regression (BLR) achieved the most reliable explanatory power, with $R^2$ values of 0.486, 0.590, and 0.368 in the high-, medium-, and low-volatility groups, respectively. Standard linear regression performed moderately in stable markets ($R^2$ = 0.116 for the low-volatility group) but deteriorated sharply under high volatility ($R^2$ = 0.069). Ensemble methods, such as Random Forest and Gradient Boosting, produced competitive MAE and MSE values in stable environments but yielded negative $R^2$ values under volatile conditions, indicating limited generalization. Distance- and kernel-based models, particularly Support Vector Regression, were highly unstable, with $R^2$ as low as −6.215 in the high-volatility group, reflecting severe sensitivity to market fluctuations.
Table 6 summarizes the performance of deep neural networks. Recurrent models demonstrated clear advantages. In the high-volatility group, LSTM attained the best explanatory power ($R^2$ = 0.604), outperforming GRU ($R^2$ = 0.588) and CNN ($R^2$ = 0.527). For medium-volatility stocks, RNN achieved the highest $R^2$ (0.685), closely followed by LSTM (0.654). In the low-volatility group, LSTM again performed best ($R^2$ = 0.438), with GRU providing a competitive alternative ($R^2$ = 0.420). These findings underscore the strength of recurrent architectures, particularly LSTM, in capturing sequential dependencies and delivering robust predictions across volatility regimes.
Table 7 presents the regression results of transformer-based encoders and the large language model LLaMA3. Among these, LLaMA3 was the only model to achieve a positive $R^2$ in the high-volatility group (0.190), whereas BERT, RoBERTa, and BART all produced negative values across all volatility tiers. In medium- and low-volatility markets, LLaMA3 still recorded a negative $R^2$, although its MAE and MSE were comparatively lower, suggesting closer alignment with observed prices in absolute terms. These outcomes indicate that while full fine-tuning of large language models provides some resilience in highly unstable markets, transformer-based methods remain less effective than recurrent networks for regression tasks.
Across the three volatility groups (Table 5, Table 6 and Table 7), recurrent architectures consistently outperform other models. LSTM remains the most reliable predictor overall, achieving the highest or near-highest $R^2$ across all market conditions and demonstrating strong adaptability to both stable and turbulent regimes. Bayesian Linear Regression provides a solid baseline for low-volatility stocks, while LLaMA3, though less accurate, shows notable robustness in high-volatility markets. These results highlight the superiority of sequential modeling for capturing dynamic financial patterns.
Figure 5 and
Figure 6 compare the forecasting performance of BLR and LSTM under five-day and fifteen-day horizons across the three volatility groups. For low-volatility companies (green bars), both models maintain relatively low MSE values, with only moderate increases when the horizon is extended, reflecting the greater predictability of stable stocks. In the medium-volatility group (yellow bars), horizon effects become more pronounced: BLR shows sharp increases in error at fifteen days, while LSTM exhibits smaller but still noticeable degradation, indicating its stronger ability to capture sequential dependencies under moderate uncertainty. The most striking differences occur in the high-volatility group (red bars), where BLR’s fifteen-day forecasts frequently produce MSE values exceeding the five-day forecasts by multiple orders of magnitude, underscoring its inability to model unstable markets. LSTM follows the same general trend of worsening performance with longer horizons, yet its MSE values remain consistently and substantially lower than those of BLR across all volatility tiers. These findings confirm that predictive accuracy diminishes as the forecast horizon extends, particularly in volatile conditions, but also highlight the comparative robustness of recurrent architectures relative to linear baselines.
4.4. Case Study: TSMC and Alchip
To further illustrate the practical relevance of financial sentiment classification and text-based stock prediction, we conducted a qualitative case study on two prominent Taiwanese semiconductor firms—Taiwan Semiconductor Manufacturing Company (TSMC, 2330) and Alchip Technologies (Alchip-KY, 3661)—both constituents of the Taiwan 0050 index. In line with the CV-based volatility classification defined in
Section 4.1, TSMC exhibits relatively stable price dynamics characteristic of medium-volatility stocks. In contrast, Alchip shows pronounced variability consistent with a high-volatility profile. These contrasting conditions provide a compelling basis for examining how sentiment-driven models behave across different volatility regimes.
Table 8 reports LLaMA3 regression performance under three input configurations—text and numerical features, numerical-only, and text-only—for both companies at the 5-day and 15-day horizons. This experiment empirically tests the proposed hypothesis regarding the integration of textual and numerical features for short-term forecasting. For TSMC, combining text and numerical features yields the lowest errors (MSE = 87.12 and 93.05 for 5-day and 15-day, respectively), outperforming text-only (3807 and 3974) and numerical-only (388 and 380) variants; the large gap between fusion and text-only underscores the utility of structured price signals for a relatively stable stock. For Alchip, the advantage of multimodal fusion is dramatic: text and numerical features achieve MSE = 16,893 (5-day) and 24,965 (15-day), whereas numerical-only surges above 400,000 and text-only approaches 2.83 million at both horizons, reflecting the difficulty of capturing highly volatile dynamics without complementary information. These ablations indicate that text signals alone are insufficient in turbulent settings, and the relative value of text versus numerical inputs is company- and volatility-dependent.
Figure 7 and
Figure 8 visualize 5-day forecasts for TSMC with Bayesian Linear Regression (BLR) and LLaMA3, respectively. Both recover broad up- and down-moves, yet their error profiles differ systematically: BLR adheres more closely to local peaks and troughs, producing smaller deviations in stable segments, whereas LLaMA3 renders smoother trajectories that capture long-run structure but understate sharp short-term swings. This contrast highlights a trade-off between short-horizon alignment (BLR) and global-trend tracking (LLaMA3) when volatility is moderate.
Figure 9 (LLaMA3) and
Figure 10 (BLR) present 5-day forecasts for Alchip, whose price path during the observation window shows sharper short-term swings and stronger upward momentum than TSMC. LLaMA3 aligns well with the prevailing trend yet remains smoother, while BLR more tightly tracks rapid oscillations. Taken together, these results portray complementary strengths under high volatility—LLMs better reflect the global direction, whereas linear models provide finer local responsiveness. Consistent with these patterns, sentiment classifiers attain high accuracy for both companies (e.g., TSMC LLaMA3 F1 = 0.97 vs. CNN = 0.91; Alchip CNN = 0.96 and LLaMA3 = 0.96), while regression is markedly harder for Alchip: LLaMA3’s MSE rises to 16,893 (5-day) and 24,965 (15-day), numerical-only errors exceed 400,000, and text-only peaks near 2.8 million, underscoring the role of sentiment cues and feature fusion in mitigating uncertainty for volatile stocks.
Across
Table 8 and
Figure 7,
Figure 8,
Figure 9 and
Figure 10, three conclusions emerge. First, multimodal fusion (combining text and numerical data) is consistently beneficial and becomes increasingly essential as volatility increases. Second, at the company level, BLR offers strong short-term alignment in stable segments. In contrast, LLaMA3 captures longer-horizon structure, suggesting that model selection should reflect whether the objective prioritizes local precision or global trend detection. Third, volatility amplifies horizon risk: while both models degrade as horizons extend, the degradation is much steeper for volatile series, reinforcing the need to tailor architectures and inputs to the volatility profile of the target asset. These case-specific findings align with the aggregate patterns in
Figure 5 and
Figure 6, where longer horizons consistently reduce accuracy across volatility tiers, with the most severe degradation observed in high-volatility stocks.
6. Conclusions and Future Work
This study proposed a comprehensive framework that integrates financial sentiment classification and short-term stock price forecasting using Chinese news reports from Taiwan’s top 50 listed companies. By systematically comparing multiple embeddings alongside machine learning, deep learning, transformer-based, and large language models, the study generated several key insights. In sentiment classification, contextualized embeddings and large-scale pretrained models demonstrated clear superiority over traditional baselines, highlighting the importance of context-aware representations for financial text understanding. In stock price regression, recurrent neural networks, particularly LSTM, consistently outperformed statistical and transformer-based alternatives across different volatility levels, confirming their robustness for sequential financial forecasting. However, all models showed performance degradation as the forecasting horizon extended, reflecting the inherent difficulty of medium-term prediction. Company-specific case studies on TSMC and Alchip further validated that model behavior aligns with real-world market dynamics—interpretable linear models suffice for stable firms, whereas sequential and multimodal approaches are better suited to capture uncertainty in high-volatility contexts.
While certain limitations remain, we identify several valuable directions for future work. Future research could extend the proposed framework to multilingual and cross-market datasets, incorporate multimodal information, and explore hybrid designs that integrate sentiment-based predictors with traditional stochastic financial models. Beyond methodological advances, this study offers practical implications for financial analysts and investors. The findings indicate that sentiment signals derived from financial news can serve as early indicators of potential market shifts, especially under volatile conditions where conventional quantitative indicators may lag. For stable firms, interpretable models can capture predictable patterns efficiently, while for more volatile firms, sequential or multimodal architectures are better equipped to handle uncertainty. Incorporating sentiment-based forecasting models into decision-making pipelines can thus enhance risk evaluation and support more informed portfolio management strategies.