1. Introduction
In the era of real-time information flows, financial news plays a pivotal role in shaping investor sentiment and influencing short-term fluctuations in stock prices. Early studies have empirically shown that the tone and amount of media coverage can influence market trends and investor behavior [
1]. Compared with traditional quantitative indicators such as historical prices, trading volume, or technical indices, financial texts frequently provide early warning signals that anticipate market reactions before they are reflected in numerical data. This phenomenon is particularly pronounced in emerging markets, where timely news reports can trigger rapid investor responses and directly shape short-term investment behavior. Consequently, text-based approaches to financial forecasting have become a growing focus in both academic research and practical financial applications.
Although natural language processing (NLP) techniques have been widely applied to financial forecasting, most prior studies exhibit three critical limitations. First, the majority of research has concentrated on English-language corpora, especially those from U.S. and European markets [
2,
3]. Models trained under such conditions often fail to generalize effectively to non-English contexts, where linguistic structures, segmentation issues, multiple meanings, and specialized financial terms differ substantially. In the case of Chinese financial texts, these challenges are further compounded by the lack of large-scale annotated data, limiting the ability of existing approaches to achieve stable performance.
While limited data remain a challenge, the connection between textual sentiment and market behavior is widely recognized. To illustrate this relationship,
Figure 1 shows a simple example: three short news headlines are marked as positive, neutral, or negative, corresponding respectively to rising, stable, and falling stock trends. This example demonstrates how the tone of financial news reflects investor expectations and affects short-term market reactions.
Second, previous works have typically treated sentiment classification and stock price prediction as separate tasks [
4,
5,
6,
7,
8]. While sentiment analysis is commonly used to assess market emotions and regression models are employed for numerical prediction, the lack of integration between these two tasks has hindered a deeper understanding of how signals derived from news actually translate into predictive value for financial markets. Most studies adopt a two-step process, where sentiment scores or indices are first generated from financial texts and then used as inputs for later forecasting models, such as the BERT–LSTM systems developed by Hiew et al. [
9] and Gu et al. [
10]. Although these methods demonstrate that sentiment information is beneficial for predicting market trends, they generally stop at feature-level fusion rather than performing end-to-end learning that directly connects qualitative sentiment with quantitative price changes.
Recently, some studies have tried to close this gap by modeling textual sentiment and market data together within unified or multimodal systems. For instance, Ho et al. [
11] built a cooperative network that merged social sentiment (via CNN) and price charts (via image-based CNN) to predict stock movement. Nguyen et al. [
12] proposed DASF-Net, a graph-based framework that adapts sentiment signals to stock relationships. Koval et al. [
13] introduced a multimodal model with separate experts for text and time-series data to improve forecasting. Third, despite these advances, few works have provided a broad evaluation of different text representations and model types under the same setting, particularly for non-English financial data. This motivates the present study to design a unified framework that examines how sentiment features, multimodal strategies, and prediction architectures jointly affect forecasting performance.
In this study, sentiment analysis refers to examining opinions and emotions in text, while sentiment classification means grouping texts into positive or negative categories. Based on these ideas, this work investigates three questions: (i) whether contextual embeddings offer clear advantages over traditional features in sentiment classification, (ii) to what extent sentiment from news helps short-term stock prediction, and (iii) how different levels of market volatility influence this relationship. These questions guide the design of our experiments.
To address these challenges, this study proposes a comprehensive dual-task framework that integrates binary sentiment classification and short-term stock price regression using Chinese financial news. The dataset comprises 38,918 news articles covering Taiwan’s top 50 listed companies (the Taiwan 0050 index) between 2021 and 2023, along with corresponding stock market data. This corpus not only reflects the real-world dynamics of Taiwan’s capital market but also provides a valuable benchmark for testing language-based models in non-English environments. To ensure annotation reliability, sentiment labels were manually assigned by annotators with finance backgrounds and validated through a double-checking process.
Within this framework, five representative types of word embeddings were systematically investigated: one-hot encoding, TF-IDF, Word2Vec (continuous bag-of-words and skip-gram) [
14], and contextual embeddings derived from BERT [
15]. These were combined with 17 prediction models across four categories: (1) traditional machine-learning classifiers such as Naïve Bayes (NB), logistic regression (LR), and random forest (RF); (2) deep neural models such as convolutional neural networks (CNNs) [
16] and long short-term memory networks (LSTMs) [
17]; (3) transformer-based encoders such as BERT [
15], RoBERTa [
18], and BART [
19]; and (4) a large language model (LLaMA3) [
20], fine-tuned on Chinese financial data. By adopting a unified experimental setup, the study provides a detailed comparison of model performance across both sentiment classification and regression tasks.
This work further incorporates multiple forecasting horizons and a volatility-aware analysis. In addition to five-day prediction windows, fifteen-day horizons are examined to evaluate the stability of models across varying time scales. Moreover, the constituent stocks of the Taiwan 0050 index are grouped by volatility, using the coefficient of variation (CV) of closing prices as a normalized metric. This design enables a detailed assessment of how predictive accuracy varies between low-volatility (stable) and high-volatility (uncertain) stocks—an aspect of particular importance for real-world investment strategies.
In summary, this paper makes three contributions:
It introduces an integrated dual-task framework that combines financial sentiment classification and short-term stock price forecasting, representing one of the most extensive evaluations of Chinese financial texts from Taiwan’s top 50 listed companies.
It provides comprehensive insights into the relative strengths of different paradigms, including traditional machine learning, deep neural networks, transformers, and large language models, by systematically comparing five embedding methods and seventeen model architectures under a unified protocol.
It explores how sentiment-based predictions relate to actual market behavior by analyzing model performance across time horizons and volatility groups, supported by case studies of firms such as TSMC and Alchip.
The remainder of this article is structured as follows:
Section 2 reviews related work;
Section 3 details the proposed framework and methodology;
Section 4 presents the experimental setup and results;
Section 5 discusses and interprets the results; and
Section 6 concludes with key findings and future directions.
3. Methodology
3.1. Problem Formulation
This study addresses two interconnected predictive tasks: binary sentiment classification of financial news and short-term stock price forecasting. Together, these tasks offer a comprehensive understanding of how textual information influences financial decision making. For the sentiment classification task, each financial news article $d_i$ is treated as an input document. Formally, we assume $d_i \in \mathcal{D}$, where $\mathcal{D}$ denotes the space of tokenized and preprocessed news texts represented as sequences of words or embeddings. In practice, $\mathcal{D}$ can be instantiated as vectorized representations, such as TF-IDF features or pre-trained embeddings, depending on the model used. The objective is to assign a binary sentiment label $y_i \in \{0, 1\}$, where 1 denotes positive sentiment (optimistic outlook or favorable signals) and 0 represents negative sentiment (pessimistic outlook or unfavorable signals). Formally, the classifier learns a mapping function:

$$f_{\mathrm{cls}}: \mathcal{D} \rightarrow \{0, 1\}, \qquad \hat{y}_i = f_{\mathrm{cls}}(d_i)$$
For the regression task, the aim is to predict the future closing price of a stock. Let $p_t \in \mathbb{R}_{\geq 0}$ denote the closing price on day $t$, where $\mathbb{R}_{\geq 0}$ is the set of non-negative real numbers. Given financial news and stock data up to day $t$, denoted as $\mathcal{X}_t$, the regression model forecasts the closing price at day $t + h$, where $h \in \{5, 15\}$ represents short-term (five-day) and mid-term (fifteen-day) horizons:

$$f_{\mathrm{reg}}: \mathcal{X}_t \rightarrow \mathbb{R}_{\geq 0}, \qquad \hat{p}_{t+h} = f_{\mathrm{reg}}(\mathcal{X}_t)$$
The choice of closing price is motivated by its widespread use as a benchmark indicator of market performance, as it reflects the outcome of intraday trading and represents the market’s consensus valuation at the end of the day. Prior studies have also adopted the closing price as the primary target variable in stock forecasting tasks due to its stability and interpretability [
42,
51]. Compared to opening or intraday prices, the closing price reduces the influence of short-lived fluctuations and aligns naturally with the daily release of financial news. To capture short- and near-term market reactions, two prediction horizons were set: five days and fifteen days. The 5-day horizon reflects immediate sentiment-driven movements, while the 15-day horizon examines the persistence of short-term trends. Such settings are consistent with prior stock forecasting studies that employ short- to medium-term horizons within 5–20 trading days [
52,
53,
54]. This horizon design enables the evaluation of how predictive performance varies with time scale—shorter horizons align more closely with sentiment signals, and longer ones reflect broader market dynamics.
Building on prior findings that textual sentiment conveys investors’ expectations beyond numerical indicators, previous studies [
55,
56] have demonstrated that integrating textual and quantitative features can enhance the predictive performance of stock forecasting models. Accordingly, we hypothesize that combining sentiment-derived textual representations with numerical stock variables yields superior short-term forecasting accuracy compared to using either modality in isolation.
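To make this hypothesis concrete, the sketch below concatenates a sentence-level text vector with numerical stock features before fitting a simple regressor. The concatenation strategy, feature layout, and toy values are illustrative assumptions, not the exact fusion design used in the experiments.

```python
# Minimal sketch of the feature-fusion hypothesis: text vector + numerical stock
# features, concatenated and fed to a simple regressor (all values are toy data).
import numpy as np
from sklearn.linear_model import BayesianRidge

def fuse(text_vec: np.ndarray, stock_vec: np.ndarray) -> np.ndarray:
    """Concatenate the textual and numerical modalities into one feature vector."""
    return np.concatenate([text_vec, stock_vec])

rng = np.random.default_rng(0)
text_vecs = rng.normal(size=(4, 768))                    # e.g., BERT sentence vectors
stock_vecs = np.array([[600, 610, 595, 605, 2.1e7],      # open, high, low, close, volume
                       [605, 612, 600, 610, 1.8e7],
                       [610, 618, 606, 615, 1.5e7],
                       [615, 620, 608, 612, 1.7e7]], dtype=float)
X = np.stack([fuse(t, s) for t, s in zip(text_vecs, stock_vecs)])
y = np.array([612.0, 618.0, 620.0, 616.0])               # toy closing prices at t+5

BayesianRidge().fit(X, y)                                 # placeholder regressor
```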
3.2. System Architecture
The proposed framework consists of five sequential stages—data collection, preprocessing, word embedding representation, model training and validation, and prediction and evaluation—integrated into an end-to-end workflow for financial news analysis and stock price forecasting (
Figure 2). During the data collection stage, financial news was obtained from the CMoney platform (
https://www.cmoney.tw/, accessed on 15 August 2024), and daily stock data for Taiwan’s top 50 listed companies were retrieved from Yahoo Finance (
https://finance.yahoo.com/, accessed on 15 August 2024). Each article was paired with the corresponding company and trading date to ensure consistency between textual and numerical data. Preprocessing then reduced news texts to concise representations, removed extraneous noise, and applied segmentation to preserve the most informative content for subsequent analysis. For representation, five embedding approaches were employed to transform text into numerical vectors, thereby capturing lexical, semantic, and contextual information. During model training and validation, a diverse set of machine learning, deep learning, and transformer-based encoders was evaluated under a unified protocol, together with a large language model, LLaMA3 [
20], which was fully fine-tuned on the Chinese financial news dataset. Finally, model outputs were assessed using classification and regression metrics, with further analyses conducted across forecasting horizons and volatility groups to evaluate robustness.
3.3. Data Collection, Labeling, and Preprocessing
The dataset integrates Chinese financial news articles with corresponding stock market data, forming parallel resources for both classification and regression tasks. Financial news was obtained from CMoney, a widely used financial information platform in Taiwan, while stock market data—including open, high, low, close, and adjusted close prices—were retrieved from Yahoo Finance. The corpus covers reports related to Taiwan’s top 50 listed companies (the Taiwan 0050 index) from January 2021 to December 2023. Each news article was aligned with the corresponding company and trading date to ensure consistency between textual and numerical data. In total, the labeled corpus comprised 38,918 financial news articles, of which 18,453 were annotated as negative, reflecting pessimistic or downward-oriented sentiment, and 20,465 as positive, reflecting optimistic or upward-oriented sentiment. This balanced distribution between positive and negative samples provides a robust foundation for training and evaluating sentiment classification models.
Two task-specific datasets were constructed: a classification dataset (C) that pairs each article with a sentiment label, and a regression dataset (R) that links articles with their future closing prices. To avoid bias from duplicated reports within the same trading day, only the final article per company per day was retained in the regression dataset. The regression target corresponds to the closing price on the 5th and 15th trading day after the news release. Sentiment annotation was conducted by undergraduate students with finance backgrounds. Each article was assigned a binary label (1 for positive sentiment, such as favorable performance, expansion, or investment; 0 for negative sentiment, such as financial loss, risk exposure, or policy restrictions). A double-checking mechanism was implemented, whereby disagreements were resolved through consensus, resulting in a reliable corpus for supervised learning.
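A minimal sketch of how the regression dataset (R) can be assembled is shown below. The column names and toy rows are illustrative assumptions, but the two documented rules are applied: only the final article per company per day is kept, and the targets are the closing prices on the 5th and 15th trading day after the news release.

```python
# Sketch of Dataset R construction from aligned news and price tables (toy data).
import pandas as pd

news = pd.DataFrame({
    "ticker": ["2330", "2330"],
    "date": pd.to_datetime(["2021-03-01", "2021-03-01"]),
    "text": ["台積電早盤消息……", "台積電尾盤消息……"],
})
prices = pd.DataFrame({
    "ticker": ["2330"] * 20,
    "date": pd.bdate_range("2021-03-01", periods=20),
    "close": [600.0 + i for i in range(20)],
})

# Keep only the final article per company per trading day (avoids duplicate-day bias).
news = news.groupby(["ticker", "date"]).tail(1)

# Targets: closing price on the 5th and 15th trading day after the news release.
prices = prices.sort_values(["ticker", "date"])
prices["close_t5"] = prices.groupby("ticker")["close"].shift(-5)
prices["close_t15"] = prices.groupby("ticker")["close"].shift(-15)

dataset_r = news.merge(prices, on=["ticker", "date"], how="inner")
```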
To prepare the texts for modeling, a three-step preprocessing procedure was adopted: reduction, cleaning, and segmentation. First, only the opening sentence of each article was retained, using the first occurrence of the Chinese full stop as the segmentation boundary. Prior studies in journalism and natural language processing show that the opening sentence or headline of financial news typically conveys the article’s main message [
57,
58]. Moreover, the Chinese full stop denotes the completion of an independent idea, while commas connect related clauses [
59], supporting this strategy as both efficient and semantically coherent. This reduction step shortened articles from an average of approximately 640 characters to 96 characters while maintaining the main summary of each news item. Second, cleaning operations were applied to remove extraneous content such as punctuation, numbers, and special symbols. Third, word segmentation was performed using the Jieba toolkit, which was supplemented with an extended, domain-specific lexicon to improve the recognition of company names and financial terminology. The resulting preprocessed corpus was then transformed into numerical representations in the embedding stage.
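The three preprocessing steps can be sketched as follows; the user-dictionary filename and the example article are illustrative assumptions.

```python
# Sketch of the reduction-cleaning-segmentation pipeline described above.
import re
import jieba

# jieba.load_userdict("finance_lexicon.txt")  # hypothetical extended lexicon of
#                                             # company names and financial terms

def preprocess(article: str) -> list[str]:
    # 1) Reduction: keep only the text before the first Chinese full stop.
    first_sentence = article.split("。", 1)[0]
    # 2) Cleaning: drop punctuation, numbers, and special symbols.
    cleaned = re.sub(r"[^\u4e00-\u9fffA-Za-z]", "", first_sentence)
    # 3) Segmentation with Jieba.
    return jieba.lcut(cleaned)

tokens = preprocess("台積電第一季營收優於預期，法人看好後市。分析師指出其展望仍佳。")
```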
3.4. Word Embedding
A critical step in natural language processing is transforming unstructured text into numerical representations that can be processed by machine learning and deep learning models. In this study, five representative embedding approaches were employed to capture different levels of lexical and semantic information from Chinese financial texts. These methods were selected not only for their prevalence in prior financial sentiment research but also for their ability to illustrate the trade-offs among sparse, dense, and contextualized representations.
One-hot encoding, the most basic representation, assigns each unique token a binary vector with a single dimension set to one and all others set to zero. While straightforward and easy to interpret, this method suffers from extreme sparsity and captures no semantic relationships between words; it was therefore included primarily as a baseline for comparison. TF-IDF improves upon this by weighting word frequency against document distribution, producing sparse yet informative vectors that suit traditional classifiers such as logistic regression or support vector machines. However, TF-IDF still cannot capture word order or contextual meaning.
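For illustration, both sparse representations can be produced with scikit-learn as sketched below; the toy documents are assumed to be pre-segmented and space-joined, and the relaxed token pattern is an assumption made to keep single-character Chinese tokens.

```python
# Sketch of the two sparse representations (one-hot and TF-IDF) on segmented text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["台積電 營收 創 新高", "面板 廠 虧損 擴大"]          # toy segmented examples
token_pattern = r"(?u)\b\w+\b"                               # keep single-character tokens

onehot = CountVectorizer(binary=True, token_pattern=token_pattern).fit_transform(docs)
tfidf = TfidfVectorizer(token_pattern=token_pattern).fit_transform(docs)
```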
To address these limitations, dense embeddings were generated using Word2Vec [
14]. Both the CBOW and skip-gram architectures were trained on the financial corpus. CBOW predicts a target word based on its surrounding context and is particularly efficient for frequent terms. In contrast, skip-gram predicts surrounding words given a target, capturing rare and domain-specific vocabulary more effectively. These dense embeddings yield compact vector spaces in which semantically related words are positioned closer together, providing richer input for deep neural models such as CNNs and LSTMs.
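A minimal gensim sketch of the two Word2Vec variants is given below; the vector dimension, window size, and toy corpus are illustrative choices rather than the reported settings.

```python
# Sketch of training CBOW and skip-gram embeddings on a segmented corpus.
from gensim.models import Word2Vec

corpus = [["台積電", "營收", "創", "新高"], ["面板", "廠", "虧損", "擴大"]]

cbow = Word2Vec(corpus, vector_size=300, window=5, sg=0, min_count=1)      # CBOW
skipgram = Word2Vec(corpus, vector_size=300, window=5, sg=1, min_count=1)  # skip-gram

vec = skipgram.wv["營收"]   # dense 300-d vector for a token
```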
Finally, contextualized embeddings were obtained using BERT (Bidirectional Encoder Representations from Transformers) [
15]. Unlike static embeddings, BERT produces word representations that adapt dynamically to surrounding tokens. This property is fundamental in financial texts, where the meaning of terms often shifts depending on context (e.g., “margin” in accounting versus trading). In this study, a Chinese BERT model was employed to capture nuanced semantics and long-range dependencies, thereby achieving state-of-the-art performance in sentiment classification and related tasks.
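The sketch below extracts contextualized sentence vectors from the Chinese BERT checkpoint named in Section 3.5 (ckiplab/bert-base-chinese); mean pooling over the final hidden states is an illustrative pooling choice, not necessarily the one used in the experiments.

```python
# Sketch of contextualized sentence embeddings with a Chinese BERT encoder.
import torch
from transformers import AutoModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # tokenizer suggested for ckiplab checkpoints
model = AutoModel.from_pretrained("ckiplab/bert-base-chinese")

texts = ["台積電第一季營收優於預期", "面板廠虧損擴大"]
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512,
                   return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state              # (batch, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_vecs = (hidden * mask).sum(1) / mask.sum(1)         # mean-pooled 768-d vectors
```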
The inclusion of these five embedding methods was not intended merely as an independent application of each but rather as a systematic evaluation of their compatibility with different model paradigms. Sparse representations (one-hot, TF-IDF) were expected to align better with traditional machine learning algorithms due to their interpretability and linear separability. Dense embeddings (Word2Vec) were better suited for deep learning architectures, which can leverage continuous vector spaces to capture compositional semantics. Contextualized embeddings (BERT) were anticipated to yield advantages in transformer-based models, where contextual cues are critical for classification and forecasting. By comparing all embedding–model combinations under a unified evaluation pipeline, this study aims to identify the most effective representation strategies for Chinese financial sentiment classification and stock price forecasting. It is worth noting that LLaMA3 [
20], while inherently generating contextualized embeddings through its transformer layers, was not paired with external embedding methods. Instead, it was fine-tuned as an end-to-end predictive model, and its performance was evaluated alongside the other embedding-model combinations.
3.5. Model Training
To systematically evaluate the performance of different learning paradigms in financial sentiment classification and stock price forecasting, we implemented a comprehensive suite of models spanning traditional machine learning algorithms, deep neural architectures, transformer-based encoders, and a large language model. Each model was paired with the appropriate embedding technique and trained separately for both classification and regression tasks.
The traditional machine learning models included nine widely used algorithms: AdaBoost [
60], decision trees [
61], Naïve Bayes, gradient boosting [
62], k-nearest neighbors [
63], logistic regression [
64], multilayer perceptrons [
65], random forests [
66], and support vector machines [
67]. These models were trained on static word representations such as one-hot, TF-IDF, and Word2Vec embeddings, and were selected for their interpretability and computational efficiency, particularly for smaller-scale datasets.
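A condensed sketch of the evaluation loop for these classical learners is shown below, using the five-fold cross-validation and macro-averaged F1 described in Section 4.1; the toy corpus stands in for Dataset C.

```python
# Sketch of classical learners on sparse features under 5-fold CV with macro-F1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["營收 創 新高", "獲利 優於 預期", "訂單 強勁 成長", "展望 樂觀", "擴大 投資",
        "虧損 擴大", "營收 下滑", "需求 疲弱", "裁員 風險", "獲利 衰退"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]        # 1 = positive, 0 = negative

models = {"NB": MultinomialNB(),
          "LR": LogisticRegression(max_iter=1000),
          "RF": RandomForestClassifier(n_estimators=200)}

for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), clf)
    scores = cross_val_score(pipe, docs, labels, cv=5, scoring="f1_macro")
    print(name, round(scores.mean(), 3))
```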
To capture sequential and semantic information in financial texts, we further examined four deep neural networks: convolutional neural networks [
16], recurrent neural networks [
68], gated recurrent units [
69], and long short-term memory networks [
17]. Convolutional layers were used to detect local n-gram patterns, whereas recurrent structures were designed to capture temporal dependencies across token sequences. These models were primarily trained on dense Word2Vec embeddings, with dropout regularization and standard optimization strategies applied to mitigate overfitting.
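As an illustration, a minimal LSTM classifier over dense Word2Vec inputs is sketched below; the embedding and hidden dimensions are assumptions, while the batch size (32), sequence length (256), and dropout rate (0.2) follow Table 1.

```python
# Sketch of an LSTM sentiment classifier over pre-computed dense word embeddings.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, emb_dim=300, hidden=128, num_classes=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.dropout = nn.Dropout(dropout)      # dropout of 0.2, as in Table 1
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                       # x: (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(x)              # final hidden state
        return self.fc(self.dropout(h_n[-1]))

model = LSTMClassifier()
logits = model(torch.randn(32, 256, 300))       # batch 32, sequence length 256
```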
Transformer-based models were also incorporated, specifically BERT [
15], BART [
19], and RoBERTa [
18], which were fine-tuned on the labeled Chinese financial corpus. Pretrained models were obtained from Hugging Face, namely BERT (ckiplab/bert-base-chinese), RoBERTa (xlm-roberta-base), and BART (fnlp/bart-base-chinese). These models generated contextualized embeddings at the sentence- or document-level and passed them through task-specific heads for either classification or regression. Fine-tuning was conducted under controlled optimization settings, with early stopping applied to mitigate the risks of overfitting that arise from limited training data.
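The fine-tuning procedure can be sketched with the Hugging Face Trainer as follows; the learning rate and the toy data are assumptions, the batch size (16) and epoch count (5) follow Table 1, and early stopping is omitted for brevity.

```python
# Sketch of fine-tuning a Chinese BERT checkpoint for binary sentiment classification.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "ckiplab/bert-base-chinese", num_labels=2)

# Toy stand-in for Dataset C (text, 0 = negative, 1 = positive).
raw = Dataset.from_dict({"text": ["營收創新高", "虧損擴大"], "label": [1, 0]})
ds = raw.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                  padding="max_length", max_length=512), batched=True)

args = TrainingArguments(output_dir="bert_sentiment",
                         per_device_train_batch_size=16,
                         num_train_epochs=5,
                         learning_rate=2e-5,        # assumed value; not reported in the text
                         logging_steps=10)
Trainer(model=model, args=args, train_dataset=ds).train()
```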
Finally, we extended the comparison to a large language model by including LLaMA3 [
20], specifically the Llama-3.2-1B implementation available on Hugging Face (
https://huggingface.co/meta-llama/Llama-3.2-1B, accessed on 1 March 2025). Unlike the other models, which relied on external embeddings, LLaMA3 was fully fine-tuned on the Chinese financial news dataset, allowing all parameters to adapt end-to-end for both classification and regression. This design highlights its distinct role as an integrated predictive framework, evaluated alongside the embedding–model combinations.
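A minimal sketch of full-parameter fine-tuning of Llama-3.2-1B with a sequence-level head is given below; the pad-token handling, learning rate, and the regression variant shown in the comment are illustrative assumptions rather than the authors' exact recipe.

```python
# Sketch of full-parameter fine-tuning of Llama-3.2-1B with a classification head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token            # Llama defines no pad token by default

model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Regression variant (closing-price target): a single output trained with MSE loss.
# model = AutoModelForSequenceClassification.from_pretrained(
#     name, num_labels=1, problem_type="regression")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # all parameters trainable
```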
To ensure fairness and reproducibility, all models were trained under standardized procedures in the same computational environment. Hyperparameters, including batch size, learning rate, sequence length, and training epochs, are summarized in
Table 1.
To further clarify the hyperparameter settings, several important considerations are noted. For the CNN, RNN, GRU, and LSTM models, the batch size was set to 32, balancing training stability with computational efficiency. In contrast, transformer-based models (BERT and LLaMA3) adopted smaller batch sizes (16 and 8, respectively) due to their larger parameter space and higher memory requirements. The number of epochs was standardized across tasks to ensure fairness in comparison: a smaller number for the classification task (5) to prevent overfitting on binary labels, and a larger number for the regression task (20) to capture long-term temporal dependencies. The maximum sequence length was fixed at 256 for recurrent models and LLaMA3, but extended to 512 for BERT to leverage its ability to process longer contexts. Learning rates were tuned according to best practices for each model family, with lower rates preventing catastrophic forgetting during training. Dropout regularization was applied only to deep neural architectures (set to 0.2), while transformer-based models relied on their intrinsic regularization mechanisms. Collectively, these standardized settings ensured reproducibility and fair cross-model comparisons.
4. Experiments
4.1. Experimental Setup and Evaluation Metrics
The experiments were designed to address four main objectives. First, we compared five types of word embeddings in combination with a broad set of machine learning, deep learning, and transformer-based models to evaluate their effectiveness in binary sentiment classification of Chinese financial news. Second, we examined the same model–embedding combinations in regression tasks, forecasting stock prices at both five-day and fifteen-day horizons. Third, we investigated the impact of prediction horizon length by contrasting the relative difficulty of short-term and mid-term predictions. Finally, we analyzed the prediction performance across volatility groups by dividing the constituent stocks of the Taiwan 0050 index into three categories—high, medium, and low volatility—based on the coefficient of variation (CV) of closing prices.
The CV was adopted as a normalized volatility measure to facilitate fair comparisons across companies with varying price levels. It is defined as the ratio of the standard deviation ($\sigma$) to the mean closing price ($\mu$), expressed as a percentage:

$$\mathrm{CV} = \frac{\sigma}{\mu} \times 100\%$$

Based on this metric, companies were stratified into three groups: high volatility (CV > 20%), medium volatility (10% ≤ CV ≤ 20%), and low volatility (CV < 10%).
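The stratification rule can be expressed directly as a short sketch; the toy price frame is illustrative.

```python
# Sketch of the CV-based volatility stratification using the stated thresholds.
import pandas as pd

prices = pd.DataFrame({
    "ticker": ["2330"] * 3 + ["3661"] * 3,
    "close":  [560.0, 575.0, 590.0, 700.0, 1100.0, 1500.0],   # toy closing prices
})

def volatility_group(closes: pd.Series) -> str:
    cv = closes.std() / closes.mean() * 100        # coefficient of variation, in %
    if cv > 20:
        return "high"
    if cv >= 10:
        return "medium"
    return "low"

groups = prices.groupby("ticker")["close"].apply(volatility_group)
```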
All experiments were conducted in a controlled computational environment equipped with an NVIDIA RTX 4090 GPU (24 GB), an Intel Core i7-14700F CPU, and 128 GB of RAM. The software stack included PyTorch 2.4, Hugging Face Transformers 4.46, Jieba for Chinese word segmentation, and scikit-learn for machine learning implementations. This environment ensured consistency and reproducibility across all experiments.
Two datasets were constructed for evaluation. The classification dataset (Dataset C) paired financial news articles with manually assigned sentiment labels. In contrast, the regression dataset (Dataset R) aligned articles with the corresponding stock closing prices at five-day and fifteen-day intervals. Both datasets were derived from the same corpus of 38,918 Chinese financial news articles covering Taiwan’s top 50 listed companies between 2021 and 2023, as described in
Section 3.3. To ensure robust and unbiased evaluation, all experiments were conducted using five-fold cross-validation, with results averaged across folds.
Model performance was assessed using standard evaluation metrics, with definitions provided in Equations (2)–(7). For the classification task, precision, recall, and F1-score (Equations (2)–(4)) were reported to capture complementary aspects of predictive quality. Precision reflects the reliability of positive predictions, recall measures the coverage of actual positives, and F1-score balances the trade-off between the two. For the regression task, mean squared error (MSE), mean absolute error (MAE), and the coefficient of determination ($R^2$) (Equations (5)–(7)) were employed. MSE penalizes larger errors more heavily, MAE provides an interpretable measure of average error magnitude, and $R^2$ quantifies the proportion of variance in stock prices explained by the model. Although the sentiment classification task is binary, macro-averaged metrics were adopted to ensure balanced evaluation across both positive and negative classes, thereby mitigating potential bias from class imbalance.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2) \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \quad (3) \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (4)$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \quad (5) \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert \quad (6) \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \quad (7)$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively; $y_i$ is the true value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of true values, and $n$ is the number of samples.
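For reference, the reported metrics can be computed with scikit-learn as sketched below, using macro averaging for the classification task.

```python
# Sketch of the classification and regression metrics defined in Equations (2)-(7).
from sklearn.metrics import (f1_score, mean_absolute_error, mean_squared_error,
                             precision_score, r2_score, recall_score)

def classification_metrics(y_true, y_pred):
    return {"precision": precision_score(y_true, y_pred, average="macro"),
            "recall": recall_score(y_true, y_pred, average="macro"),
            "f1": f1_score(y_true, y_pred, average="macro")}

def regression_metrics(y_true, y_pred):
    return {"mse": mean_squared_error(y_true, y_pred),
            "mae": mean_absolute_error(y_true, y_pred),
            "r2": r2_score(y_true, y_pred)}
```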
4.2. Experimental Results and Comparative Analysis: Sentiment Classification
The objective is to evaluate the classification performance of various modeling paradigms under different embedding strategies, thereby identifying the most effective embedding–model combinations for financial sentiment analysis. Experiments were conducted by systematically varying both the embedding methods (one-hot, TF-IDF, CBOW, skip-gram, and BERT) and the model categories (machine learning, deep learning, transformer-based models, and large language models). Evaluation was based on macro-averaged precision, recall, and F1-score across all companies in the Taiwan 0050 index.
As summarized in
Table 2, which reports the performance of traditional machine learning baselines, sparse bag-of-words representations paired with generative or linear classifiers delivered the strongest results. Specifically, the combination of one-hot encoding with Naïve Bayes achieved the highest F1-score (0.606), closely followed by TF-IDF with Naïve Bayes (0.598). Among dense embeddings, the pairings of skip-gram with logistic regression (F1 = 0.582) and CBOW with gradient boosting (F1 = 0.569) represented the best-performing combinations. When shallow learners were combined with contextualized BERT embeddings, performance was only modest (best: gradient boosting, F1 = 0.565), highlighting a mismatch between highly contextual representations and linear or tree-based classifiers. These findings suggest that sparse representations are more compatible with classical classifiers, whereas dense or contextual embeddings require more expressive architectures to fully realize their potential.
The performance of deep learning architectures is presented in
Table 3, which combines Word2Vec (CBOW and skip-gram) and BERT embeddings with convolutional and recurrent models. Among these, skip-gram with CNN achieved the highest F1-score (0.660), establishing CNNs as the most effective architecture for sentiment classification in this setting. CBOW embeddings showed relatively stronger synergy with recurrent models such as RNN (F1 = 0.651), while contextual BERT embeddings consistently improved performance across CNN, GRU, and RNN, with CNN again yielding the strongest result (F1 = 0.646). These observations indicate that CNNs are particularly effective when paired with dense or contextual embeddings, whereas recurrent models are more responsive to CBOW representations.
In contrast, transformer-based encoders and LLaMA3 further advanced classification performance, as shown in
Table 4. RoBERTa (F1 = 0.734) and BERT (F1 = 0.730) achieved competitive results, while BART performed slightly worse (F1 = 0.719). LLaMA3 outperformed all other models, earning an F1-score of 0.746, which demonstrates the advantages of large-scale pre-training and full fine-tuning in this domain.
Beyond aggregate metrics,
Figure 3 visualizes classification results across the 50 constituent companies of the Taiwan 0050 index. Considerable variation was observed: some companies consistently achieved high F1-scores across models, whereas others remained challenging regardless of method. This heterogeneity suggests that while LLaMA3 provides the strongest overall performance, model selection may still need to be tailored to company-specific characteristics.
In summary, three key observations emerge. First, sparse embeddings remain competitive when paired with simple classifiers; however, their effectiveness is limited compared to more advanced representations. Second, CNNs show strong adaptability across embedding types, making them reliable architectures for sentiment classification. Third, large-scale pre-trained models, particularly LLaMA3, deliver superior performance, highlighting the value of contextualization and full-parameter fine-tuning in financial sentiment analysis.
4.3. Regression Task Analysis for Volatility-Based Grouping
The regression experiments evaluate predictive performance across volatility-based groups using multiple model families. The results are reported in
Table 5,
Table 6 and
Table 7, and visualized in
Figure 4,
Figure 5 and
Figure 6, allowing for a systematic comparison across volatility tiers, model categories, and forecast horizons.
Figure 4 illustrates the volatility distribution of the Taiwan 0050 constituent stocks as measured by CV. Based on the thresholds defined in
Section 4.1, eight companies were classified as high volatility (CV > 20%), 31 companies as medium volatility (10% ≤ CV ≤ 20%), and 11 companies as low volatility (CV < 10%). The figure highlights substantial heterogeneity: companies such as 3231 and 3661 exhibit extreme volatility (approximately 60% and 59%, respectively), whereas firms such as 2912 and 1216 remain relatively stable with CV values close to 3%. This stratification provides a consistent basis for evaluating model performance across varying levels of market uncertainty.
Table 5 reports the regression outcomes for nine traditional machine learning models across the three volatility groups. Bayesian Linear Regression (BLR) achieved the most reliable explanatory power, with $R^2$ values of 0.486, 0.590, and 0.368 in the high-, medium-, and low-volatility groups, respectively. Standard linear regression performed moderately in stable markets ($R^2$ = 0.116 for the low-volatility group) but deteriorated sharply under high volatility ($R^2$ = 0.069). Ensemble methods, such as Random Forest and Gradient Boosting, produced competitive MAE and MSE values in stable environments but yielded negative $R^2$ values under volatile conditions, indicating limited generalization. Distance- and kernel-based models, particularly Support Vector Regression, were highly unstable, with $R^2$ as low as −6.215 in the high-volatility group, reflecting severe sensitivity to market fluctuations.
Table 6 summarizes the performance of deep neural networks. Recurrent models demonstrated clear advantages. In the high-volatility group, LSTM attained the best explanatory power ($R^2$ = 0.604), outperforming GRU ($R^2$ = 0.588) and CNN ($R^2$ = 0.527). For medium-volatility stocks, RNN achieved the highest $R^2$ (0.685), closely followed by LSTM (0.654). In the low-volatility group, LSTM again performed best ($R^2$ = 0.438), with GRU providing a competitive alternative ($R^2$ = 0.420). These findings underscore the strength of recurrent architectures, particularly LSTM, in capturing sequential dependencies and delivering robust predictions across volatility regimes.
Table 7 presents the regression results of transformer-based encoders and the large language model LLaMA3. Among these, LLaMA3 was the only model to achieve a positive $R^2$ in the high-volatility group (0.190), whereas BERT, RoBERTa, and BART all produced negative values across all volatility tiers. In medium- and low-volatility markets, LLaMA3 still recorded a negative $R^2$, although its MAE and MSE were comparatively lower, suggesting closer alignment with observed prices in absolute terms. These outcomes indicate that while full fine-tuning of large language models provides some resilience in highly unstable markets, transformer-based methods remain less effective than recurrent networks for regression tasks.
Across the three volatility groups (Table 5, Table 6 and Table 7), recurrent architectures consistently outperform other models. LSTM remains the most reliable predictor overall, achieving the highest or near-highest $R^2$ across all market conditions and demonstrating strong adaptability to both stable and turbulent regimes. Bayesian Linear Regression provides a solid baseline for low-volatility stocks, while LLaMA3, though less accurate, shows notable robustness in high-volatility markets. These results highlight the superiority of sequential modeling for capturing dynamic financial patterns.
Figure 5 and
Figure 6 compare the forecasting performance of BLR and LSTM under five-day and fifteen-day horizons across the three volatility groups. For low-volatility companies (green bars), both models maintain relatively low MSE values, with only moderate increases when the horizon is extended, reflecting the greater predictability of stable stocks. In the medium-volatility group (yellow bars), horizon effects become more pronounced: BLR shows sharp increases in error at fifteen days, while LSTM exhibits smaller but still noticeable degradation, indicating its stronger ability to capture sequential dependencies under moderate uncertainty. The most striking differences occur in the high-volatility group (red bars), where BLR’s fifteen-day forecasts frequently produce MSE values exceeding the five-day forecasts by multiple orders of magnitude, underscoring its inability to model unstable markets. LSTM follows the same general trend of worsening performance with longer horizons, yet its MSE values remain consistently and substantially lower than those of BLR across all volatility tiers. These findings confirm that predictive accuracy diminishes as the forecast horizon extends, particularly in volatile conditions, but also highlight the comparative robustness of recurrent architectures relative to linear baselines.
4.4. Case Study: TSMC and Alchip
To further illustrate the practical relevance of financial sentiment classification and text-based stock prediction, we conducted a qualitative case study on two prominent Taiwanese semiconductor firms—Taiwan Semiconductor Manufacturing Company (TSMC, 2330) and Alchip Technologies (Alchip-KY, 3661)—both constituents of the Taiwan 0050 index. In line with the CV-based volatility classification defined in
Section 4.1, TSMC exhibits relatively stable price dynamics characteristic of medium-volatility stocks. In contrast, Alchip shows pronounced variability consistent with a high-volatility profile. These contrasting conditions provide a compelling basis for examining how sentiment-driven models behave across different volatility regimes.
Table 8 reports LLaMA3 regression performance under three input configurations—text and numerical features, numerical-only, and text-only—for both companies at the 5-day and 15-day horizons. This experiment empirically tests the proposed hypothesis regarding the integration of textual and numerical features for short-term forecasting. For TSMC, combining text and numerical features yields the lowest errors (MSE = 87.12 and 93.05 for 5-day and 15-day, respectively), outperforming text-only (3807 and 3974) and numerical-only (388 and 380) variants; the large gap between fusion and text-only underscores the utility of structured price signals for a relatively stable stock. For Alchip, the advantage of multimodal fusion is dramatic: text and numerical features achieve MSE = 16,893 (5-day) and 24,965 (15-day), whereas numerical-only surges above 400,000 and text-only approaches 2.83 million at both horizons, reflecting the difficulty of capturing highly volatile dynamics without complementary information. These ablations indicate that text signals alone are insufficient in turbulent settings, and the relative value of text versus numerical inputs is company- and volatility-dependent.
Figure 7 and
Figure 8 visualize 5-day forecasts for TSMC with Bayesian Linear Regression (BLR) and LLaMA3, respectively. Both recover broad up- and down-moves, yet their error profiles differ systematically: BLR adheres more closely to local peaks and troughs, producing smaller deviations in stable segments, whereas LLaMA3 renders smoother trajectories that capture long-run structure but understate sharp short-term swings. This contrast highlights a trade-off between short-horizon alignment (BLR) and global-trend tracking (LLaMA3) when volatility is moderate.
Figure 9 (LLaMA3) and
Figure 10 (BLR) present 5-day forecasts for Alchip, whose price path during the observation window shows sharper short-term swings and stronger upward momentum than TSMC. LLaMA3 aligns well with the prevailing trend yet remains smoother, while BLR more tightly tracks rapid oscillations. Taken together, these results portray complementary strengths under high volatility—LLMs better reflect the global direction, whereas linear models provide finer local responsiveness. Consistent with these patterns, sentiment classifiers attain high accuracy for both companies (e.g., TSMC LLaMA3 F1 = 0.97 vs. CNN = 0.91; Alchip CNN = 0.96 and LLaMA3 = 0.96), while regression is markedly harder for Alchip: LLaMA3’s MSE rises to 16,893 (5-day) and 24,965 (15-day), numerical-only errors exceed 400,000, and text-only peaks near 2.8 million, underscoring the role of sentiment cues and feature fusion in mitigating uncertainty for volatile stocks.
Across
Table 8 and
Figure 7,
Figure 8,
Figure 9 and
Figure 10, three conclusions emerge. First, multimodal fusion (combining text and numerical data) is consistently beneficial and becomes increasingly essential as volatility increases. Second, at the company level, BLR offers strong short-term alignment in stable segments. In contrast, LLaMA3 captures longer-horizon structure, suggesting that model selection should reflect whether the objective prioritizes local precision or global trend detection. Third, volatility amplifies horizon risk: while both models degrade as horizons extend, the degradation is much steeper for volatile series, reinforcing the need to tailor architectures and inputs to the volatility profile of the target asset. These case-specific findings align with the aggregate patterns in
Figure 5 and
Figure 6, where longer horizons consistently reduce accuracy across volatility tiers, with the most severe degradation observed in high-volatility stocks.
6. Conclusions and Future Work
This study proposed a comprehensive framework that integrates financial sentiment classification and short-term stock price forecasting using Chinese news reports from Taiwan’s top 50 listed companies. By systematically comparing multiple embeddings alongside machine learning, deep learning, transformer-based, and large language models, the study generated several key insights. In sentiment classification, contextualized embeddings and large-scale pretrained models demonstrated clear superiority over traditional baselines, highlighting the importance of context-aware representations for financial text understanding. In stock price regression, recurrent neural networks, particularly LSTM, consistently outperformed statistical and transformer-based alternatives across different volatility levels, confirming their robustness for sequential financial forecasting. However, all models showed performance degradation as the forecasting horizon extended, reflecting the inherent difficulty of medium-term prediction. Company-specific case studies on TSMC and Alchip further validated that model behavior aligns with real-world market dynamics—interpretable linear models suffice for stable firms, whereas sequential and multimodal approaches are better suited to capture uncertainty in high-volatility contexts.
While certain limitations remain, we identify several valuable directions for future work. Future research could extend the proposed framework to multilingual and cross-market datasets, incorporate multimodal information, and explore hybrid designs that integrate sentiment-based predictors with traditional stochastic financial models. Beyond methodological advances, this study offers practical implications for financial analysts and investors. The findings indicate that sentiment signals derived from financial news can serve as early indicators of potential market shifts, especially under volatile conditions where conventional quantitative indicators may lag. For stable firms, interpretable models can capture predictable patterns efficiently, while for more volatile firms, sequential or multimodal architectures are better equipped to handle uncertainty. Incorporating sentiment-based forecasting models into decision-making pipelines can thus enhance risk evaluation and support more informed portfolio management strategies.