PLTA-FinBERT: Pseudo-Label Generation-Based Test-Time Adaptation for Financial Sentiment Analysis

Yang, Hai; Chen, Hainan; Jiang, Chang; He, Juntao; Li, Pengyang

doi:10.3390/bdcc10020059


by Hai Yang ¹, Hainan Chen ¹,*, Chang Jiang ², Juntao He ¹ and Pengyang Li ¹

¹ School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China

² Knorr-Bremse Commercial Vehicle Systems (Shanghai) Co., Ltd. Suzhou Branch, Suzhou 215000, China

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2026, 10(2), 59; https://doi.org/10.3390/bdcc10020059

Submission received: 11 December 2025 / Revised: 14 January 2026 / Accepted: 2 February 2026 / Published: 11 February 2026


Abstract

Financial sentiment analysis leverages natural language processing techniques to quantitatively assess sentiment polarity and emotional tendencies in financial texts. Its practical application in investment decision-making and risk management faces two major challenges: the scarcity of high-quality labeled data due to expert annotation costs, and semantic drift caused by the continuous evolution of market language. To address these issues, this study proposes PLTA-FinBERT, a pseudo-label generation-based test-time adaptation framework that enables dynamic self-learning without requiring additional labeled data. The framework consists of two modules: a multi-perturbation pseudo-label generation mechanism that enhances label reliability through consistency voting and confidence-based filtering, and a test-time dynamic adaptation strategy that iteratively updates model parameters based on high-confidence pseudo-labels, allowing the model to continuously adapt to new linguistic patterns. PLTA-FinBERT achieves an accuracy of 0.8288 on the Financial Sentiment Analysis classification dataset, an absolute improvement of 2.37 percentage points over the benchmark. On the FiQA sentiment intensity prediction task, it obtains an R² of 0.58, surpassing the previous state of the art by 0.03.

Keywords:

financial sentiment analysis; pseudo-labeling; test-time adaptation; FinBERT

1. Introduction

The rapid development of natural language processing has fostered remarkable progress in numerous application domains, including healthcare, law, and finance [1,2,3]. Among these, financial sentiment analysis has become a highly active research direction due to its significant impact on investment decision-making, market monitoring, and risk management [4]. Despite the strong sentiment understanding capabilities of large language models, their high deployment cost underscores the need for efficient, domain-specific alternatives such as the pre-trained BERT [5] and its financial variant FinBERT [6], which have driven significant progress in this task. However, financial sentiment analysis based on these models still faces two core challenges: high-cost manual annotation and semantic drift caused by the dynamic evolution of financial language [7]. First, financial texts are highly specialized and contain complex terminology, and their sentiment often has to be judged against the macroeconomic background. Annotating them therefore requires substantial expert effort. For instance, building corpora such as FinLin demands extensive expert participation and multi-stage validation mechanisms to ensure labeling consistency [8]. The resulting scarcity of high-quality labeled data severely limits the performance of supervised models. Second, financial language evolves with changes in the macroeconomic environment and market events, a phenomenon known as “semantic drift”. For example, the term “quantitative easing” was regarded as an active intervention measure during the 2008 global financial crisis, but was often considered a signal of risk against the backdrop of high inflation in 2022. Rubtsova [9] highlights the performance degradation of financial sentiment analysis models over time, observing an annual decay rate of 8 to 12 percentage points on financial review corpora, which indicates that traditional static models lack sufficient adaptability to dynamic language expressions.

Existing studies have explored several adaptation strategies to mitigate these issues. Sedinkina et al. [10] developed an unsupervised adaptive method that automatically constructs sentiment lexicons for target domains, effectively improving model performance on financial market data. Despite its effectiveness, this approach depends on pre-built sentiment lexicons as semantic anchors, and its rule-based expansion mechanism can only add new terms rather than capture semantic drift in existing vocabulary. Building upon domain-specific pre-training, Araci [6] proposed FinBERT, which employs a two-stage training strategy involving general pre-training followed by domain-specific fine-tuning. This design significantly enhances the model’s ability to capture financial terminology and sentiment cues, yet its offline training mode fixes the model parameters, preventing adaptation to temporal shifts in language usage. Similarly, Gururangan et al. [11] introduced a Domain-Adaptive Pre-Training framework that improves a language model’s domain adaptability and task performance through task-aware continual pre-training. Nevertheless, it still suffers from an “evaluation–adaptation gap” in which test sets serve merely for validation instead of being integrated into iterative optimization, leading to delayed model updates and a slow response to evolving semantics.

Therefore, this paper proposes PLTA-FinBERT, a pseudo-label-driven test-time adaptation framework. The code (commit hash 7f8d6db) is publicly available at the GitHub repository, https://github.com/galaxywwww/PLTA-FinBERT (accessed on 15 September 2025). The proposed method enables the model to autonomously generate pseudo-labels for unlabeled test samples, and iteratively update parameters through confidence thresholds. The main contributions of this work are summarized as follows:

1. We propose a pseudo-label generation mechanism that integrates multi-perturbation prediction with confidence-based filtering to ensure pseudo-label reliability and reduce reliance on manual annotation.
2. We establish a test-time adaptation strategy that enables FinBERT to dynamically update itself during inference, thereby overcoming the limitations of traditional static models.
3. We conduct empirical evaluations on the FiQA and Financial Sentiment Analysis benchmark datasets, demonstrating that the proposed method achieves state-of-the-art performance across multiple metrics.

2. Related Work

2.1. Financial Sentiment Analysis

Research on financial text sentiment analysis has evolved from dictionary-based rules [12] and statistical learning methods [13] to deep neural networks and pre-trained language models. Early studies primarily employed lexicon-based sentiment analysis approaches, such as the financial-domain dictionary developed by Loughran and McDonald [14], along with conventional machine learning methods [15], exemplified by the support vector machine combined with the bag-of-words model for sentiment classification of financial product reviews proposed by Malo et al. [16]. Although these methods are effective in specific scenarios, they struggle to handle complex semantic expressions. With the development of deep learning, Yang et al. [17] constructed a financial news sentiment classification system based on the LSTM model, which effectively captured the sequential characteristics and contextual dependencies of the text. Building on this foundation, Johnson et al. [18] introduced a contrastive learning mechanism to enhance the model’s ability to perceive subtle semantic differences, making it more robust in long-text processing and fine-grained sentiment recognition. In recent years, pre-trained language models have driven a paradigm shift in financial sentiment analysis. FinBERT, proposed by Araci [6], improved financial text representation through domain-adaptive training. Wu et al. [19] further developed BloombergGPT, which is trained on massive financial corpora and demonstrates excellent generalization performance in downstream tasks such as sentiment classification and event detection. Beyond model architecture innovations, model adaptation strategies have also been continuously optimized. FinGPT, proposed by Wang et al. [20], and FinLLaMA, proposed by Konstantinidis et al. [21], utilize parameter-efficient fine-tuning strategies such as LoRA and PEFT, respectively, to achieve rapid migration of large language models to financial tasks. Although existing models have achieved significant progress in domain adaptability and accuracy, most approaches still adopt a static training paradigm: models are optimized in a single pass during the training phase, with parameters remaining fixed during inference. This design struggles to address the “semantic drift” problem prevalent in financial markets. Therefore, there is an urgent need for a sustainably updatable learning framework that endows models with “continual learning” capabilities, enabling dynamic adaptation to emerging semantic expressions.

2.2. Test-Time Training

Test-Time Training (TTT) is an emerging paradigm for model adaptation whose core idea is to allow the model to update a subset of its parameters during the testing phase, thereby enhancing its generalization ability in unseen or distribution-shifted environments [22]. Depending on the source of the adaptation signal, TTT methods can be categorized into auxiliary self-supervised task-driven [23] and pseudo-label-driven approaches [24]. Early studies primarily relied on auxiliary self-supervised tasks and were mostly applied in the field of computer vision. For example, Sun et al. [22] designed a rotation-prediction-based auxiliary task for image classification, which enabled fine-tuning of the feature extractor without relying on labels, effectively improving model robustness under distribution shifts. As this paradigm was introduced into natural language processing, pseudo-label-driven TTT methods emerged, in which the model leverages its own predictions to generate pseudo-labels for adaptation. Banerjee et al. [25] proposed the Test-Time Self-Training framework, which dynamically fine-tunes a BERT model on reading comprehension tasks using generated pseudo question–answer pairs, significantly improving accuracy and F1 scores on the SQuAD and NewsQA datasets, and demonstrating the feasibility and potential of TTT in NLP tasks. The method proposed in this work also belongs to the pseudo-label-driven TTT variant. The model first performs inference on the test samples and treats high-confidence predictions as pseudo-labels to iteratively update model parameters during the testing phase, thereby improving performance on unseen data.

3. Method

The PLTA-FinBERT framework, illustrated in Figure 1, builds on the pre-trained FinBERT model and comprises two core modules. (1) Multi-perturbation prediction and confidence filtering: each sample is augmented by multiple perturbations to obtain multiple predictions, and task-specific statistical confidence thresholds are applied to strictly filter the pseudo-labels. (2) Dynamic update during testing: during the inference phase, each test sample is used to generate a self-supervised signal that updates the model parameters. The updated model is then applied to the next round of inference on the test set, forming an alternating cycle of prediction and adaptation. In this process, the test set plays a dual role, serving both as an evaluation benchmark and as an unlabeled data source for model adaptation, thereby enabling lifelong learning of the model.

3.1. Multi-Perturbation Prediction and Confidence Filtering

3.1.1. Data Augmentation

First, we fine-tune FinBERT on the labeled training set to optimize its performance for the specific financial sentiment analysis task, while cultivating the meta-learning capabilities needed for subsequent pseudo-label generation. Second, to enhance label diversity while preserving sentiment polarity, we employ a controlled-noise perturbation strategy that randomly applies operations such as synonym substitution for non-critical terms and stylistic adaptation toward informal registers. Each original sample undergoes N independent perturbations to generate N semantically consistent yet linguistically varied counterparts. This augmentation approach captures real-world variations in financial text while keeping the underlying sentiment orientation stable.
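The perturbation step can be illustrated with a minimal sketch. It covers only synonym substitution, and the synonym table, the list of protected sentiment words, and the function names are all hypothetical choices made for illustration; the perturbation operators actually used are not specified at this level of detail.

```python
import random

# Hypothetical synonym table for non-critical terms; a real system would
# rely on a curated lexicon or a paraphrase model instead.
SYNONYMS = {
    "company": ["firm", "business"],
    "said": ["stated", "reported"],
    "shares": ["stock", "equity"],
}

# Assumed list of sentiment-bearing words that are never replaced.
PROTECTED = {"rose", "fell", "profit", "loss", "gain", "decline"}


def perturb(text: str, rng: random.Random, swap_prob: float = 0.5) -> str:
    """Return one variant of `text` with some non-critical words replaced."""
    out = []
    for token in text.split():
        key = token.lower()
        if key not in PROTECTED and key in SYNONYMS and rng.random() < swap_prob:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(token)
    return " ".join(out)


def make_variants(text: str, n: int, seed: int = 0) -> list[str]:
    """Generate n independently perturbed variants of one sample."""
    rng = random.Random(seed)
    return [perturb(text, rng) for _ in range(n)]


variants = make_variants("The company said profit rose as shares gained", n=5)
```

Because the protected words are never touched, every variant keeps the tokens that carry the sentiment, which is the property the strategy relies on.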

3.1.2. Confidence Filtering

We design separate confidence filtering mechanisms for classification and regression tasks to ensure the quality of pseudo-labels. For classification tasks, a majority voting strategy based on prediction consistency is adopted. The predictions for the N variant samples are counted, and the class with the highest frequency is taken as the candidate pseudo-label. The pseudo-label is accepted for model updating only when the frequency of this class reaches a preset threshold,

$$\hat{y}_{c} = \begin{cases} \arg\max_{c \in C} \sum_{n=1}^{N} I(f(x_{n}) = c), & \text{if } \frac{\max_{c \in C} \sum_{n=1}^{N} I(f(x_{n}) = c)}{N} \geq \tau_{c} \\ \varnothing, & \text{otherwise} \end{cases} \tag{1}$$

where $\tau_{c} \in [0, 1]$ is the classification confidence threshold, $C$ denotes the set of target categories, $\varnothing$ represents filtered samples, $N$ is the number of perturbed variants per sample, $f(x_{n})$ is the model prediction for the n-th variant, and $I(\cdot)$ is the indicator function,

$$I(x) = \begin{cases} 1, & \text{if } x \text{ is true} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
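As a minimal sketch of Equation (1), the voting filter can be written in a few lines. The function name and the threshold values in the usage note are illustrative assumptions, and ties are broken by first occurrence, which Equation (1) does not specify.

```python
from collections import Counter
from typing import Optional, Sequence


def vote_pseudo_label(preds: Sequence[str], tau_c: float) -> Optional[str]:
    """Eq. (1): majority class if its vote share is at least tau_c, else None."""
    if not preds:
        return None
    # most_common(1) returns the class with the highest count.
    label, count = Counter(preds).most_common(1)[0]
    return label if count / len(preds) >= tau_c else None
```

For example, with eight variants of which six are predicted positive, the vote share is 0.75, so the pseudo-label is accepted under a threshold of 0.7 and rejected (returned as `None`) under a threshold of 0.8.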

For regression tasks, a variance constraint strategy based on prediction stability is implemented. The mean of the N predicted values is taken as the candidate pseudo-label, and it is adopted only when the standard deviation of the predictions does not exceed a threshold,

$$\hat{y}_{r} = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} f(x_{n}), & \text{if } \sqrt{\frac{1}{N} \sum_{n=1}^{N} (f(x_{n}) - \bar{f})^{2}} \leq \tau_{r} \\ \varnothing, & \text{otherwise} \end{cases} \tag{3}$$

where $\tau_{r}$ is the regression threshold for variance control, and $\bar{f} = \frac{1}{N} \sum_{n=1}^{N} f(x_{n})$ represents the mean of predictions.
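Equation (3) can be sketched in the same way. The function name is a hypothetical choice; the spread is computed as the population standard deviation (dividing by N), matching the formula above.

```python
import math
from typing import Optional, Sequence


def stable_mean_pseudo_label(preds: Sequence[float], tau_r: float) -> Optional[float]:
    """Eq. (3): mean prediction if the population std is at most tau_r, else None."""
    if not preds:
        return None
    mean = sum(preds) / len(preds)
    std = math.sqrt(sum((p - mean) ** 2 for p in preds) / len(preds))
    return mean if std <= tau_r else None
```

With predictions 0.4, 0.5 and 0.6 the standard deviation is about 0.08, so a threshold of 0.1 accepts the mean 0.5, whereas predictions of −0.5 and 0.5 (standard deviation 0.5) are filtered out.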

Through this mechanism, only part of the test samples are selected to participate in the self-update of the model, effectively improving the stability and generalization ability of self-supervised training.

3.2. Dynamic Update During Testing

To address the semantic drift problem in financial text data, we design a Test-Time Self-Learning mechanism that alternates between inference and parameter updating, thereby continuously refining the model and enhancing its adaptability to changes in data distribution. It leverages high-confidence pseudo-labels to construct self-supervised signals that drive the model to evolve dynamically during the testing phase. Specifically, the multi-perturbation prediction and confidence filtering mechanism proposed in Section 3.1 is applied to each input sample to generate a high-quality pseudo-label. If a valid pseudo-label is generated, the sample is formed into a single-sample training batch, and the model parameters are optimized with the loss function corresponding to the task type.

For classification tasks, the loss function combines cross-entropy loss with entropy regularization,

$$L_{cls} = \underbrace{- \sum_{c \in C} \hat{y}_{c} \log p_{c}}_{L_{PCE}} + \lambda \underbrace{\sum_{c \in C} p_{c} \log p_{c}}_{L_{ENT}} \tag{4}$$

where $p_{c}$ is the predicted probability of class $c$, $L_{PCE}$ is the standard cross-entropy loss, $L_{ENT}$ is the entropy regularization loss, and $\lambda \in [0, 1]$ is the hyperparameter controlling regularization strength. The entropy regularization loss is introduced to encourage the model to maintain moderate uncertainty during prediction, thereby alleviating the negative impact of propagating incorrect pseudo-labels.
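To make the two terms of Equation (4) concrete, the sketch below evaluates the loss for a single sample with a one-hot pseudo-label. The function name, the small constant `eps` added for numerical safety, and the example probabilities are assumptions for illustration only.

```python
import math
from typing import Sequence


def cls_loss(probs: Sequence[float], pseudo_idx: int, lam: float,
             eps: float = 1e-12) -> float:
    """Eq. (4) for a one-hot pseudo-label: L_PCE + lam * L_ENT."""
    # Cross-entropy against the pseudo-label (only the labeled class contributes).
    l_pce = -math.log(probs[pseudo_idx] + eps)
    # Sum of p * log p, i.e. the negative entropy of the prediction.
    l_ent = sum(p * math.log(p + eps) for p in probs)
    return l_pce + lam * l_ent


probs = [0.7, 0.2, 0.1]
loss_plain = cls_loss(probs, pseudo_idx=0, lam=0.0)
loss_reg = cls_loss(probs, pseudo_idx=0, lam=0.1)
```

Because the second term is the negative entropy, minimizing it pushes the prediction toward higher entropy, which is the "moderate uncertainty" effect described above; in this example the regularized loss is lower than the plain cross-entropy.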

For the regression task, mean squared error is used as the loss function,

$$L_{reg} = \frac{1}{N} \sum_{n=1}^{N} (f_{\theta}(x^{(n)}) - \hat{y}_{r})^{2} \tag{5}$$

After updating the parameters, the updated model is immediately used to perform a full inference on the entire test set. Subsequently, the above process is repeated as the model proceeds to the processing flow of the next sample, until all samples have undergone one round of processing. As shown in Algorithm 1, the system workflow consists of four key steps, and this closed-loop mechanism of “prediction → filtering → updating → prediction” endows the model with the ability of continuous evolution.

Algorithm 1 Self-Learning Framework

Require: Data $D$, initial model $M$, confidence threshold $\tau_{c}$, variance threshold $\tau_{v}$
Ensure: Updated model $M$

1: for each instance $x$ in $D$ do
2:     Step 1: Generate augmented predictions
3:     Initialize prediction set $P \leftarrow \emptyset$
4:     for $i = 1$ to $n$ do
5:         $x_{i}' \leftarrow \mathrm{augment}(x)$
6:         $p_{i} \leftarrow M.\mathrm{predict}(x_{i}')$
7:         $P \leftarrow P \cup \{p_{i}\}$
8:     end for
9:     Step 2: Pseudo-label generation & validation
10:     if $\mathit{TaskType} = \mathit{classification}$ then
11:         $y^{*} \leftarrow \mathrm{majority\_vote}(P)$
12:         $\mathit{conf} \leftarrow \mathrm{count}(p = y^{*} \mid p \in P) / n$
13:         $\mathit{valid} \leftarrow (\mathit{conf} \geq \tau_{c})$
14:     else
15:         $y^{*} \leftarrow \mathrm{mean}(P)$
16:         $\mathit{valid} \leftarrow (\mathrm{variance}(P) < \tau_{v})$
17:     end if
18:     Step 3: Conditional model update
19:     if $\mathit{valid}$ then
20:         $M.\mathrm{update}(x, y^{*})$
21:     end if
22:     Step 4: Make updated prediction
23:     $\hat{y} \leftarrow M.\mathrm{predict}(D)$
24: end for
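The control flow of Algorithm 1 can be traced with a small runnable sketch. Everything model-related here is a hypothetical stand-in: `ToyModel` replaces FinBERT with a crude lexicon score and a single learnable bias, `drop_one_word` replaces the perturbation strategy, and the threshold defaults are arbitrary. Only the four-step loop structure follows the algorithm.

```python
import random
from collections import Counter
from statistics import mean, pvariance


class ToyModel:
    """Hypothetical stand-in for FinBERT: a lexicon score plus a learnable bias."""

    def __init__(self) -> None:
        self.bias = 0.0

    def predict(self, text: str) -> float:
        words = text.split()
        score = 0.5 * (words.count("gain") - words.count("loss"))
        return max(-1.0, min(1.0, score + self.bias))

    def update(self, text: str, target: float, lr: float = 0.1) -> None:
        # One squared-error gradient step on the bias (clipping ignored).
        self.bias -= lr * 2.0 * (self.predict(text) - target)


def drop_one_word(text: str, rng: random.Random) -> str:
    """Toy perturbation: delete one randomly chosen word."""
    words = text.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
    return " ".join(words)


def adapt_at_test_time(data, model, n=8, task="regression",
                       tau_c=0.8, tau_v=0.05, seed=0):
    """Run one pass of the predict / filter / update / predict loop."""
    rng = random.Random(seed)
    history = []
    for x in data:
        # Step 1: predictions on n perturbed variants of x.
        preds = [model.predict(drop_one_word(x, rng)) for _ in range(n)]
        # Step 2: pseudo-label generation and validation.
        if task == "classification":
            y_star, votes = Counter(preds).most_common(1)[0]
            valid = votes / n >= tau_c
        else:
            y_star = mean(preds)
            valid = pvariance(preds) < tau_v
        # Step 3: update only when the pseudo-label passed the filter.
        if valid:
            model.update(x, y_star)
        # Step 4: re-run inference on the whole data set.
        history.append([model.predict(z) for z in data])
    return history


data = ["profits gain gain this quarter",
        "heavy loss and more loss ahead",
        "flat results"]
history = adapt_at_test_time(data, ToyModel())
```

Each entry of `history` holds the predictions for the full data set after one sample has been processed, which is the quantity from which a per-iteration metric curve would be computed.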

4. Experiment

4.1. Dataset

Financial Sentiment Analysis. This dataset is an extended and updated version of the classic Financial Phrase Bank corpus released by Malo et al. [16]. Compared with the original, it incorporates several enhancements that better reflect modern financial language, with key statistics summarized in Table 1. First, the total sample size has been expanded from 4837 to 5840 by adding financial news and social media texts from the 2020–2022 period, which capture recent market language. Second, emerging financial fields such as cryptocurrency and ESG investment have been added, whereas the original dataset mainly focused on traditional corporate financial news. Third, the texts include more social media-style financial expressions, compensating for the original dataset’s focus on formal financial texts.

FiQA-SA. The sentiment regression dataset FiQA-SA [26] is widely used in financial sentiment analysis tasks. We focus on its Task 1, which evaluates a model’s ability to predict sentiment orientation in financial texts as a regression problem. The dataset contains financial news headlines and finance-related tweets, with each sample annotated by human experts with three elements: the target financial entity, the relevant aspect, and a continuous sentiment score. Unlike traditional financial sentiment classification datasets, FiQA Task 1 provides fine-grained continuous sentiment scores ranging from −1 to 1, where 1 indicates the most positive sentiment and −1 the most negative. The distribution of sentiment scores is shown in Figure 2. This design better reflects real-world financial market analysis scenarios and enables models to develop more nuanced sentiment analysis.

4.2. Evaluation Metrics

As shown in Table 2, for classification tasks, accuracy measures the overall proportion of correct predictions, and the macro-averaged F1 score is used to reduce the impact of class imbalance, reflecting the model’s classification ability across all classes more fairly. For regression tasks, we use mean squared error (MSE) to quantify prediction error, and R² to measure the model’s ability to explain the variance of the target variable.
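For reference, the two less common metrics can be computed as in the following sketch, which uses the standard definitions of macro-averaged F1 and R². The function names and the toy labels in the tests are illustrative and are not taken from the experiments.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Because every class contributes equally to the macro average, a class that is never predicted correctly lowers the score regardless of how rare it is.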

To assess the stability of PLTA-FinBERT, we propose the Adaptive Gain Frequency (AGF) metric. It quantifies how frequently the model’s performance surpasses the baseline level across repeated evaluations, thereby distinguishing consistent performance improvements from stochastic fluctuations. The closer the value is to 1, the better the stability of the method. The definition of AGF is as follows:

$$AGF = \frac{1}{N} \sum_{i=1}^{N} I(V_{i} > V_{base}) \tag{6}$$

where $I(\cdot)$ evaluates to 1 when $V_{i}$ outperforms $V_{base}$ and to 0 otherwise, $N$ is the number of test iterations, $V_{i}$ is the metric value at the i-th test iteration, and $V_{base}$ is the benchmark value of the initial model.
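A minimal sketch of Equation (6) follows. The function name and the `higher_is_better` switch are assumptions: the switch covers metrics such as MSE, where "outperforms" means a lower value. The numbers in the usage note are made up for illustration.

```python
from typing import Sequence


def adaptive_gain_frequency(values: Sequence[float], base: float,
                            higher_is_better: bool = True) -> float:
    """Eq. (6): share of test iterations whose metric beats the initial model."""
    if not values:
        raise ValueError("values must be non-empty")
    wins = sum((v > base) if higher_is_better else (v < base) for v in values)
    return wins / len(values)
```

For instance, accuracies of 0.80, 0.82, 0.81 and 0.83 against a baseline of 0.805 give an AGF of 0.75, since three of the four iterations exceed the baseline.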

4.3. Baseline

LSTM + GLoVe. This traditional deep learning approach utilizes GLoVe word embeddings to capture global semantic representations and employs LSTM networks to model sequential dependencies in text. It has demonstrated robust performance in financial text classification tasks [27].

Fin-R1. Developed based on the Qwen2.5-7B architecture, this financial domain-specific large language model adopts a two-stage training strategy [28]. First, supervised fine-tuning is performed on the Fin-R1-Data dataset. Subsequently, the model is optimized using Group Relative Policy Optimization, a reinforcement learning algorithm that enhances professional reasoning capabilities in financial scenarios.

XLNet. This model employs permutation language modeling to achieve bidirectional context modeling, effectively addressing the mask bias issue inherent in BERT-style models [29].

BERT. As a bidirectional Transformer model based on masked language modeling, BERT learns general semantic representations through large-scale pre-training and has become a benchmark in natural language processing [5]. Its lightweight variant, DistilBERT, improves inference efficiency through knowledge distillation [30].

4.4. Experimental Parameters

This study builds upon the pre-trained language model FinBERT and adopts a discriminative hierarchical fine-tuning strategy to optimize model performance through differentiated learning rate mechanisms and selective parameter freezing. The lower layers maintain a low learning rate to preserve general linguistic features, while the higher layers utilize an elevated learning rate to enhance task-specific performance, thereby effectively mitigating catastrophic forgetting. The experimental setup is shown in Table 3.
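One common way to realize such a layer-wise schedule is a geometric decay of the learning rate from the top layer downward, with the lowest layers frozen. The sketch below shows that scheme only as an assumed example: the decay rule, the function name, and all numeric values (12 layers, a top rate of 2e-5, a decay of 0.85, two frozen layers) are hypothetical and are not the settings reported in Table 3.

```python
def layerwise_learning_rates(num_layers: int, top_lr: float, decay: float,
                             frozen: int = 0) -> list[float]:
    """Learning rate per encoder layer; index 0 is the lowest layer.

    Frozen layers get a rate of 0.0; the remaining layers decay
    geometrically from `top_lr` at the top layer downward.
    """
    rates = []
    for layer in range(num_layers):
        if layer < frozen:
            rates.append(0.0)
        else:
            rates.append(top_lr * decay ** (num_layers - 1 - layer))
    return rates


# Hypothetical configuration for a 12-layer encoder.
rates = layerwise_learning_rates(num_layers=12, top_lr=2e-5, decay=0.85, frozen=2)
```

In a deep learning framework, each rate would typically be attached to the parameters of the corresponding layer as a separate optimizer parameter group.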

4.5. Experimental Results

4.5.1. Experimental Results on Financial Sentiment Analysis

Results demonstrate that the proposed PLTA-FinBERT framework effectively enhances model performance in financial sentiment analysis. As shown in Table 4, PLTA-FinBERT outperforms the baseline FinBERT model across all evaluation metrics. All experimental results for PLTA-FinBERT are derived from five independent runs with random seeds set to 42, 123, 456, 789, and 999 to eliminate potential impacts of random initialization.

From an evolutionary perspective, the experimental results reveal a clear progressive pattern. The BERT baseline achieves a 21.2% higher F1-score than LSTM + GLoVe, demonstrating the superiority of transformer architectures. After domain adaptation, FinBERT shows a 4% improvement in recall over BERT, validating the value of financial domain pre-training. The PLTA-FinBERT framework further improves performance by 2.87% over FinBERT, completing the technical evolution path of “general pre-training → domain adaptation → self-learning optimization”. XLNet’s intermediate performance suggests that while autoregressive pre-training enhances semantic understanding, combining adaptation with dynamic self-learning outperforms pure architecture-based approaches. The failures of DistilBERT and Fin-R1 underscore the need to balance efficiency against domain specificity in financial NLP.

To illustrate how performance evolves during adaptation, we plot test-set accuracy against training iterations in Figure 3a. Each iteration strictly follows the workflow of Algorithm 1 and processes one input sample: multi-perturbation pseudo-label generation, confidence threshold filtering, model parameter update, and a subsequent evaluation on the full test set to obtain the performance metric for that iteration. The accuracy exhibits an overall upward trend with increasing iterations and eventually stabilizes. The AGF reaches 99%, indicating that the proposed method delivers performance gains and exhibits strong stability.

The confusion matrix (Figure 4) demonstrates that the model achieves 89% and 91% accuracy for positive and negative sentiment classes, respectively, indicating robust discriminative capability for polar emotion categories. In contrast, neutral sentiment recognition exhibits relatively weaker performance, with misclassification rates of 17% (as negative) and 5% (as positive). This performance gap likely stems from the inherent semantic ambiguity of neutral texts, which lack clear emotional polarity and thus pose greater challenges for boundary discrimination.

4.5.2. Experimental Results on FiQA-SA

The experimental results in Table 5 show that FinBERT achieves significant performance improvements under our proposed PLTA method. Specifically, the model reduces MSE from 0.07 to 0.06, representing a 14.3% relative reduction, while improving the R² from 0.55 to 0.58, establishing a new state-of-the-art performance.
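These figures follow directly from the reported values, as the short calculation below shows:

$$\frac{0.07 - 0.06}{0.07} \approx 0.143 = 14.3\%, \qquad 0.58 - 0.55 = 0.03$$

The MSE change is therefore a relative reduction, whereas the R² change is an absolute gain of 0.03.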

As shown in the R² evolution curve (Figure 3b), the R² exhibits rapid growth during early training iterations, indicating substantial performance gains in the initial phase. The metric stabilizes after approximately 150 iterations, demonstrating the convergence efficiency of the framework.

4.5.3. Hyperparameter Analysis

In the classification task, the number of perturbations serves as a key hyperparameter. We investigate its impact on model performance within the range of 6 to 10 perturbations (Δ = 2). As illustrated in Figure 5a, the model’s prediction stability metric (AGF) demonstrates a significant and steady improvement with increasing perturbation counts, while the accuracy remains relatively stable. These findings indicate that augmenting the number of perturbations within a certain range can enhance the stability of model predictions, yet provides limited improvement to the optimal values of classification accuracy.

In the regression task, the standard deviation threshold serves as the primary hyperparameter. Using thresholds ranging from 0.10 to 0.25 (Δ = 0.05), we analyze the influence of threshold magnitude on model performance. The experimental results (Figure 5b) demonstrate a clear declining trend in model performance as the threshold increases: the AGF decreases from 0.972 to 0.762, while the R² drops from 0.583 to 0.517. This observation validates the theoretical expectation that more lenient threshold conditions lead to degraded pseudo-label quality, thereby adversely affecting model performance.

4.5.4. Dynamic Adaptability Analysis

The self-supervised learning framework proposed in this study endows the model with dynamic adaptation capabilities. To validate this characteristic, we present several representative examples in Table 6. These cases show that the model can effectively identify and correct initial prediction biases.

5. Conclusions

This study proposes a self-learning framework, PLTA-FinBERT. It demonstrates significant advantages in financial sentiment analysis through the integration of multi-perturbation prediction, confidence estimation and test-time dynamic adaptation. Experimental results show that for sentiment classification on the financial sentiment analysis dataset, our method achieves an accuracy of 82.88% and an F1-score of 79.92%, representing comprehensive improvements over the baseline FinBERT model. For sentiment intensity prediction on the FiQA-SA dataset, it obtains an MSE of 0.06 and an R² of 0.58, outperforming existing state-of-the-art methods. The framework achieves AGFs of 99% and 97% for classification and regression tasks, respectively, confirming its ability to enhance model performance while maintaining strong stability without requiring additional labeled data.

It should be noted that the quality of pseudo-label generation in the current framework remains dependent on the base model’s capabilities. This study focuses on tasks in the financial domain. Future research will further optimize the pseudo-label generation and test-time adaptation strategies, and explore the performance of the framework on classification and regression tasks in other domains, so as to fully demonstrate its universal adaptive value.

Author Contributions

Conceptualization, H.Y. and H.C.; methodology, H.Y. and H.C.; formal analysis, C.J., P.L. and J.H.; writing—original draft preparation, H.Y. and H.C.; writing—review and editing, All authors; visualization, C.J.; supervision, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available: Financial sentiment analysis dataset from https://huggingface.co/datasets/mltrev23/financial-sentiment-analysis (accessed on 15 September 2025) and FiQA-SA dataset from https://huggingface.co/datasets/dohonba/fiqa-2018 (accessed on 15 September 2025).

Conflicts of Interest

Author Chang Jiang was employed by the company Knorr-Bremse Commercial Vehicle Systems (Shanghai) Co., Ltd. Suzhou Branch. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Manaka, T.; Zyl, T.V.; Kar, D.; Wade, A. Multi-step transfer learning in natural language processing for the health domain. Neural Process. Lett. 2024, 56, 177.
2. Sheik, R.; Sundara, K.P.S.; Nirmala, S.J. Neural data augmentation for legal overruling task: Small deep learning models vs. large language models. Neural Process. Lett. 2024, 56, 121.
3. Meng, Z.; Cai, Z.; Feng, J.; Ma, H.; Zhang, H.; Li, S. Braille Character Segmentation Algorithm Based on Gaussian Diffusion. Comput. Mater. Contin. 2024, 79, 1143–1159.
4. Ranjan, R.; Sharma, K.; Kumar, A. Introduction to NLP in Finance: Sentiment Analysis and Risk Management. In Transformative Natural Language Processing: Bridging Ambiguity in Healthcare, Legal, and Financial Applications; Springer: Cham, Switzerland, 2025; pp. 75–100.
5. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186.
6. Araci, D. FinBERT: Financial sentiment analysis with pre-trained language models. arXiv 2019, arXiv:1908.10063.
7. Guo, Y.; Hu, C.; Yang, Y. Predict the future from the past? On the temporal data distribution shift in financial sentiment classifications. arXiv 2023, arXiv:2310.12620.
8. Daudert, T. A multi-source entity-level sentiment corpus for the financial domain: The FinLin corpus. Lang. Resour. Eval. 2022, 56, 333–356.
9. Rubtsova, Y. Reducing the deterioration of sentiment analysis results due to the time impact. Information 2018, 9, 184.
10. Sedinkina, M.; Breitkopf, N.; Schütze, H. Automatic domain adaptation outperforms manual domain adaptation for predicting financial outcomes. arXiv 2020, arXiv:2006.14209.
11. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2020; pp. 8342–8360.
12. Sohangir, S.; Petty, N.; Wang, D. Financial sentiment lexicon analysis. In Proceedings of the 2018 IEEE 12th International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, 31 January–2 February 2018; IEEE: New York, NY, USA, 2018; pp. 286–289.
13. Li, G.; Lin, Z.; Wang, H.; Wei, X. A discriminative approach to sentiment classification. Neural Process. Lett. 2020, 51, 749–758.
14. Loughran, T.; McDonald, B. Textual analysis in accounting and finance: A survey. J. Account. Res. 2016, 54, 1187–1230.
15. Renault, T. Sentiment analysis and machine learning in finance: A comparison of methods and models on one million messages. Digit. Financ. 2020, 2, 1–13.
16. Malo, P.; Sinha, A.; Korhonen, P.; Wallenius, J.; Takala, P. Good debt or bad debt: Detecting semantic orientations in economic texts. J. Assoc. Inf. Sci. Technol. 2014, 65, 782–796.
17. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Minneapolis, MN, USA, 2016; pp. 1480–1489.
18. Johnson, E.; Nasir, W.; Smith, C. Contrastive Learning-Based Sentiment Analysis. Preprints 2024.
19. Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. BloombergGPT: A large language model for finance. arXiv 2023, arXiv:2303.17564.
20. Wang, N.; Yang, H.; Wang, C.D. FinGPT: Instruction tuning benchmark for open-source large language models in financial datasets. arXiv 2023, arXiv:2310.04793.
21. Konstantinidis, T.; Iacovides, G.; Xu, M.; Constantinides, T.G.; Mandic, D. FinLlama: Financial sentiment classification for algorithmic trading applications. arXiv 2024, arXiv:2403.12285.
22. Sun, Y.; Wang, X.; Liu, Z.; Miller, J.; Efros, A.; Hardt, M. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. In Proceedings of the 37th International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2020; pp. 9229–9248.
23. He, H.; Hosseini, M.S.; Wang, Y. PathTTT: Test-Time Training with Meta-auxiliary Learning for Pathology Image Classification. In Proceedings of the International Conference on Information Processing in Medical Imaging, Kos, Greece, 25–30 May 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 33–46.
24. Goyal, S.; Sun, M.; Raghunathan, A.; Kolter, J.Z. Test time adaptation via conjugate pseudo-labels. In Advances in Neural Information Processing Systems; NeurIPS Foundation: San Diego, CA, USA, 2022; Volume 35, pp. 6204–6218.
25. Banerjee, P.; Gokhale, T.; Baral, C. Self-supervised test-time learning for reading comprehension. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Minneapolis, MN, USA, 2021; pp. 1200–1211.
26. Maia, M.; Handschuh, S.; Freitas, A.; Davis, B.; McDermott, R.; Zarrouk, M.; Balahur, A. WWW’18 open challenge: Financial opinion mining and question answering. In Companion Proceedings of the Web Conference 2018, Lyon, France, 23–27 April 2018; International World Wide Web Conferences Steering Committee: Geneva, Switzerland, 2018; pp. 1941–1942.
27. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Minneapolis, MN, USA, 2014; pp. 1532–1543.
28. Liu, Z.; Guo, X.; Lou, F.; Zeng, L.; Niu, J.; Wang, Z.; Xu, J.; Cai, W.; Yang, Z.; Zhao, X.; et al. Fin-R1: A large language model for financial reasoning through reinforcement learning. arXiv 2025, arXiv:2503.16252.
29. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems; NeurIPS Foundation: San Diego, CA, USA, 2019; Volume 32.
30. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
31. Yang, S.; Rosenfeld, J.; Makutonin, J. Financial aspect-based sentiment analysis using deep representations. arXiv 2018, arXiv:1808.07931.
32. Piao, G.; Breslin, J.G. Financial aspect and sentiment predictions with deep neural networks: An ensemble approach. In Companion Proceedings of the Web Conference 2018, Lyon, France, 23–27 April 2018; International World Wide Web Conferences Steering Committee: Lyon, France, 2018; pp. 1973–1977.

Figure 1. Pseudo-label generation-based Test-Time Adaptation Framework.

Figure 2. The label value distribution in the FiQA-SA dataset. Peaks occur in the 0.3–0.5 score range.

Figure 3. Evolution curves. (a) Accuracy evolution curve on the financial sentiment analysis dataset. (b) $R^{2}$ evolution curve on the FiQA-SA dataset.

Figure 4. Confusion matrix.

Figure 5. Effect of Hyperparameter Variations on Model Performance. (a) Impact of varying prediction times on accuracy and AGFs on the financial sentiment analysis dataset. (b) Impact of varying thresholds on $R^{2}$ and AGFs on the FiQA-SA dataset. Other parameters remain fixed as described in Table 3.

Table 1. Key statistics of the financial sentiment dataset.

Feature Category	Statistical Value
Total Samples	5840
Time Coverage	2006–2022
Text Type Distribution
Traditional Financial News	72.6%
Social Media Texts	12.7%
Emerging Field Texts	14.7%
Sentiment Distribution
Positive	28%
Negative	12%
Neutral	59%

Table 2. Evaluation metrics.

Task Type	Evaluation Metrics
Classification Task	Accuracy; Macro-average F1
Regression Task	MSE (Mean Squared Error); $R^{2}$ (Coefficient of Determination)
Method Stability	AGF (Adaptive Gain Frequency)
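
As a reading aid for Table 2, the following is a minimal sketch of the four standard metrics using their textbook definitions. The function names are illustrative and not taken from the paper, and AGF is omitted because its definition is not given in this table.

```python
# Textbook definitions of the Table 2 metrics (illustrative helper names).

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def mse(y_true, y_pred):
    """Mean squared error between true and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

if __name__ == "__main__":
    labels_true = ["positive", "negative", "neutral", "neutral"]
    labels_pred = ["positive", "neutral", "neutral", "neutral"]
    print(accuracy(labels_true, labels_pred), macro_f1(labels_true, labels_pred))
    scores_true = [0.0, 0.5, 1.0]
    scores_pred = [0.0, 0.5, 0.5]
    print(mse(scores_true, scores_pred), r2(scores_true, scores_pred))
```

Macro averaging weights each class equally, which matters for the class imbalance reported in Table 1, where neutral samples make up the majority.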

Table 3. Experimental parameter settings.

Parameter	Value
Optimizer	AdamW
Max sequence length	64 tokens
Batch size	1
Number of perturbed variants per sample	10
Regression task:
Learning rate	$48 \times 10^{-7}$
Variance threshold ($\tau_{r}$)	0.1
Noise weight	0.1
Classification task:
Learning rate	$5 \times 10^{-6}$
Modal confidence threshold ($\tau_{c}$)	0.8
Noise weight	0.1
Computing device	NVIDIA RTX 3090 GPU (24 GB VRAM)
Random seed	42
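
To illustrate how the two thresholds in Table 3 could act as filters over the 10 perturbed predictions per sample, here is a minimal sketch. It rests on assumptions not confirmed by this table: that the modal confidence is the vote share of the most frequent class, that a classification pseudo-label is accepted when that share is at least $\tau_{c}$, and that a regression pseudo-label (the mean prediction) is accepted when the population variance of the predictions is at most $\tau_{r}$. The function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance

TAU_C = 0.8  # modal confidence threshold from Table 3
TAU_R = 0.1  # variance threshold from Table 3

def classification_pseudo_label(votes, tau_c=TAU_C):
    """Return the majority label if its vote share reaches tau_c, else None.

    Assumption: 'modal confidence' is the fraction of perturbed variants
    that agree with the most frequent predicted class.
    """
    label, count = Counter(votes).most_common(1)[0]
    confidence = count / len(votes)
    return label if confidence >= tau_c else None

def regression_pseudo_label(preds, tau_r=TAU_R):
    """Return the mean prediction if the predictions are stable, else None.

    Assumption: stability is measured by the population variance of the
    predictions over the perturbed variants.
    """
    return mean(preds) if pvariance(preds) <= tau_r else None

if __name__ == "__main__":
    # 8 of 10 perturbed variants agree, so the label passes the filter.
    print(classification_pseudo_label(["neutral"] * 8 + ["positive"] * 2))
    # Only 7 of 10 agree, so the sample is skipped.
    print(classification_pseudo_label(["neutral"] * 7 + ["positive"] * 3))
    # Widely scattered scores are rejected.
    print(regression_pseudo_label([-1.0, 1.0]))
```

Samples that return `None` would simply not contribute to the test-time update, so only high-agreement predictions drive the parameter changes.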

Table 4. Experimental results on the financial sentiment analysis dataset.

Model	Accuracy	Precision	Recall	F1 Score
LSTM + GloVe	0.7086	0.6209	0.5713	0.5708
Fin-R1	0.7577	0.7279	0.7537	0.7274
XLNet	0.8033	0.8209	0.8033	0.8093
DistilBERT	0.6758	0.6002	0.6758	0.6326
BERT	0.7905	0.7809	0.7905	0.7835
FinBERT	0.8051	0.7700	0.8300	0.7700
PLTA-FinBERT (Ours)	$0.8288 \pm 0.008$	$0.7839 \pm 0.009$	$0.8587 \pm 0.007$	$0.7992 \pm 0.008$

Table 5. Results on the FiQA-SA dataset.

Model	MSE	$R^{2}$
Yang et al. [31] ¹	0.08	0.40
Piao and Breslin [32] ¹	0.09	0.41
PLTA-FinBERT	$0.06 \pm 0.009$	$0.58 \pm 0.018$

Bold face indicates the best result for each metric. ¹ Yang et al. [31] and Piao and Breslin [32] report results on the official test set. All PLTA-FinBERT results are from five independent runs with different random seeds to mitigate random initialization effects.

Table 6. Case examples of model self-correction.

Example Text	True Label	Initial Prediction	Corrected Prediction
“The Vaisala Group is a successful international technology company that develops, manufactures and markets electronic measurement systems and products.”	Positive	Neutral	Positive
“Operating loss totalled EUR 0.3 mn, down from a profit of EUR 5.1 mn in the first half of 2009.”	Neutral	Negative	Neutral
“My $DWA play up 6% today. I’m still skeptical. Will take profits. Not a time cheer.”	Neutral	Positive	Neutral

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, H.; Chen, H.; Jiang, C.; He, J.; Li, P. PLTA-FinBERT: Pseudo-Label Generation-Based Test-Time Adaptation for Financial Sentiment Analysis. Big Data Cogn. Comput. 2026, 10, 59. https://doi.org/10.3390/bdcc10020059

