6.1. Experimental Setup
6.1.1. Dataset Description
We constructed a multimodal dataset consisting of historical stock prices and financial news headlines for a selected subset of S&P 500 companies over the period from January 2018 to December 2023. Daily stock market data, including open, high, low, close (OHLC) prices and trading volume, were obtained from Yahoo Finance (https://finance.yahoo.com (accessed on 10 August 2025)) using the yfinance Python library (https://github.com/ranaroussi/yfinance (accessed on 10 August 2025)). Financial news headlines were aggregated from various reputable sources (e.g., Reuters, Bloomberg, MarketWatch) via public APIs, licensed news aggregators (e.g., RavenPack, News API), and commercial financial datasets, with all news data timestamped and aligned to market trading hours to preserve temporal causality. For reproducibility and benchmarking, all data used in this study can be found at https://github.com/Zdong104/FNSPID (accessed on 10 August 2025).
For each stock on trading day t, we aligned price and volume features with sentiment signals extracted from all headlines published within a 24 h window preceding market close on t. News appearing after 4:00 p.m. EST was assigned to the next trading day to preserve temporal causality.
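The cut-off rule described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `assign_trading_day`, the use of naive timestamps already expressed in exchange local time, and the choice to roll weekend and holiday headlines forward to the next session are all assumptions made for the example.

```python
from bisect import bisect_left
from datetime import date, datetime, time, timedelta
from typing import List, Optional

MARKET_CLOSE = time(16, 0)  # 4:00 p.m. exchange local time (assumed)

def assign_trading_day(published: datetime,
                       trading_days: List[date]) -> Optional[date]:
    """Map a headline timestamp to the trading day whose close it precedes.

    `trading_days` must be sorted ascending. Headlines strictly after the
    close move to the following calendar day, and any non-trading day is
    rolled forward to the next session (an assumption of this sketch).
    Returns None if the calendar does not extend far enough.
    """
    candidate = published.date()
    if published.time() > MARKET_CLOSE:
        candidate += timedelta(days=1)
    idx = bisect_left(trading_days, candidate)
    return trading_days[idx] if idx < len(trading_days) else None

# Hypothetical three-session calendar: Friday, Monday, Tuesday.
calendar = [date(2021, 1, 8), date(2021, 1, 11), date(2021, 1, 12)]
same_day = assign_trading_day(datetime(2021, 1, 8, 15, 30), calendar)
after_close = assign_trading_day(datetime(2021, 1, 8, 16, 30), calendar)
```

Because a headline is never attached to a session that closed before it was published, the sentiment features for day t cannot contain information from after the close on t.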
We note that our focus on S&P 500 large-cap stocks was motivated by the availability of reliable, high-frequency news coverage, which is necessary for consistent headline-based sentiment modeling. This choice ensured sufficient textual data density across firms and time, enabling meaningful signal extraction and robust evaluation. However, we acknowledge that this introduced a media exposure bias, as large-cap firms typically receive more consistent and timely coverage compared to mid- or small-cap stocks. In lower-visibility contexts, model performance may degrade due to sparser or noisier sentiment signals, and this remains an important limitation on generalization. Nonetheless, the use of the S&P 500 is justified by its liquidity, representativeness of major market sectors, and wide acceptance as a benchmarking index in both academic and industry settings.
6.1.2. Label Definition
The final input feature vector $\mathbf{x}_t$ for each trading day $t$ was constructed by concatenating multiple feature groups that capture both quantitative market behavior and textual sentiment signals. These include the following: (i) technical indicators such as $k$-day moving averages, the relative strength index (RSI), moving average convergence divergence (MACD), and volatility estimators (e.g., rolling standard deviation); (ii) FinBERT-derived sentiment features computed from daily financial news, including the average sentiment score, maximum sentiment, and sentiment dispersion (standard deviation); and (iii) lagged price-based features, specifically the log returns over the past five trading days. To ensure numerical stability and comparability across features, all inputs were standardized using z-score normalization:
$$\tilde{x}_{t,j} = \frac{x_{t,j} - \mu_j}{\sigma_j},$$
where $\mu_j$ and $\sigma_j$ denote the sample mean and standard deviation of feature $j$ computed from the training set. This normalization was applied independently to each feature dimension, preserving temporal integrity and preventing information leakage.
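The leakage-free standardization described here can be sketched as below. The helper names `fit_zscore` and `apply_zscore` are hypothetical, and the guard for constant features is an assumption added for the example; the key point is that the statistics come from the training rows only and are then reused unchanged on later data.

```python
import numpy as np

def fit_zscore(X_train: np.ndarray):
    """Estimate per-feature mean and sample standard deviation (ddof=1)
    from the training rows only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0, ddof=1)
    # Assumption of this sketch: leave constant features unscaled
    # instead of dividing by zero.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return mu, sigma

def apply_zscore(X: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Standardize any split with statistics fitted on the training split."""
    return (X - mu) / sigma

# Toy data: 3 training days and 1 test day, 2 features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])
mu, sigma = fit_zscore(X_train)
Z_train = apply_zscore(X_train, mu, sigma)
Z_test = apply_zscore(X_test, mu, sigma)
```

In a rolling-window setting the statistics would be refitted on each fold's training window, so that no test-period values influence the scaling.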
6.1.3. Model Configuration
We trained an XGBoost classifier with a logistic loss function and hyperparameters selected via cross-validation on the training set. The final configuration included a maximum tree depth of 6, a learning rate of 0.05, 300 boosting rounds, a subsample ratio of 0.8 for each boosting iteration, and a column subsample ratio of 0.8 at the tree level. To ensure temporal integrity and avoid lookahead bias, we adopted a rolling-window evaluation strategy. Specifically, for each evaluation fold, the model was trained on a historical window and evaluated on a subsequent window, thereby preserving chronological order and simulating a realistic forecasting scenario.
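As an illustration, the stated hyperparameters can be written with the parameter names of XGBoost's scikit-learn interface, and the rolling-window scheme can be sketched as a generator over month labels. The window lengths, the step size, and the name `rolling_windows` are assumptions for the example; this paragraph does not fix them, so the fold count below need not match the paper's.

```python
from typing import Iterator, List, Tuple

# Reported configuration, expressed as keyword arguments that could be
# passed to xgboost.XGBClassifier (not imported here, so the sketch runs
# without the library installed).
XGB_PARAMS = {
    "objective": "binary:logistic",
    "max_depth": 6,
    "learning_rate": 0.05,
    "n_estimators": 300,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}

def rolling_windows(periods: List[str], train_len: int, test_len: int,
                    step: int) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, test) slices of chronologically ordered periods.

    Each test slice starts right after its training slice, so no fold
    ever trains on data later than what it is evaluated on.
    """
    start = 0
    while start + train_len + test_len <= len(periods):
        split = start + train_len
        yield periods[start:split], periods[split:split + test_len]
        start += step

# Month labels for January 2018 to December 2023.
months = [f"{y}-{m:02d}" for y in range(2018, 2024) for m in range(1, 13)]
# Hypothetical setting: 24 training months, 6 test months, 6-month step.
folds = list(rolling_windows(months, train_len=24, test_len=6, step=6))
```

Each fold would fit a fresh classifier on its training months and score it on the following test months; sliding rather than expanding the training window is a design choice of this sketch.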
6.3. Results
6.3.1. Predictive Performance
To assess the efficacy of incorporating sentiment features—particularly those extracted using FinBERT—we evaluated the predictive performance of five model variants across a six-year dataset spanning January 2018 to December 2023. We used five rolling test windows, each covering six months of held-out data, with training performed on the preceding 24 months.
Table 1 reports the mean scores across windows.
As shown in Table 1, the addition of FinBERT sentiment features (T + F) leads to a consistent and statistically significant improvement across all performance metrics. The model achieves a relative gain of 12.6% in AUC over the technical-only baseline, and a 26.3% increase in average simulated PnL. The reported PnL figures assume idealized execution and do not account for transaction costs, bid-ask spreads, slippage, or market impact; these simplifications may overstate real-world profitability and are used solely for model comparison under controlled conditions. Precision improves from 0.601 to 0.678, indicating a higher proportion of correctly predicted upward movements, which is particularly valuable in directional trading strategies.
The model that integrates only VADER sentiment (T + V) yields improvements over the technical baseline, but performs notably worse than T + F in all metrics, highlighting the limitations of lexicon-based sentiment in financial contexts. Furthermore, the momentum-only variant (T + M) performs marginally better than T, suggesting that traditional trend-following indicators capture some temporal dependencies, but lack the broader informational context provided by news sentiment.
We performed paired t-tests on AUC and F1 across the five test periods. Improvements of T + F over both T and T + V are statistically significant at the 99% confidence level (p < 0.01), confirming the robustness of FinBERT sentiment features across multiple temporal regimes and stocks.
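For reference, a paired t-test over five matched test periods has four degrees of freedom, and the two-sided critical value at p = 0.01 is approximately 4.604. The sketch below computes the statistic with the standard library only. The AUC values are made up purely for illustration and are not taken from Table 1, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics
from typing import Sequence

# Two-sided critical value of Student's t for df = 4 at alpha = 0.01.
T_CRIT_DF4_ALPHA_01 = 4.604

def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t = mean(d) / (sd(d) / sqrt(n)) for paired differences d = a - b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))

# Illustrative per-window AUCs (invented numbers, one value per test window).
auc_model_a = [0.70, 0.72, 0.71, 0.73, 0.74]
auc_model_b = [0.64, 0.65, 0.66, 0.66, 0.68]
t_stat = paired_t_statistic(auc_model_a, auc_model_b)
significant = abs(t_stat) > T_CRIT_DF4_ALPHA_01
```

With only five pairs the test has limited power and leans on the normality of the differences, so a library routine that also returns the exact p-value would be a natural replacement for this hand-rolled version.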
6.3.2. Ablation Findings
To further investigate the contribution of each feature group, we conducted controlled ablation experiments, wherein subsets of the input feature space were selectively removed from the T + F configuration. Table 2 maps each feature group to the prior literature. The ablation results are included in Table 1 under rows T + F − V (FinBERT without volatility) and T + F − M (FinBERT without momentum).
Removing volatility-based features caused a decrease of 1.8% in AUC and a 0.7% reduction in average precision. This reflects the sensitivity of short-term price movement to market uncertainty, which is not fully encoded in price trends or sentiment. Removing momentum indicators had a slightly more pronounced effect, reducing PnL by over 1% and AUC by 2.5 points, suggesting their non-trivial interaction with FinBERT sentiment during trend reversals or sideways markets.
Interestingly, the model trained without both momentum and volatility features (not shown in table) still outperformed all baselines except T + F, with an AUC of 0.707 and PnL of 6.05%, illustrating the dominant role of FinBERT-enhanced sentiment features in the full-stack architecture.
6.3.3. Feature Group Contributions Assessed via SHAP Analysis
To rigorously evaluate the relative importance of different feature categories, we applied a global SHAP analysis over the test set. For each sample, the SHAP values quantified the marginal contribution of each feature to the model’s output. We aggregated the absolute SHAP values by feature group and normalized them to obtain a global importance distribution.
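The aggregation step described above can be sketched as follows, assuming the SHAP values are already available as a samples-by-features matrix. The function name `group_importance`, the group labels, and the toy numbers are assumptions for the example; averaging the absolute values over samples before summing within groups yields the same shares as summing them directly.

```python
import numpy as np
from typing import Dict, Sequence

def group_importance(shap_values: np.ndarray,
                     feature_groups: Sequence[str]) -> Dict[str, float]:
    """Share of total mean |SHAP| attributed to each feature group.

    shap_values: array of shape (n_samples, n_features).
    feature_groups: one group label per feature column.
    """
    mean_abs = np.abs(shap_values).mean(axis=0)  # per-feature importance
    totals: Dict[str, float] = {}
    for value, group in zip(mean_abs, feature_groups):
        totals[group] = totals.get(group, 0.0) + float(value)
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("all SHAP values are zero")
    return {group: value / grand_total for group, value in totals.items()}

# Toy matrix: 2 samples, 3 features (two sentiment, one volatility).
toy_shap = np.array([[0.2, -0.1, 0.1],
                     [-0.4, 0.1, 0.1]])
toy_groups = ["sentiment", "volatility", "sentiment"]
shares = group_importance(toy_shap, toy_groups)
```

The shares sum to one by construction, so they can be read directly as percentages of total attribution.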
Table 3 reports the percentage of the total SHAP contribution attributed to each group. FinBERT-based sentiment features—including the mean, maximum, and standard deviation of daily sentiment scores—emerge as the most influential group, accounting for nearly 29% of the total model attribution. Volatility-related indicators and momentum signals rank next, while raw price returns and volume statistics show lower but non-negligible importance.
To better visualize the relative ranking, we present a horizontal bar chart in Figure 3. This figure clearly shows that FinBERT-enhanced sentiment features are the dominant driver of predictive behavior in our model.
The alignment between SHAP-based importance and ablation results offers model-agnostic validation of feature utility. Furthermore, SHAP provides interpretable, post hoc explanations that are essential for risk-aware deployment in finance—an industry governed by auditability and transparency constraints.
6.3.4. Market Regime Stability
To assess the robustness and adaptability of each model under varying economic conditions, we evaluated predictive performance across distinct market regimes spanning the 2018–2023 period. Specifically, we partitioned the test data into three macro regimes:
Volatile Regime (February–April 2020): Marked by the onset of the COVID-19 pandemic and rapid equity market drawdowns.
Bullish Regime (May 2020–December 2021): Characterized by a prolonged market recovery and strong upward trends.
Stagnant/Sideways Regime (January–December 2022): Defined by high inflation, monetary tightening, and low directional bias in price movement.
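The three-way partition above can be sketched as a simple date lookup. The text gives regime boundaries by month only, so the exact first and last calendar days, the short labels, and the function name `regime_of` are assumptions for the example.

```python
from datetime import date
from typing import Optional

# (label, first day, last day); day-level boundaries are assumed.
REGIMES = [
    ("volatile", date(2020, 2, 1), date(2020, 4, 30)),
    ("bullish", date(2020, 5, 1), date(2021, 12, 31)),
    ("stagnant", date(2022, 1, 1), date(2022, 12, 31)),
]

def regime_of(day: date) -> Optional[str]:
    """Return the regime label for a trading day, or None if the day
    falls outside all three regimes (e.g., 2018-2019 or 2023)."""
    for label, start, end in REGIMES:
        if start <= day <= end:
            return label
    return None

example_labels = [regime_of(d) for d in
                  (date(2020, 3, 16), date(2021, 6, 1), date(2022, 9, 30))]
```

Test-set rows can then be grouped by this label before computing per-regime AUC and PnL.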
Table 4 presents the regime-specific AUC scores and average simulated profit-and-loss (PnL) percentages for the three primary models: technical-only (T), technical with VADER (T + V), and technical with FinBERT (T + F).
The results reveal that the FinBERT-enhanced model (T + F) maintained high performance across all regimes, consistently achieving AUC scores above 0.72 and generating significantly higher PnL than the baselines. Notably, during the volatile COVID-19 period, the T + F model achieved a 4.9% average return, versus 1.2% and 2.0% for the technical-only and VADER models, respectively. This suggests that sentiment extracted from FinBERT effectively captures investor fear and news-driven risk signals that are not present in price-based features.
During the bullish recovery phase of 2020–2021, sentiment signals remained predictive, likely reflecting optimistic language in earnings reports and macroeconomic news. The T + F model achieved an AUC of 0.748 and a simulated PnL of 7.8%, outperforming the T + V baseline by over 3.5 percentage points.
In the stagnant regime of 2022, where traditional trend-following indicators tended to degrade in efficacy, the T + F model continued to deliver strong performance (AUC: 0.723), while the T and T + V models degraded to near-random classification levels (AUC: 0.593 and 0.614, respectively). This demonstrates that FinBERT sentiment features contribute not only discriminative power, but also resilience across structurally different market environments.
These results highlight that FinBERT-derived features generalize well under regime shifts, capturing latent behavioral and emotional cues that are difficult to model with price signals alone. In contrast, lexicon-based sentiment (T + V) improves modestly over T, but fails to deliver consistent gains during turbulent or directionless markets. Thus, FinBERT sentiment offers both predictive value and temporal robustness, supporting its integration into production-grade forecasting systems.
6.3.5. SHAP-Based Interpretation
To understand the inner decision-making process of our predictive model, we conducted a comprehensive interpretability analysis using TreeSHAP [28]. SHAP (SHapley Additive exPlanations) assigns each feature an additive contribution to the model’s output for a specific instance, allowing for both local and global interpretability.
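The additive property can be written out explicitly. The notation below ($f$, $\phi_0$, $\phi_i$, $M$) is introduced here for illustration and does not appear in the original text. For a gradient-boosted classifier with logistic loss, the decomposition is commonly computed on the log-odds margin rather than on the probability itself, depending on how the explainer is configured.

```latex
% Local accuracy (additivity) of SHAP for an instance x with M features:
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x),
\qquad \phi_0 = \mathbb{E}\left[ f(X) \right]
% Global importance of feature i over N evaluated instances:
I_i = \frac{1}{N} \sum_{n=1}^{N} \left| \phi_i\left(x^{(n)}\right) \right|
```

Here $\phi_0$ is the baseline (expected) model output and $\phi_i(x)$ is the contribution of feature $i$ for instance $x$; the mean absolute contribution $I_i$ is one common way to obtain a global ranking from local values.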
We first present the global SHAP feature ranking in Table 5, listing the ten most influential features based on their average importance. FinBERT-derived features occupy three of the top five ranks, with the mean sentiment score contributing the most to prediction decisions. Volatility indicators (GARCH and rolling standard deviation) also play a critical role, reflecting the model’s sensitivity to market risk conditions. Classical technical indicators and price returns, while still relevant, exhibit lower average influence.
To illustrate how individual predictions were formed, we examined several SHAP force plots, which show how each feature contributes positively or negatively to the model’s output probability relative to the baseline prediction. One representative case occurred on the day following a major Q3 2021 earnings announcement for a large-cap technology company. The FinBERT-derived mean sentiment score was strongly positive, at +0.86, with low dispersion (standard deviation = 0.12), indicating agreement among news sources. The model assigned a prediction score of 0.92 (upward movement), significantly above the dataset baseline of 0.51. The SHAP force plot for this instance confirmed that FinBERT sentiment features and reduced volatility were the dominant positive contributors to the model’s confidence in a bullish signal.
We also analyzed interactions between features using SHAP dependence plots. A particularly strong interaction was found between FinBERT mean sentiment and GARCH volatility. The model showed higher confidence in predictions when sentiment was positive and volatility was low, but became more conservative when volatility was elevated—even in the presence of strong sentiment. This behavior suggests that the model implicitly adjusts sentiment influence based on market uncertainty, a desirable property for risk-sensitive financial forecasting.
To verify that the model does not rely disproportionately on any single feature or regime, we generated SHAP summary plots across different time periods. These plots confirmed a smooth, multi-feature contribution distribution and revealed that sentiment features became significantly more important during news-heavy periods (e.g., earnings season, macroeconomic announcements). In contrast, technical features became more relevant during low-news, trend-driven intervals.
Finally, the SHAP results closely mirror those from the ablation and performance analysis, creating a consistent and interpretable story. The combination of FinBERT-enhanced sentiment and volatility measures provides a reliable foundation for both predictive accuracy and model transparency. These findings validate the use of SHAP in financial machine learning, enabling practitioners and analysts to not only build performant models, but also justify their behavior to regulators, stakeholders, and auditors.
6.4. Robustness and Generalization
To evaluate the robustness and generalization capability of our proposed model, we conducted cross-sectional and temporal experiments across multiple equity assets and market regimes. Specifically, we tested the FinBERT-enhanced (T + F) model against the technical-only (T) and VADER-based (T + V) baselines on a diverse set of U.S. large-cap stocks: Apple (AAPL), Microsoft (MSFT), JPMorgan Chase (JPM), and Tesla (TSLA). These stocks were selected to represent a mix of sectors (technology, financials, and consumer cyclicals) and volatility profiles.
Each stock was evaluated over three key temporal partitions:
Pre-COVID Period: January 2018 to December 2019—relatively stable market with strong growth.
COVID Crash: February 2020 to April 2020—high volatility, systemic panic, sharp sell-offs.
Post-COVID Recovery: May 2020 to December 2022—prolonged rebound, rotation across sectors.
Table 6 reports the average AUC and F1-scores for each model–stock pair, along with the standard deviation (SD) of these metrics across regimes. The FinBERT-enhanced model exhibits not only superior average performance, but also lower variance, indicating consistent predictive ability across assets and market conditions.
Across all stocks, the FinBERT-enhanced model achieved the highest average AUC and F1-score, outperforming both the technical-only and VADER baselines by margins ranging from 6.8 to 9.5 percentage points in AUC. Notably, the standard deviation of AUC across market regimes was substantially lower for the T + F model (mean SD: 0.021) than for T (0.036) or T + V (0.032), suggesting greater temporal stability and resilience to structural shifts in market dynamics. While VADER performs well in general-purpose sentiment analysis, it often misinterprets financial terminology—treating terms like “depreciation” or “shortfall” as inherently negative—highlighting the need for domain-specific models like FinBERT that better capture the nuances of financial language.
We also computed Sharpe ratios of the model-generated signals (using daily returns from a hypothetical long-short strategy) across regimes. The T + F model yielded Sharpe ratios consistently above 1.5 in the post-COVID period, and maintained ratios above 1.0 even during the crash period, whereas the ratios for T and T + V often fell below 1.0, indicating noisier, less risk-adjusted signal quality.
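A minimal sketch of this calculation is given below. The text does not state the annualization convention, the risk-free rate, or how positions are formed, so the 252-day annualization, the zero risk-free rate, the +1/-1 position encoding, and both function names are assumptions for the example. Consistent with the idealized PnL setting, no transaction costs are modeled.

```python
import math
import statistics
from typing import List, Sequence

def strategy_returns(signals: Sequence[int],
                     asset_returns: Sequence[float]) -> List[float]:
    """Daily returns of a long-short rule: +1 = long, -1 = short.

    signals[i] must be decided before asset_returns[i] is realized.
    """
    if len(signals) != len(asset_returns):
        raise ValueError("signals and returns must have equal length")
    return [s * r for s, r in zip(signals, asset_returns)]

def sharpe_ratio(daily_returns: Sequence[float],
                 periods_per_year: int = 252,
                 risk_free_daily: float = 0.0) -> float:
    """Annualized Sharpe ratio: mean excess return over its sample
    standard deviation, scaled by sqrt(periods_per_year)."""
    excess = [r - risk_free_daily for r in daily_returns]
    sd = statistics.stdev(excess)
    if sd == 0:
        raise ValueError("returns have zero variance")
    return statistics.mean(excess) / sd * math.sqrt(periods_per_year)

# Toy example with invented numbers.
toy_returns = strategy_returns([1, -1, 1, 1], [0.01, 0.01, 0.02, 0.0])
toy_sharpe = sharpe_ratio(toy_returns)
```

A four-day sample is far too short for a meaningful estimate; the toy input only shows the mechanics.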
An analysis of error consistency revealed that T + F exhibited fewer false positives during high-volatility drawdowns and better recall during periods of trend reversals—likely due to FinBERT’s ability to encode forward-looking sentiment signals ahead of market realization.
From a generalization perspective, the model demonstrated minimal overfitting despite increased feature dimensionality, as evidenced by the narrow gap between in-sample and out-of-sample performance (mean difference in AUC across folds). This can be attributed to the regularization in the XGBoost framework and the relatively low correlation between FinBERT sentiment features and traditional indicators.
Overall, these findings confirm that FinBERT-enhanced sentiment features contribute not only predictive accuracy, but also robustness and generalization across both individual assets and temporal regimes. The model’s consistent performance across heterogeneous conditions makes it suitable for deployment in real-world, multi-asset trading systems, where stability and interpretability are paramount.
6.5. Privacy Protection Results
We report the model’s predictive performance under varying levels of differential privacy. The privacy mechanism follows DP-SGD with a fixed $\delta$ and varying privacy budgets $\varepsilon$. Here, $\varepsilon = \infty$ corresponds to the non-private model.
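For context, the core of a generic DP-SGD update is per-example gradient clipping followed by Gaussian noise. The sketch below shows that single step in NumPy. It is not the authors' training code: the function name `dp_sgd_gradient`, the clipping norm, and the noise multiplier are hypothetical, and the privacy accounting that converts a noise level into a privacy budget is omitted.

```python
import numpy as np

def dp_sgd_gradient(per_example_grads: np.ndarray, clip_norm: float,
                    noise_multiplier: float,
                    rng: np.random.Generator) -> np.ndarray:
    """One privatized gradient estimate.

    per_example_grads: shape (batch_size, n_params).
    1. Rescale each row so its L2 norm is at most clip_norm.
    2. Sum the clipped rows and add N(0, (noise_multiplier * clip_norm)^2)
       noise to every coordinate.
    3. Divide by the batch size.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0],    # norm 5.0, clipped down to norm 1.0
                  [0.3, 0.4]])   # norm 0.5, left unchanged
# A zero noise multiplier isolates the clipping step (no privacy).
no_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Larger noise multipliers give stronger privacy (smaller budgets) at the cost of noisier updates, which is the trade-off the reported metrics quantify.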
The results in Table 7 provide a quantitative reference for selecting appropriate privacy budgets in production settings. Higher privacy (smaller $\varepsilon$) leads to moderate decreases in all metrics. However, performance degradation remains bounded, with over 95% of the baseline AUC preserved for
.