7.2.1. Stage 1: Initial Comparison Between Baseline and Deep Learning Models
This stage presents a preliminary comparison between deep learning models and baseline classifiers under an initial prediction specification. Its goals are to establish a reference framework and to identify potential variations in prediction behavior among model classes.
Baseline Models
To create a baseline for further comparison, we start by assessing the predictive performance of two benchmark models. Logistic regression is a common linear probabilistic model in empirical finance, whereas the majority class classifier is a naïve reference that reflects the unconditional class distribution.
As expected, the majority class baseline produces an AUC of 0.50, indicating no discriminatory power. Logistic regression outperforms this benchmark, with an AUC marginally above random classification and minor improvements in balanced accuracy and F1-score. These findings suggest that linear models may capture only a small part of the underlying relationships, which motivates the investigation of more flexible modeling techniques.
Deep Learning Models
Building on the benchmark results presented in the Baseline Models subsection, we then assess deep learning architectures (GRU, LSTM, and CNN1D) to investigate their capacity to capture possible non-linear and temporal dependencies in the data.
The additional metrics used to assess model performance include AUC, precision–recall AUC, accuracy, balanced accuracy, F1-score, precision, recall, and the Matthews correlation coefficient (MCC). This allows for a more comprehensive comparison across different aspects of model performance.
Table 6 reports the out-of-sample predictive performance of all assessed models on the test set, including the benchmark classifiers (LogReg and Majority) and the deep learning architectures (GRU, LSTM, CNN1D). With an AUC of 0.669 and a PR-AUC of 0.645, GRU outperforms LSTM (AUC = 0.593) and CNN1D (AUC = 0.570) among the assessed models.
GRU also obtains the best accuracy (0.641) and balanced accuracy (0.593), indicating a more favorable trade-off between sensitivity and specificity under this initial specification.
Although LSTM has a better F1-score (0.496) and recall (0.475) than GRU (F1 = 0.368; recall = 0.243), its lower AUC suggests an inferior ranking ability. CNN1D delivers moderate performance on most evaluation metrics and lags behind GRU in this initial comparison.
In general, deep learning models perform better than the benchmark classifiers. The majority classifier produces an AUC of 0.500, which corresponds to random discrimination, whereas logistic regression achieves an AUC of 0.538. By contrast, GRU clearly outperforms these baselines in terms of discrimination.
These preliminary findings imply that using temporal structures and non-linear modeling could improve predictive accuracy in comparison to linear techniques, but more validation is needed in later phases.
Although the outcomes presented in
Table 6 offer a preliminary evaluation of out-of-sample prediction performance, they do not account for the statistical uncertainty that may result from temporal dependence in the data. Specifically, rolling input sequences and overlapping 12-month forward TSR horizons may induce serial correlation across observations.
Table 7 presents 95% block bootstrap confidence intervals for both AUC and PR-AUC, calculated using 12-month blocks, to address this issue and provide a more stringent evaluation framework. Resampling whole blocks rather than individual observations preserves the temporal structure of the data and yields uncertainty estimates that are more robust to serial correlation.
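The block bootstrap idea can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the function names (`roc_auc`, `block_bootstrap_auc_ci`), the moving-block resampling scheme, the percentile interval, the number of resamples, and the synthetic panel are all assumptions introduced here. Only the 12-month block length and the 95% level come from the text.

```python
import random

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def block_bootstrap_auc_ci(months, y_true, scores, block_len=12,
                           n_boot=300, alpha=0.05, seed=0):
    """Percentile CI for AUC from a moving-block bootstrap over calendar months."""
    rng = random.Random(seed)
    uniq = sorted(set(months))
    by_month = {m: [] for m in uniq}
    for i, m in enumerate(months):
        by_month[m].append(i)                  # keep all firms of a month together
    n_blocks = -(-len(uniq) // block_len)      # ceiling division
    max_start = max(len(uniq) - block_len, 0)  # last admissible block start
    stats = []
    while len(stats) < n_boot:
        idx = []
        for _ in range(n_blocks):
            start = rng.randint(0, max_start)
            for m in uniq[start:start + block_len]:
                idx.extend(by_month[m])
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < len(ys):  # AUC is undefined if a resample has one class only
            stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper

# Synthetic panel (NOT the paper's data): 36 months, 6 firms per month.
rng = random.Random(42)
months, labels, scores = [], [], []
for month in range(36):
    for _ in range(6):
        label = 1 if rng.random() < 0.45 else 0
        months.append(month)
        labels.append(label)
        scores.append(0.3 * label + rng.random())

point_auc = roc_auc(labels, scores)
ci_low, ci_high = block_bootstrap_auc_ci(months, labels, scores)
print(f"AUC = {point_auc:.3f}, 95% CI = [{ci_low:.3f}; {ci_high:.3f}]")
```

Resampling contiguous 12-month blocks keeps overlapping return horizons together, which is why the resulting interval is typically wider than one obtained by resampling observations independently.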
For the key performance indicators, the findings confirm that deep learning models perform better than the logistic regression baseline. The GRU model in particular performs best, with an AUC of 0.669 and a confidence interval of [0.545; 0.775] that lies entirely above 0.50. The logistic regression baseline, by contrast, has weaker and less reliable predictive power, with confidence intervals that partially overlap the random benchmark. Although deep learning models perform better overall, the reported intervals show a non-negligible degree of uncertainty.
Robustness and Stability Analysis
To assess the robustness of the reported results, a number of additional analyses are performed. First, classification performance is evaluated across different probability thresholds to verify that the observed patterns are not driven solely by a particular cutoff. Second, permutation testing shows that the reported AUC values are unlikely to have arisen by chance. Finally, a multi-seed analysis shows that the deep learning models behave fairly consistently across a range of random initializations.
Together, these results provide more proof of the consistency of the findings, but further validation is carried out in subsequent stages.
The permutation test results for the top-performing deep learning model (GRU) are shown in
Figure 4. The empirical null distribution derived from 300 random label permutations is centered around 0.50, with the majority of simulated AUC values falling between about 0.45 and 0.56, whereas the observed test AUC is 0.669.
With an empirical p-value below the 5% significance level, the observed AUC is located in the extreme right tail of the null distribution. This finding implies that random chance is unlikely to be the only factor influencing the model’s predictive performance.
These results provide additional evidence that the model’s discriminative power is resilient under the original specification.
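A label-permutation test of this kind can be sketched as follows. The sketch is illustrative only: the function names, the add-one correction in the p-value, and the synthetic labels and scores are assumptions made here, while the choice of 300 permutations mirrors the number reported in the text.

```python
import random

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_test_auc(y_true, scores, n_perm=300, seed=0):
    """Compare the observed AUC with AUCs obtained after shuffling the labels."""
    rng = random.Random(seed)
    observed = roc_auc(y_true, scores)
    shuffled = list(y_true)
    null_aucs = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between labels and scores
        null_aucs.append(roc_auc(shuffled, scores))
    # Add-one correction (an assumption here) keeps the p-value strictly positive.
    p_value = (1 + sum(a >= observed for a in null_aucs)) / (n_perm + 1)
    return observed, null_aucs, p_value

# Synthetic example (NOT the paper's data).
rng = random.Random(7)
labels = [1 if rng.random() < 0.45 else 0 for _ in range(200)]
scores = [0.3 * y + rng.random() for y in labels]

observed_auc, null_aucs, p_value = permutation_test_auc(labels, scores)
null_mean = sum(null_aucs) / len(null_aucs)
print(f"observed AUC = {observed_auc:.3f}, null mean = {null_mean:.3f}, p = {p_value:.4f}")
```

Because shuffling preserves the class proportions, the null distribution is centered near 0.50 regardless of class imbalance, which matches the behavior described for Figure 4.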
The stability of the GRU model under different random initialization is shown in
Figure 5. The test AUC has an average value of about 0.583 and ranges from 0.536 to 0.613. The spread across seeds is about 0.077 AUC points, with seed 7 showing the best performance (AUC = 0.613) and seed 77 showing the lowest (AUC = 0.536).
Despite this variation, all AUC values remain above the random benchmark of 0.50, suggesting that the predictive signal is not driven by a specific initialization. The GRU model shows a respectable level of stability and reproducibility under this initial setup, as reflected in the relatively small variation between seeds.
The ROC curves for each model on the test sample are shown in
Figure 6. With an AUC of 0.669, the GRU model outperforms CNN1D (0.570), LSTM (0.593), and the logistic regression baseline (0.538) in terms of overall discrimination.
The difference in performance is notable: GRU shows an AUC gain of 0.076 over LSTM and 0.131 over the logistic regression benchmark. Although CNN1D and LSTM also outperform the random benchmark (AUC = 0.500), their gains remain modest.
The GRU curve generally lies above the other models across most false positive rate levels, indicating a higher true positive rate at similar false positive levels. These results suggest that, under this initial specification, sequential architectures, particularly GRU, may provide better discriminative performance in predicting beta-adjusted shareholder value.
To complement the ROC-based evaluation,
Figure 7 presents the confusion matrices of the deep learning models and the logistic regression benchmark on the test sample.
The GRU model produces 15 false positives and 153 false negatives, while correctly identifying 251 true negatives and 49 true positives. This corresponds to a recall of about 24.3% (49/(49 + 153)) and a specificity of 94.4% (251/(251 + 15)), suggesting a strong ability to identify non-outperforming firms but a more limited ability to detect outperforming firms.
With a recall of 47.5% (96/(96 + 106)) and a specificity of 66.5% (177/(177 + 89)), the LSTM model shows better detection of positive cases, recognizing 96 true positives, but with 89 false positives. This may explain its more balanced F1-score compared to logistic regression and CNN1D.
With a recall of 21.8% and a specificity of 86.8%, CNN1D finds 44 true positives and 231 true negatives, along with 35 false positives and 158 false negatives.
Overall, these findings show that different models under this initial specification have different trade-offs between sensitivity and specificity.
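The recall and specificity figures quoted above follow directly from the confusion-matrix counts. The sketch below recomputes them, together with the other threshold-based metrics used in this study. The function name `confusion_metrics` is hypothetical; the counts are taken from the text, and the first set is assumed to belong to GRU because it reproduces the recall (0.243) and accuracy (0.641) reported for that model.

```python
from math import sqrt

def confusion_metrics(tp, fp, fn, tn):
    """Threshold-based metrics from raw counts (assumes no denominator is zero)."""
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }

# Counts quoted in the text for the Stage 1 test sample (468 observations).
gru = confusion_metrics(tp=49, fp=15, fn=153, tn=251)
lstm = confusion_metrics(tp=96, fp=89, fn=106, tn=177)
cnn1d = confusion_metrics(tp=44, fp=35, fn=158, tn=231)

for name, m in [("GRU", gru), ("LSTM", lstm), ("CNN1D", cnn1d)]:
    print(f"{name}: recall={m['recall']:.3f} specificity={m['specificity']:.3f} "
          f"accuracy={m['accuracy']:.3f} f1={m['f1']:.3f}")
```

Recomputing the metrics from the counts is a quick consistency check between the confusion matrices and the summary table.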
The GRU model’s F1-score progression over various probability thresholds is shown in
Figure 8. At a threshold of 0.10, the F1-score reaches its maximum value of roughly 0.56, suggesting that, under this initial specification, a relatively low cutoff produces the best balance between precision and recall.
The F1-score drops to about 0.37 at the traditional threshold of 0.50, which is a fall of almost 34% from its peak level. The dashed vertical lines in
Figure 8 indicate the threshold maximizing the F1-score (0.10) and the threshold selected using the Youden criterion (0.29), respectively. The F1-score stays relatively high at about 0.49 when the Youden threshold (0.29) is applied, indicating that minor threshold tweaks do not significantly affect model performance.
The F1-score decreases as the threshold rises above 0.70, reaching about 0.21 at 0.90, which is consistent with a clear decrease in recall.
Overall, performance remains relatively stable within the 0.10–0.40 threshold range.
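A threshold sweep of this kind can be sketched as follows. The helper names (`f1_and_youden`, `sweep_thresholds`) and the eight toy observations are assumptions made for illustration; they do not reproduce the values in Figure 8.

```python
def f1_and_youden(y_true, probs, threshold):
    """F1-score and Youden's J when predicting positive for prob >= threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, probs):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return f1, recall + specificity - 1.0

def sweep_thresholds(y_true, probs, thresholds):
    """Return all (threshold, F1, J) rows plus the F1-max and Youden thresholds."""
    rows = [(t, *f1_and_youden(y_true, probs, t)) for t in thresholds]
    best_f1_threshold = max(rows, key=lambda r: r[1])[0]
    best_j_threshold = max(rows, key=lambda r: r[2])[0]
    return rows, best_f1_threshold, best_j_threshold

# Toy labels and predicted probabilities (illustrative only).
labels = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.05, 0.15, 0.20, 0.30, 0.35, 0.60, 0.45, 0.80]
thresholds = [round(0.1 * k, 2) for k in range(1, 10)]

rows, best_f1_t, best_j_t = sweep_thresholds(labels, probs, thresholds)
for t, f1, j in rows:
    print(f"threshold={t:.1f}  F1={f1:.3f}  J={j:.3f}")
```

In this toy example the two criteria happen to select the same cutoff; on real data they can differ, as they do for the GRU model in this stage (0.10 versus 0.29).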
The GRU model’s precision–recall curve on the test sample is shown in
Figure 9, with an average precision (AP) of 0.645. This value is well above the no-skill baseline, which equals the prevalence of the positive class (about 0.43 on the test sample, i.e., 202 of 468 observations), and indicates good predictive ability on the positive class.
Precision stays near 1.00 for low recall levels (below 0.10), indicating that the model’s most confident positive predictions are highly accurate. Precision stays between 0.63 and 0.65 when recall rises to roughly 0.40, suggesting that a large share of outperforming firms is correctly identified. The model maintains relatively strong precision in identifying positive cases, as evidenced by precision remaining over 0.52 even at recall levels close to 0.60.
Precision progressively drops toward 0.44–0.45 as recall gets closer to 1.00, illustrating the usual trade-off between higher recall and increased false positives. Overall, the AP of 0.645 suggests that the GRU model maintains strong precision–recall performance, particularly at recall levels up to roughly 0.40, where precision remains above 0.60.
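Average precision and its no-skill reference can be computed as sketched below. The function name `average_precision` and the six toy observations are assumptions for illustration, and the sketch assumes distinct scores (ties are not handled specially). The no-skill baseline uses the class counts implied by the confusion matrices reported for the test sample (202 positives out of 468 observations).

```python
def average_precision(y_true, scores):
    """Step-wise area under the PR curve: mean of precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(y_true)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision when this positive is retrieved
    return total / n_pos

# Toy example (illustrative only): positives at ranks 1, 2 and 4.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
ap = average_precision(toy_labels, toy_scores)

# A classifier with no skill has an expected precision equal to the prevalence.
no_skill = 202 / 468

print(f"toy AP = {ap:.3f}")
print(f"no-skill baseline on the test sample = {no_skill:.3f}")
```

Reading an AP value against the prevalence, rather than against 0.50, is what makes the precision–recall view informative when classes are not perfectly balanced.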
Overall, this stage suggests that GRU provides the best overall performance under the initial specification and highlights the usefulness of sequential modeling for this task.
7.2.2. Stage 2: Deep Learning Comparison Under Stricter Classification
At this stage, a more stringent panel-safe classification approach is applied, and only the deep learning architectures (GRU, LSTM, and CNN1D) are analyzed. Although the notion of shareholder value creation is still based on a beta-adjusted market benchmark, the modeling setup becomes more constrained, which avoids potential information leakage and ensures a more realistic forecasting setting.
Because the models must capture more nuanced and economically meaningful patterns related to abnormal performance rather than raw returns, the prediction task becomes more difficult. This improved methodology allows for a more thorough evaluation of the relative performance of deep learning models.
The out-of-sample performance of deep learning models under the more stringent classification framework is reported in
Table 8. CNN1D shows the strongest discriminative performance in predicting future shareholder value creation, with the highest AUC (0.635) among the evaluated models.
Nonetheless, LSTM achieves the highest recall (0.243) and F1-score (0.344), suggesting a better ability to identify positive cases. This highlights a trade-off between minority class detection (F1/Recall) and overall ranking performance (AUC).
GRU shows moderate performance across most metrics, indicating performance that is relatively stable but not leading.
The current specification relies on a more constrained modeling setup and a more stringent panel-safe temporal design than the first stage. These changes make the classification task more challenging and may explain the observed changes in model ranking, as reflected in the following:
- Lower recall levels (ranging from 0.168 to 0.243), indicating more cautious detection of positive events;
- A relative compression of AUC values across models (CNN1D = 0.635, GRU = 0.605, LSTM = 0.597);
- A shift in the model ranking, with CNN1D emerging as the top-performing architecture in terms of AUC, whereas recurrent models were more competitive in the earlier stage.
Table 9 presents 95% block bootstrap confidence intervals for AUC and PR-AUC using 12-month blocks to account for possible statistical uncertainty resulting from temporal dependence.
The findings show that the models have moderate predictive performance, with confidence intervals that lie above the random benchmark except in the case of LSTM. CNN1D achieves the highest AUC (0.635), indicating comparatively strong discriminative ability. GRU performs similarly but slightly worse, whereas LSTM, whose confidence interval overlaps the 0.5 threshold, produces weaker and less stable results. Although deep learning models generally outperform random classification, the reported intervals show a non-negligible degree of uncertainty, underscoring the difficulty of the prediction task under the more stringent modeling framework.
The three deep learning architectures’ out-of-sample ROC curves under the more stringent classification framework are shown in
Figure 10. CNN1D has the largest area under the curve (AUC = 0.635), followed by GRU (AUC = 0.605) and LSTM (AUC = 0.597). Because model comparison is mostly based on AUC, these findings imply that CNN1D has the best overall discriminative performance in identifying beta-adjusted shareholder value creation events.
The ROC curves also show that CNN1D generally outperforms the other models across intermediate ranges (FPR between 0.2 and 0.6), where classification trade-offs are most informative. This trend suggests that CNN1D may be more appropriate than recurrent architectures for capturing pertinent temporal patterns under this specification, even though the performance differences are still modest.
To complement the ROC-based evaluation,
Figure 11 presents the confusion matrix of the best-performing CNN1D model on the test set.
The CNN1D confusion matrix offers a more detailed view of classification performance. Out of 468 out-of-sample observations, the model correctly identifies 34 value-creation events (true positives) and 247 non-value-creation cases (true negatives). However, it wrongly classifies 19 negative cases as positive (false positives) and 168 actual positive cases as negative (false negatives).
These numbers correspond to a precision of 0.642, meaning that 64.2% of predicted value-creation signals are correct. Recall remains limited at 0.168, so only 16.8% of actual value-creation events are identified. This asymmetry reflects conservative classification behavior: the model prioritizes avoiding false alarms (FP = 19) at the cost of missing a significant number of real positive opportunities (FN = 168). The resulting F1-score of 0.267 confirms this imbalance between recall and precision. Under the fixed threshold of 0.5, the model’s sensitivity to actual shareholder value creation events therefore remains low, despite its relatively high reliability when it does predict outperformance.
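For reference, the reported precision, recall, and F1-score can be recovered from the four confusion-matrix counts given above (TP = 34, FP = 19, FN = 168, TN = 247). The specificity line is an additional quantity computed here from the same counts; it is not reported in the text.

```latex
\begin{align*}
\text{Precision}   &= \frac{TP}{TP + FP} = \frac{34}{34 + 19} \approx 0.642, \\
\text{Recall}      &= \frac{TP}{TP + FN} = \frac{34}{34 + 168} \approx 0.168, \\
F_1                &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{68}{68 + 19 + 168} \approx 0.267, \\
\text{Specificity} &= \frac{TN}{TN + FP} = \frac{247}{247 + 19} \approx 0.929.
\end{align*}
```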
The three deep learning architectures’ validation AUC progression throughout training epochs is shown in
Figure 12. During the first few epochs, CNN1D’s performance increases quickly, from roughly 0.57 to nearly 0.65 by epoch 5, after which it stabilizes at this level.
GRU, by contrast, shows more variation over time, with validation AUC ranging between about 0.60 and 0.67. LSTM shows more stable performance at a lower level, typically remaining below 0.60 during training.
For the selected CNN1D model,
Figure 13 shows how the AUC evolves over epochs for both the training and validation sets.
The training AUC shows a steady upward trend, increasing from about 0.62 in the first epoch to nearly 0.86 in the final epoch. This pattern indicates that the model progressively fits the patterns present in the panel-structured training data.
The validation AUC, by contrast, increases more gradually and stabilizes between 0.63 and 0.65 after the initial epochs. The gap between training and validation AUC may indicate a moderate level of overfitting, which is common when deep learning models are applied to noisy financial data. Nonetheless, the validation AUC consistently stays above 0.60, indicating that the model retains meaningful discriminative power out of sample.
The stabilization of the validation AUC after the mid-training epochs is consistent with the benefit of early stopping in limiting excessive divergence between training and validation performance.
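The logic of early stopping on a validation metric can be sketched as follows. The text does not state which metric is monitored or which patience is used, so the function name `early_stopping_epoch`, the patience of 3 epochs, and the validation AUC path below are all assumptions made for illustration. The path is invented, although it loosely follows the shape described above (a rapid rise followed by a plateau near 0.65).

```python
def early_stopping_epoch(val_scores, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), 1-indexed, for a metric that should be maximized."""
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_gain = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score + min_delta:
            best_score, best_epoch = score, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return best_epoch, epoch  # stop; weights from best_epoch would be restored
    return best_epoch, len(val_scores)

# Invented validation AUC path (illustrative only, not read from Figure 13).
val_auc = [0.57, 0.60, 0.62, 0.64, 0.65, 0.648, 0.649, 0.647, 0.646]
best_epoch, stop_epoch = early_stopping_epoch(val_auc, patience=3)
print(f"best epoch = {best_epoch}, training stops at epoch {stop_epoch}")
```

Stopping once the validation metric has plateaued limits the number of epochs during which only the training fit keeps improving.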
The training and validation loss development for the chosen CNN1D model over epochs is shown in
Figure 14. The training loss decreases steadily from roughly 0.30 in the first epoch to roughly 0.23 in the last epoch, indicating improved in-sample fit during training.
The validation loss, on the other hand, varies within a comparatively small range, rising marginally from around 0.67 to roughly 0.72 in the last epochs. Since the model keeps optimizing the training objective while validation performance stabilizes, the ensuing divergence between training and validation loss may be a sign of moderate overfitting.
The validation loss shows no signs of sudden instability, indicating relatively stable training dynamics. This is consistent with the stabilization observed in the validation AUC, suggesting that despite some overfitting, the model still maintains reasonable generalization performance.
Overall, these findings indicate that the learning process is stable under the chosen training configuration, which provides further evidence for the reliability of the selected model under the more stringent classification framework.
7.2.3. Stage 3: Final Model Evaluation and Validation
In contrast to Stage 2, which concentrates solely on relative performance among deep learning architectures, this stage reintroduces benchmark models and expands the study to a wider range of performance diagnostics and validation procedures. It seeks to verify the chosen architecture’s performance in comparison to both deep learning and baseline models while assessing its robustness, stability, and usefulness. The chosen model configuration is used in this step, which offers a more thorough evaluation of out-of-sample prediction performance.
Final Comparative Performance
Using a fixed classification threshold of 0.5,
Table 10 presents the predictive performance of the benchmark models (logistic regression and the majority classifier) and deep learning architectures (CNN1D, GRU, and LSTM) on the test set.
With an AUC of 0.700, CNN1D performs best in terms of discrimination under the primary AUC criterion. This result is well above the majority classifier (AUC = 0.500) and logistic regression (AUC = 0.538), suggesting that deep learning architectures in this context capture more relevant temporal patterns than linear and naïve baselines.
CNN1D’s highest PR-AUC (0.727) and MCC (0.312), which indicate better overall classification performance, further support its selection as the best performing model. Despite achieving slightly higher balanced accuracy (0.608 and 0.613, respectively) and accuracy (0.643 and 0.650, respectively), GRU and LSTM’s lower AUC values indicate relatively poorer global ranking ability.
LSTM achieves a decent discriminative performance (AUC = 0.607), outperforming the baseline models but falling short of CNN1D and GRU. In contrast, logistic regression only slightly surpasses random classification, suggesting that linear decision boundaries capture only a small portion of the underlying predictive structure.
In terms of classification trade-offs, CNN1D attains the highest precision (0.894), whereas GRU and LSTM show greater recall values (0.351 and 0.342, respectively), which highlights a trade-off between sensitivity to positive instances and prediction purity. All things considered, these findings confirm that CNN1D is the best model in terms of global discriminative performance.
Table 11 presents 95% block bootstrap confidence intervals for AUC and PR-AUC using 12-month blocks to account for statistical uncertainty caused by temporal dependency.
With an AUC of 0.700 and a 95% confidence interval of [0.585;0.803], the findings verify that CNN1D delivers the best discriminative performance. Although their intervals show more modest predictive stability, GRU and LSTM likewise perform better than the logistic regression baseline. Overall, the confidence intervals show a non-negligible level of uncertainty in the final prediction framework while supporting the superiority of deep learning models over the linear baseline.
To formally assess the significance of performance differences across models,
Table 12 presents pairwise bootstrap comparisons of AUC values.
The pairwise bootstrap results provide additional insight into the statistical significance of performance differences across models. Both CNN1D and GRU significantly outperform the logistic regression baseline, with AUC improvements of +0.132 and +0.154, respectively (p = 0.002). LSTM also shows an improvement over logistic regression, although it sits exactly at the conventional 5% threshold and is therefore only borderline significant (ΔAUC = +0.094; p = 0.050). In contrast, the differences among the deep learning architectures themselves remain statistically uncertain. The AUC difference between CNN1D and GRU is not significant (ΔAUC = −0.023; p = 0.584), nor are the differences between CNN1D and LSTM (ΔAUC = +0.038; p = 0.430) and between GRU and LSTM (ΔAUC = +0.061; p = 0.296). These findings indicate that the superiority of deep learning models over the linear benchmark is statistically supported, whereas the ranking among deep learning architectures should be interpreted with caution despite CNN1D achieving the highest overall AUC in the final evaluation stage.
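A paired bootstrap comparison of two models' AUC values can be sketched as follows. The text does not detail the resampling scheme or the p-value convention, so everything here is an assumption: observations are resampled independently (a block scheme could be substituted to respect temporal dependence), the two-sided p-value is taken as twice the smaller tail share of the bootstrap differences, and the function names and synthetic scores are hypothetical.

```python
import random

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=400, seed=0):
    """Observed AUC(A) - AUC(B) and a two-sided bootstrap p-value for the difference."""
    rng = random.Random(seed)
    n = len(y_true)
    observed = roc_auc(y_true, scores_a) - roc_auc(y_true, scores_b)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # same rows for both models (paired)
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:                         # need both classes in the resample
            auc_a = roc_auc(ys, [scores_a[i] for i in idx])
            auc_b = roc_auc(ys, [scores_b[i] for i in idx])
            diffs.append(auc_a - auc_b)
    share_le_zero = sum(d <= 0 for d in diffs) / n_boot
    share_ge_zero = sum(d >= 0 for d in diffs) / n_boot
    p_value = min(1.0, 2 * min(share_le_zero, share_ge_zero))
    return observed, p_value

# Synthetic example (NOT the paper's data): model A is informative, model B is noise.
rng = random.Random(3)
labels = [1 if rng.random() < 0.45 else 0 for _ in range(120)]
model_a = [0.5 * y + rng.random() for y in labels]
model_b = [rng.random() for _ in labels]

delta_auc, p_value = paired_bootstrap_auc_diff(labels, model_a, model_b)
print(f"delta AUC = {delta_auc:+.3f}, p = {p_value:.3f}")
```

Using the same resampled rows for both models is what makes the comparison paired: sampling noise that affects both models in the same way cancels out in the difference.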
ROC-Based Validation
The ROC curves of the benchmark logistic regression model and the deep learning models (CNN1D, GRU, and LSTM) that were assessed on the test set for the classification of beta-adjusted stock outperformance are shown in
Figure 15.
The findings show that the CNN1D model obtains the greatest discriminative performance with an AUC of 0.700, which is clearly higher than the random benchmark (AUC = 0.5). The GRU (AUC = 0.638) and LSTM (AUC = 0.607) models are ranked next, while the logistic regression baseline (AUC = 0.538) stays close to the diagonal reference line, showing low predictive power.
Over the majority of the false positive rate range, the CNN1D curve lies above the other models, suggesting a better ability to distinguish between firms that outperform and those that do not. Even if the performance differences are still modest, the consistent model ranking validates the results shown in
Table 10.
These findings suggest that deep learning architectures are better at capturing temporal dependencies in financial data than linear models. Specifically, CNN1D’s relative performance within the chosen sequential framework could reflect its capacity to identify pertinent local temporal patterns.
Confusion Matrix Diagnostics
The final confusion matrices used for model evaluation are presented in
Figure 16.
Beyond overall performance indicators, the confusion matrices provide a detailed view of each model’s classification behavior. CNN1D shows a precision-oriented classification pattern among the assessed models. With 261 true negatives and only 5 false positives, it achieves a high precision of 0.894, indicating that most predicted outperformance cases are correct. Nevertheless, its recall is still low at 0.208, indicating a conservative identification of positive cases.
With 71 true positives and a relatively high number of true negatives (230), the GRU model shows a more balanced classification performance. In comparison to CNN1D, this yields a higher recall (0.351), but at the expense of more false positives (36), which leads to lower precision (0.664). This pattern reflects a trade-off between sensitivity and precision.
With 69 true positives and 133 false negatives, the LSTM model shows intermediate performance, producing a recall of 0.342 and a precision of 0.690. Its performance reflects a trade-off between classification reliability and detection ability, without outperforming the other models in every dimension.
Logistic regression, on the other hand, performs poorly in identifying high-performing firms. Although it correctly identifies many negative cases (243 true negatives), it achieves a low recall (0.119) with only 24 true positives and 178 false negatives. This suggests that linear models may be less effective at capturing complex patterns in the data.
Overall, these findings show that different models have different classification profiles, with CNN1D emphasizing precision over sensitivity, whereas GRU and LSTM detect more positive cases at the cost of more false positives.
Threshold Optimization and Decision Rule Diagnostics
For the top-performing deep learning model (CNN1D),
Figure 17 shows how sensitive the F1-score is to changes in the classification threshold. As the threshold rises from 0.10 to roughly 0.30, the F1-score increases gradually and reaches its maximum value (≈0.69) at a threshold of 0.30. The best trade-off between recall and precision is found at this point.
Beyond this level, the F1-score declines sharply: higher thresholds reduce the model’s capacity to detect positive cases (outperformance events), which lowers recall and weakens the balance captured by the F1 metric. The fixed-threshold performance metrics shown in
Table 10, which are calculated using the traditional classification threshold of 0.5, are not directly comparable to the F1-score values reported in
Figure 17, which are produced under threshold optimization.
This contrast emphasizes how crucial threshold selection is in classification problems because the decision rule selected can have a substantial impact on model performance.
Crucially, the Youden J threshold and the optimal F1 threshold (≈0.30) coincide, indicating a consistent decision boundary under both criteria:
- The F1-max criterion, which balances recall and precision;
- Youden’s J statistic (maximization of sensitivity + specificity − 1).
These findings imply that this classification problem may not be best served by the conventional threshold of 0.5. Lowering the decision threshold to about 0.30 appears preferable, especially when identifying companies that outperform the beta-adjusted market benchmark.
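For clarity, the two threshold-selection criteria can be written as functions of the classification threshold. The symbol t for the threshold and the starred notation for the selected thresholds are introduced here for illustration only.

```latex
\begin{align*}
F_1(t) &= \frac{2\,\mathrm{Precision}(t)\,\mathrm{Recall}(t)}{\mathrm{Precision}(t) + \mathrm{Recall}(t)},
&
t^{*}_{F_1} &= \arg\max_{t} F_1(t), \\
J(t) &= \mathrm{Sensitivity}(t) + \mathrm{Specificity}(t) - 1,
&
t^{*}_{J} &= \arg\max_{t} J(t).
\end{align*}
```

The first criterion ignores true negatives, whereas the second weights sensitivity and specificity equally, so the two selected thresholds need not agree in general. In this stage, both are reported at approximately 0.30.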
Precision–Recall Validation
The CNN1D model’s precision–recall (PR) curve, as assessed on the out-of-sample test set, is shown in
Figure 18. The average precision (AP) of 0.727 indicates solid performance in identifying positive cases, i.e., firms outperforming the beta-adjusted market benchmark.
Precision remains very high (above 0.9) at low recall levels (around 0.1–0.2), suggesting that high confidence positive predictions are generally reliable. Precision decreases as recall increases, reflecting the usual trade-off between maintaining precision and identifying more true positives.
Precision is comparatively stable (around 0.65–0.70) at intermediate recall levels (about 0.5–0.6), suggesting that the model retains acceptable classification quality despite recognizing a significant percentage of outperforming enterprises.
Because the classification problem in this study is somewhat unbalanced (≈54% class 0 vs. 46% class 1), the PR curve is particularly instructive. In these situations, compared to ROC-AUC, the PR-AUC (or AP) offers a supplementary and more insightful assessment of performance on the positive class.
Overall, the AP value of 0.727 supports the findings from the ROC analysis by showing that the CNN1D model has good precision–recall performance in identifying firms that deliver abnormal performance relative to their beta-adjusted benchmark.
Statistical Significance and Stability
The empirical null distribution of the AUC derived from 500 random permutations of the target variable is shown in
Figure 19. The histogram shows the distribution of AUC values under the null hypothesis that there is no predictive association between the explanatory variables and the target (random labeling). The dashed vertical line marks the CNN1D model’s observed test AUC (AUC = 0.700).
The null distribution, which represents the performance of a random classifier, is centered at roughly 0.50. The observed AUC lies far in the right tail of the distribution, clearly outside the range of permuted AUC values. This separation offers compelling evidence that random labeling is unlikely to account for CNN1D’s predictive performance.
According to the empirical p-value (p < 0.01), the observed AUC is statistically significant at conventional levels. Taken together, these findings support the hypothesis that the CNN1D model has meaningful discriminative power in forecasting firm-level outperformance relative to the beta-adjusted market benchmark.
Out-of-Sample Temporal Stability Analysis
To further assess the robustness of the predictive results, the out-of-sample test set was divided into three consecutive chronological sub-periods (P1, P2, and P3). The predictive performance of CNN1D, GRU, LSTM, and Logistic Regression was then evaluated separately within each sub-period using AUC and PR-AUC.
The results are summarized in
Table 13.
The results reveal notable differences in temporal stability across the competing models. Although GRU achieves the highest AUC in P1 (0.820), its performance declines markedly in P2 (0.535) and remains relatively weak in P3 (0.624). LSTM exhibits a similar pattern, with AUC values decreasing from 0.794 in P1 to 0.566 in P3. In contrast, CNN1D maintains the most stable performance across all sub-periods, with AUC values ranging from 0.754 to 0.782 and PR-AUC values from 0.754 to 0.774. Logistic Regression also shows considerable fluctuations, with AUC values varying between 0.435 and 0.655. Overall, these findings indicate that the superior performance of CNN1D is not driven by a single favorable interval but reflects a more persistent and temporally stable predictive signal throughout the out-of-sample period.
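A sub-period evaluation of this kind can be sketched as follows. The text does not specify how the three sub-periods are delimited, so this sketch assumes a split into chronologically ordered groups of equal size. The function names and the twelve toy observations are hypothetical and do not correspond to Table 13.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_subperiod(dates, y_true, scores, n_periods=3):
    """Sort observations by date, cut them into consecutive chunks, and score each chunk."""
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    n = len(order)
    results = []
    for k in range(n_periods):
        chunk = order[k * n // n_periods:(k + 1) * n // n_periods]
        results.append(roc_auc([y_true[i] for i in chunk],
                               [scores[i] for i in chunk]))
    return results

# Toy test set (illustrative only): dates are month indices 1..12.
dates = list(range(1, 13))
labels = [0, 1, 0, 1,   0, 1, 1, 0,   1, 0, 0, 1]
scores = [0.1, 0.9, 0.2, 0.8,   0.6, 0.4, 0.7, 0.5,   0.3, 0.6, 0.7, 0.4]

period_aucs = auc_by_subperiod(dates, labels, scores)
for name, value in zip(["P1", "P2", "P3"], period_aucs):
    print(f"{name}: AUC = {value:.3f}")
```

The toy scores are built so that discrimination deteriorates from P1 to P3, which is the kind of instability that a pooled AUC alone would hide. Note that cutting by observation count can place observations from the same date in two adjacent sub-periods; cutting on calendar boundaries avoids this.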
Continuous Regression Robustness Analysis
To examine whether the binary formulation of shareholder value creation influences the results, an additional robustness analysis was conducted using abnormal TSR as a continuous target variable.
The continuous regression robustness analysis reveals limited predictive performance across all specifications. Linear Regression and Ridge Regression produce very poor out-of-sample results, with R² values of −7.982 and −7.977, respectively, and RMSE values exceeding 1.34. Although ensemble methods perform relatively better, their predictive power remains weak. Random Forest achieves an RMSE of 0.935 and an R² of −3.353, while Gradient Boosting delivers the best performance with an RMSE of 0.587, an MAE of 0.336, and an R² of −0.714. Nevertheless, all models generate negative R² values, indicating that they fail to predict the exact magnitude of abnormal shareholder returns more accurately than a simple mean-based benchmark. Directional accuracy remains modest, ranging from 43.8% for linear specifications to 55.3% for Random Forest. These findings suggest that predicting the continuous abnormal TSR is particularly challenging in the Moroccan market context. Consequently, the binary classification framework adopted in the main analysis appears more suitable for distinguishing shareholder value creation from value destruction and yields more robust predictive performance, as evidenced by the substantially higher classification results reported for the deep learning models (AUC up to 0.692).
The performance of the regression models under the continuous robustness analysis is summarized in
Table 14.
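The regression metrics discussed above can be computed as sketched below. The function name `regression_report`, the definition of directional accuracy as agreement in sign, and the four toy values are assumptions made for illustration. The out-of-sample R² is taken relative to the mean of the evaluated sample, which is the usual convention.

```python
from math import sqrt

def regression_report(y_true, y_pred):
    """RMSE, MAE, R-squared, and share of predictions with the correct sign."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model squared error
    sst = sum((t - mean_y) ** 2 for t in y_true)             # mean-benchmark squared error
    return {
        "rmse": sqrt(sse / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "r2": 1.0 - sse / sst,
        "directional_accuracy": sum((t > 0) == (p > 0)
                                    for t, p in zip(y_true, y_pred)) / n,
    }

# Toy abnormal returns (illustrative only).
actual = [0.2, -0.1, 0.4, -0.3]
mean_forecast = [sum(actual) / len(actual)] * len(actual)  # constant mean benchmark
overshooting = [1.0, -1.0, 1.5, -1.2]                      # right sign, wrong magnitude

benchmark = regression_report(actual, mean_forecast)
overshoot = regression_report(actual, overshooting)
print(f"mean benchmark: R2 = {benchmark['r2']:.3f}")
print(f"overshooting model: R2 = {overshoot['r2']:.3f}, "
      f"directional accuracy = {overshoot['directional_accuracy']:.2f}")
```

The toy example shows that R² becomes negative as soon as the squared errors exceed those of the constant mean forecast, even when every predicted sign is correct. This illustrates why a model can be uninformative about magnitudes while still carrying directional information.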
7.2.5. Variable Block Analysis
This section examines model results across different groupings of explanatory variables to further explore the contribution of each information source to predictive performance. In particular, the models are re-estimated using accounting variables, market-based indicators, macroeconomic variables, and their different combinations.
The predictive performance of all assessed models across these variable blocks is shown in
Table 17.
Based on the findings presented in
Table 17,
Table 18 presents the top-performing model for each variable block to aid in interpretation.
While MCC is reported as a complementary metric, model comparison in this study mainly relies on AUC, as it provides a threshold-independent view of discriminative performance.
The results show clear differences across variable blocks. Interestingly, the full specification does not produce the highest AUC. The best result is instead obtained when accounting variables are excluded, with CNN1D reaching an AUC of 0.763. This suggests that adding all available variables does not always improve predictive performance, and that some inputs may introduce noise or redundant information.
Macroeconomic variables, taken on their own, already provide a strong signal, with GRU reaching an AUC of 0.723. In contrast, market variables alone lead to much weaker results. The role of macroeconomic information becomes even clearer when it is removed, as performance drops sharply, with the best AUC falling to 0.544.
Overall, these results point to the central role of macroeconomic variables in the predictive setup and highlight the importance of selecting relevant inputs rather than simply increasing their number.