In this section, we evaluate the proposed correction methods when applied to trained rLTCN classifiers. We first compare the performance of the four correction strategies (Temperature, Platt, Beta, and Shifting) on a benchmark of imbalanced datasets. Then, we extend this comparison by contrasting the rLTCN corrected with the best strategy against state-of-the-art classifiers. Finally, we illustrate the practical value of the correction methods in a real case study concerning churn prediction.
5.1. Comparing Correction Strategies
We adopt the 30 classification problems (see Table 1) from [6] to assess the performance differences among the proposed correction methods. These problems involve different statistical properties and varying complexities, as indicated by class imbalance scores (the higher the score, the greater the imbalance) and separability indices (the smaller the index, the more challenging the problem). The latter is determined as the proportion of presumably correct instances [52] relative to the total number of instances. In these problems, the number of features ranges from 2 to 240, the number of instances ranges from 846 to 10,992, and the number of decision classes ranges from 2 to 100.
Next, we measure the performance improvements when adding the correction layer to the rLTCN architecture after performing 5-fold nested cross-validation. The nested cross-validation procedure uses a stratified 5-fold outer loop and a stratified 5-fold inner loop. Stratification ensures that the class proportions of the original dataset are approximately preserved in every training, validation, and test partition. Both loops are controlled by a fixed random seed of 42, which was applied to Python 3.13.9’s built-in random module, NumPy 2.3.5, and TensorFlow 2.20.0 before any split construction. The 30 benchmark datasets were loaded from CSV files in which the last column contains the class label and all preceding columns contain the input features. Each dataset was preprocessed to retain only decision classes with at least five instances. This lower bound matches the number of outer folds and guarantees that every class appears in each fold during stratified splitting. Class labels were encoded as consecutive integers using a label encoder fitted on the full dataset before any fold was constructed. The rLTCN classifier supports this integer encoding since it internally converts class labels to one-hot representations, where each output neuron maps to a single decision class. The outer and inner folds were built using scikit-learn’s StratifiedKFold with shuffle enabled, so random variation in fold assignment is governed by the fixed seed. In each outer fold, the held-out partition served as the test set, and the remaining instances formed the training partition. The training partition was passed to the inner grid search cross-validator, which explored the hyperparameter grid using macro-averaged F1 as the internal scoring metric. The combination yielding the highest mean macro-averaged F1 across the inner validation partitions was selected, and the winning model was retrained on the full outer training partition and evaluated on the corresponding outer test fold. 
The performance estimates reported in the paper correspond to the mean performance metric computed across the five outer test folds.
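The nested cross-validation protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `LogisticRegression` and a synthetic dataset stand in for the rLTCN classifier and the benchmark CSV files, the helper names `drop_rare_classes` and `nested_cv` are hypothetical, and TensorFlow seeding is omitted to keep the sketch dependency-light.

```python
import random

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder

SEED = 42
random.seed(SEED)
np.random.seed(SEED)


def drop_rare_classes(X, y, min_count=5):
    """Keep only instances whose class has at least `min_count` members."""
    labels, counts = np.unique(y, return_counts=True)
    keep = np.isin(y, labels[counts >= min_count])
    return X[keep], y[keep]


def nested_cv(X, y, estimator, param_grid, n_splits=5, seed=SEED):
    """Return the macro-averaged F1 obtained on each outer test fold."""
    outer = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    inner = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        # Inner grid search selects hyperparameters by mean macro-F1 and,
        # with refit=True, retrains the winner on the full outer training set.
        search = GridSearchCV(estimator, param_grid, scoring="f1_macro",
                              cv=inner, refit=True)
        search.fit(X[train_idx], y[train_idx])
        y_pred = search.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], y_pred, average="macro"))
    return np.array(scores)


# Synthetic stand-in for one benchmark dataset (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.6, 0.3, 0.1],
                           random_state=SEED)
X, y = drop_rare_classes(X, y)
y = LabelEncoder().fit_transform(y)  # consecutive integer labels

outer_scores = nested_cv(X, y, LogisticRegression(max_iter=1000),
                         {"C": [0.1, 1.0, 10.0]})
print("mean macro-F1 over outer folds:", outer_scores.mean())
```

The reported estimate corresponds to `outer_scores.mean()`, that is, the mean over the five outer test folds.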
Since the proposed methods are post-optimization procedures, we first need to train the rLTCN classifiers. Concerning hyperparameter tuning, we optimized the following hyperparameters using grid search: the number of iterations during reasoning, the activation function, the nonlinearity coefficient, and the regularization penalty. The correction layer is optimized using the analytically derived partial derivatives and the Limited-memory Broyden–Fletcher–Goldfarb–Shanno with Box constraints (L-BFGS-B) optimizer [53] from scipy.optimize.minimize. We choose L-BFGS-B over standard gradient descent for two reasons. First, the problem has a moderate number of parameters, a regime in which the quasi-Newton curvature approximation of L-BFGS-B typically converges in far fewer iterations than gradient descent. Second, L-BFGS-B eliminates the need to manually tune the learning rate, as it selects the step size through a line search at each iteration. For this optimizer, we use a maximum of 1000 iterations and tight tolerances (≤10^−12) to reduce premature stopping.
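The optimizer configuration described above can be illustrated with a short sketch. It is only an assumption-laden example: the objective below is a cross-entropy stand-in with an analytic gradient for a class-wise additive shift, not the macro-averaged soft F1 objective used in the paper, and the box bounds of ±5, the synthetic logits, and the name `loss_and_grad` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp


def loss_and_grad(shift, logits, onehot):
    """Mean cross-entropy of softmax(logits + shift) and its exact gradient."""
    z = logits + shift
    log_p = z - logsumexp(z, axis=1, keepdims=True)
    loss = -np.mean(np.sum(onehot * log_p, axis=1))
    grad = np.mean(np.exp(log_p) - onehot, axis=0)
    return loss, grad


# Synthetic imbalanced logits (assumption): class 0 is favored by the model.
rng = np.random.default_rng(42)
K, M = 200, 3
labels = rng.choice(M, size=K, p=[0.7, 0.2, 0.1])
onehot = np.eye(M)[labels]
logits = rng.normal(size=(K, M)) + 2.0 * onehot
logits[:, 0] += 1.0

initial_loss, _ = loss_and_grad(np.zeros(M), logits, onehot)

result = minimize(
    loss_and_grad,
    x0=np.zeros(M),
    args=(logits, onehot),
    jac=True,                      # the objective returns (value, gradient)
    method="L-BFGS-B",
    bounds=[(-5.0, 5.0)] * M,      # box constraints (hypothetical range)
    options={"maxiter": 1000, "ftol": 1e-12, "gtol": 1e-12},
)
print(result.x, initial_loss, result.fun)
```

Passing `jac=True` lets the optimizer reuse the analytic gradient instead of finite differences, which is what makes tight tolerances affordable.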
In our experiments, we use seven performance measures, each capturing a distinct aspect of predictive quality. Cohen’s Kappa (κ) quantifies the agreement between predicted and true class labels beyond what would be expected by chance alone. It takes values in [−1, 1], where 0 corresponds to random agreement, 1 to perfect agreement, and −1 to perfect systematic disagreement (agreement worse than chance). Unlike raw accuracy, Kappa penalizes models that exploit class prevalence and thus provides a stronger indicator of discriminative power in imbalanced settings. Macro-averaged F1 computes the harmonic mean of precision and recall independently for each class and then averages across classes without weighting, which gives equal importance to rare and frequent classes. Weighted F1 follows the same logic but weights each class contribution by its support, rendering it sensitive to the overall distribution of examples. Balanced accuracy (BA) averages recall across all classes and serves as a complement to Kappa when examining per-class sensitivity. Macro-averaged precision (P) and recall (R) measure, respectively, the proportion of correctly identified positives among all predicted positives and among all actual positives. The area under the precision-recall curve (PR-AUC) summarizes the trade-off between precision and recall at multiple thresholds. It is especially informative under class imbalance, as it is not inflated by the large number of true negatives that tends to distort the ROC curve.
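For concreteness, the seven measures can be computed with scikit-learn as sketched below. Two assumptions are worth flagging: PR-AUC is approximated here by the macro-averaged average precision over one-vs-rest binarized labels, which may differ from the exact estimator used in the paper, and the helper `evaluate` and the toy predictions are hypothetical. The sketch targets problems with three or more classes.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             cohen_kappa_score, f1_score, precision_score,
                             recall_score)
from sklearn.preprocessing import label_binarize


def evaluate(y_true, y_pred, y_score, classes):
    """Compute the seven measures for a problem with three or more classes."""
    y_bin = label_binarize(y_true, classes=classes)  # one column per class
    return {
        "kappa": cohen_kappa_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "precision_macro": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall_macro": recall_score(y_true, y_pred, average="macro", zero_division=0),
        # Average precision as a summary of the precision-recall curve (assumption).
        "pr_auc_macro": average_precision_score(y_bin, y_score, average="macro"),
    }


# Toy imbalanced example (assumption): class 0 is the majority class.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 0, 1, 1, 0, 2, 2, 0])
y_score = 0.7 * np.eye(3)[y_pred] + 0.1  # rows sum to one

metrics = evaluate(y_true, y_pred, y_score, classes=[0, 1, 2])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```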
Table 2, Table 3, Table 4 and Table 5 report the improvement expressed in percentage points for each dataset and correction method. Across all four methods, the largest Kappa gains concentrate in datasets with severe class imbalance, most notably D2 and D13. Other imbalanced datasets also show consistent gains, with the Kappa gain generally increasing alongside the degree of imbalance, including datasets with imb ≥ 0.59 such as D28–D30. This pattern aligns with the motivation for the correction approach, given that the Moore-Penrose inverse learning step is governed by the mean squared error, which is dominated by the majority class in imbalanced problems. A post-hoc adjustment of the output logits can recover substantial recall for minority classes, as confirmed by the large gains reported for the same high-imbalance datasets. These recall gains occasionally come at the cost of precision, since repositioning decision boundaries to favor minority classes increases true positives while also admitting additional false positives. This trade-off is most visible under Platt and Beta corrections, where the precision gain is most negative in D2, with smaller negative precision gains appearing in D11, D26, D29, and D30 under those same methods. Datasets D5, D7, and D9 show only small gains under any method, while D3 and D21 yield zero improvement across every metric. In all these cases, the correction layer finds no logit distortion to address because the baseline rLTCN already produces good predictions.
Among the four methods, Beta and Platt corrections consistently produce the largest improvements. Beta correction achieves the highest mean gain across the 30 datasets, followed by Platt, while Temperature and Shifting reach the same, lower mean. In datasets combining high imbalance with low separability, the gains are substantial across all four methods, though the relative advantage of Beta and Platt over the softmax-based corrections is more pronounced when imbalance and low separability co-occur. This advantage is consistent with the sigmoid-based parameterizations of Platt and Beta being better suited to settings where the relationship between raw logits and ideal decision scores is nonlinear. The additional parameter in Beta correction introduces a logarithmic curvature term that provides an extra degree of freedom unavailable to the other methods, which is particularly useful when the logit distribution is asymmetric. In large multi-class problems with low separability and 100 classes, all four corrections produce substantial gains. In these settings, Shifting occasionally delivers the largest single Kappa gain, possibly because a uniform additive translation of the score space suffices to re-balance a large number of equally represented classes. However, the macro-averaged F1 gain does not always track the Kappa gain in these problems, with Platt leading on macro-averaged F1 for some datasets while Shifting leads on Kappa, which suggests that the relative ranking of methods can depend on the specific metric under consideration. The broader picture across all datasets is that most measures follow the Kappa ranking, with occasional divergences under high-imbalance conditions.
To further explore the observed performance differences, we apply the Friedman test [54], a nonparametric procedure that evaluates whether at least one method produces systematically different improvements across the 30 datasets. The test result is significant, which confirms that the four methods are not statistically equivalent. To identify which pairs drive this difference, we apply the Wilcoxon signed-rank test [55] for all six method pairs, followed by Holm’s correction [56] to control the family-wise error rate. The Wilcoxon test does not assume normality and accounts for both the direction and magnitude of pairwise differences, which makes it appropriate for comparing classifiers over a finite benchmark. Holm’s correction adjusts the rejection threshold sequentially from the most to the least significant pair, which is less conservative than Bonferroni correction while still controlling the family-wise error rate.
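The testing pipeline described above can be sketched as follows. The gains are synthetic placeholders (not the values from the paper), the method offsets are arbitrary, and `holm` is a hypothetical helper that implements the standard step-down adjustment.

```python
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    adjusted = np.empty(n)
    running_max = 0.0
    for rank, idx in enumerate(np.argsort(p)):
        # Smallest p-value is multiplied by n, the next by n - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, (n - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Synthetic per-dataset gains for 30 datasets and 4 methods (assumption).
rng = np.random.default_rng(42)
methods = ["Temperature", "Platt", "Beta", "Shifting"]
gains = rng.normal(loc=[1.0, 1.2, 1.8, 1.0], scale=1.0, size=(30, 4))

# Omnibus test: does at least one method differ systematically?
friedman_stat, friedman_p = friedmanchisquare(*gains.T)

# Pairwise Wilcoxon signed-rank tests for all six pairs, then Holm.
pairs = list(combinations(range(len(methods)), 2))
raw_p = [wilcoxon(gains[:, i], gains[:, j]).pvalue for i, j in pairs]
adj_p = holm(raw_p)

print(f"Friedman: stat={friedman_stat:.3f}, p={friedman_p:.4f}")
for (i, j), p_raw, p_adj in zip(pairs, raw_p, adj_p):
    print(f"{methods[i]} vs {methods[j]}: raw={p_raw:.4f}, holm={p_adj:.4f}")
```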
Table 6 reports the raw and corrected p-values, the test outcome, the win/tie/loss (W/T/L) counts, and Cohen’s d for each pair. The W/T/L counts record how many times Method 1 produces a higher, equal, or lower improvement than Method 2 across the 30 datasets. This count is a direct measure of dominance that is independent of the p-value. Cohen’s d gives the standardized mean paired difference between Method 1 and Method 2, where absolute values near 0.2, 0.5, and 0.8 indicate small, medium, and large effects. A positive d means Method 1 improves more than Method 2 on average, while a negative d means Method 2 holds the advantage. Only one pair reaches significance after Holm’s correction, namely Beta against Shifting. Therefore, Beta correction produces larger improvements than Shifting across the benchmark. The comparisons of Beta against Temperature and Platt do not reach significance after correction. Nonetheless, the raw p-values are borderline and the W/T/L counts strongly favor Beta (13/14/3 and 11/16/3), with moderate effect sizes. This pattern points to a practical advantage of Beta over Temperature and Platt, although the correction penalty prevents it from reaching formal significance, given the 30 datasets used in our study. The remaining pairs do not approach significance, and their W/T/L counts and effect sizes confirm the absence of any meaningful difference.
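The two descriptive statistics used alongside the p-values can be computed as in the sketch below. It assumes that Cohen's d is taken as the mean of the paired differences divided by their sample standard deviation, which is one common reading of "standardized mean paired difference"; the function names and the toy numbers are hypothetical.

```python
import numpy as np


def win_tie_loss(a, b, tol=1e-12):
    """Count datasets where method a is higher than, equal to, or lower than b."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    wins = int(np.sum(diff > tol))
    ties = int(np.sum(np.abs(diff) <= tol))
    losses = int(np.sum(diff < -tol))
    return wins, ties, losses


def cohens_d_paired(a, b):
    """Mean paired difference divided by the sample std of the differences."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    sd = diff.std(ddof=1)
    return 0.0 if sd == 0.0 else float(diff.mean() / sd)


# Toy gains for two methods on four datasets (assumption).
method_1 = [1.0, 2.0, 0.0, 3.0]
method_2 = [0.0, 2.0, 1.0, 1.0]

wtl = win_tie_loss(method_1, method_2)
d = cohens_d_paired(method_1, method_2)
print("W/T/L:", wtl)
print("Cohen's d:", round(d, 4))
```

A positive value of `d` indicates that the first method improves more on average, matching the sign convention stated in the text.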
Figure 2 illustrates, as an example, the inner workings of the proposed logit correction methods used to adjust the outputs produced by the rLTCN model. Each plot compares the original output scores (dashed gray line) with the adjusted scores after correction (solid blue line) as a function of the input logit values. As a reminder, the four class-specific correction methods operate as follows: (a) shifting correction applies additive shifts to the logits, (b) temperature correction softens or sharpens the logit distribution using a scaling factor, (c) Platt correction applies a parametric adjustment to the sigmoid function, and (d) Beta correction applies a more flexible nonlinear transformation that incorporates logit magnitude. It is worth noting that the first two methods use the softmax activation, while the latter two operate with the sigmoid activation function.
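As a rough illustration of the first three transformations, the sketch below applies class-specific parameters to a matrix of logits. The exact parameterizations are assumptions inferred from the descriptions above (one shift or temperature per class, one slope and intercept per class for Platt), and the function names are hypothetical. Beta correction is left out because its functional form is not restated in this section.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def shifting(logits, shift):
    """Additive class-wise shift followed by softmax (assumed form)."""
    return softmax(logits + shift)


def temperature(logits, temp):
    """Class-wise temperature scaling followed by softmax (assumed form)."""
    return softmax(logits / temp)


def platt(logits, slope, intercept):
    """Class-wise affine map followed by the sigmoid (assumed form)."""
    return sigmoid(slope * logits + intercept)


logits = np.array([[2.0, 0.5, -1.0]])
baseline = softmax(logits)

# Boost the under-scored third class with each method.
shifted = shifting(logits, np.array([0.0, 0.0, 3.0]))
scaled = temperature(logits, np.array([1.0, 1.0, 0.25]))
platt_scores = platt(logits, np.ones(3), np.array([0.0, 0.0, 3.0]))

print(baseline, shifted, scaled, platt_scores, sep="\n")
```

With neutral parameters (zero shift, unit temperature, unit slope and zero intercept) each function reduces to the uncorrected activation, which corresponds to the dashed reference curve in the plots.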
The computational overhead of the correction layer is negligible relative to the two-step training cost of the rLTCN classifier. For a dataset with K instances and M decision classes, Shifting and Temperature corrections each optimize M parameters, Platt correction optimizes 2M parameters, and Beta correction optimizes 3M parameters. Let P denote the number of correction parameters for a given method. Each L-BFGS-B iteration requires one evaluation of the macro-averaged soft F1 objective at cost O(KM) and one gradient evaluation at the same cost, since per-class gradients accumulate instance-wise terms over K instances. The L-BFGS-B algorithm maintains a limited-memory approximation of the inverse Hessian using the m most recent curvature pairs, where m is typically a small constant (the SciPy default is 10). The Hessian update and direction computation at each iteration cost O(mP), which is dominated by O(KM) whenever m is much smaller than K, given that P is proportional to M. The total cost over I iterations is O(I(KM + mP)), where I is bounded by the maximum of 1000 iterations set in our experiments. Since I, m, and P are all small relative to K, the correction layer adds a cost that scales linearly in K and M. This is substantially lower than the cost of the pseudoinverse computation in the supervised learning step of the rLTCN classifier, which additionally depends on the number of input features N.
5.2. Comparing Against State-of-the-Art Classifiers
In the second part of our empirical study, we compare the performance of the cLTCN classifier (rLTCN using only Beta correction for simplicity) against state-of-the-art classifiers. The selected methods include the FCM with threshold correction (FCM-A) [9], the FCM Multiclass Classifier (FCMMC) [15], the uncorrected rLTCN classifier [6], Logistic Regression (LR), Decision Trees (DTs), Gaussian Naive Bayes (GNB), Support Vector Machines (SVMs), Light Gradient-Boosting Machines (LGBMs) [57], and Attentive Interpretable Tabular Learning (TabNet) [58]. Each classifier undergoes hyperparameter tuning through nested 5-fold cross-validation and grid search, using the same 30 pattern classification datasets from the previous subsection.
For DTs, the tuned parameters are criterion∈ {gini, entropy}, splitter∈ {best, random}, and max_features∈ {sqrt, log2}. For LR, the tuned parameters are solver∈ {lbfgs, saga}, C∈ {0.01, 0.1, 1, 10, 100}, and penalty∈ {l2, none} for lbfgs and {l1, l2, none} for saga. For these solvers, none means that no regularization is enforced. For SVMs, the tuned parameters are kernel∈ {linear, poly, rbf, sigmoid}, C∈ {0.01, 0.1, 1, 10, 100}, and gamma∈ {scale, auto}. For LGBMs, the tuned parameters are n_estimators∈ {100, 300, 500}, max_depth∈ {10, 20, 30}, and learning_rate∈ {0.01, 0.05, 0.1}. For TabNet, the tuned parameters are N_d = N_a∈ {8, 16, 32}, n_steps∈ {3, 5, 7}, and gamma∈ {1.0, 1.3, 1.5}. For FCMMC, the parameters are {sigmoid}, training_loss∈ {softmax}, optimizer∈ {rmsprop}, depth∈ {2, 3, 5}, epochs∈ {50, 100}, learning_rate∈ {0.001, 0.01, 0.05, 0.1, 0.5}, and batch_size∈ {16, 32, 64}. For LTCN, the hyperparameters are the same as those defined in the previous experiment. Note that GNB and FCM-A do not involve relevant hyperparameters to be tuned during grid search.
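Some of the grids listed above can be written as scikit-learn style parameter dictionaries, as sketched below. The dictionary names are hypothetical, only the values stated in the text are used, and the solver-dependent penalty options for LR are expressed as a list of two sub-grids. The string none is written as Python `None` here, which is an assumption about the installed scikit-learn version.

```python
from sklearn.model_selection import ParameterGrid

dt_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_features": ["sqrt", "log2"],
}

# One sub-grid per solver, since the admissible penalties differ.
lr_grid = [
    {"solver": ["lbfgs"], "C": [0.01, 0.1, 1, 10, 100], "penalty": ["l2", None]},
    {"solver": ["saga"], "C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2", None]},
]

svm_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.01, 0.1, 1, 10, 100],
    "gamma": ["scale", "auto"],
}

lgbm_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [10, 20, 30],
    "learning_rate": [0.01, 0.05, 0.1],
}

# Number of candidate configurations explored by the inner grid search.
grid_sizes = {
    "DT": len(ParameterGrid(dt_grid)),
    "LR": len(ParameterGrid(lr_grid)),
    "SVM": len(ParameterGrid(svm_grid)),
    "LGBM": len(ParameterGrid(lgbm_grid)),
}
print(grid_sizes)
```

Counting the candidates this way gives a quick estimate of the tuning budget per classifier, since each candidate is fitted once per inner fold within every outer fold.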
Table 7, Table 8 and Table 9 report the results of the Friedman test and the Wilcoxon signed-rank test with Holm’s correction for all pairs of classifiers within each group. In this study, all tests are computed on a single primary performance metric. The Friedman test is significant in all three groups, namely the FCM-based, white-box, and black-box classifiers. The W/T/L counts and Cohen’s d values supplement the p-values by providing directional evidence and effect size estimates, following the same conventions described earlier in this section.
Table 7 shows that every pair among the FCM-based classifiers is significantly different after Holm’s correction. FCM-A is the weakest method by a large margin, losing against all three competitors on all 30 datasets and producing large negative effect sizes, including against rLTCN and cLTCN. This outcome is expected due to the structural limitation of FCM-A, which relies on a single output neuron and M-1 decision thresholds. FCMMC improves substantially over FCM-A but still loses to both rLTCN and cLTCN on 25 out of 30 datasets. The comparison between rLTCN and cLTCN is the closest in this group, with cLTCN winning on 24 datasets, rLTCN on 2, and 4 ties. This confirms that the Beta correction layer produces a reliable and statistically supported gain over the uncorrected baseline.
Table 8 shows that cLTCN is significantly better than LR, DTs, and GNB after Holm’s correction, with large effect sizes and W/T/L counts heavily favoring cLTCN in all three comparisons. LR and DTs are not significantly different from each other, and the near-zero effect size confirms that their mean scores are practically indistinguishable. DTs and GNB also fail to reach significance after correction, though the W/T/L count (20/0/10) and moderate effect size point to a practical advantage for DTs. LR is substantially better than GNB, winning on 23 out of 30 datasets.
Table 9 shows that cLTCN is significantly better than all three black-box classifiers after Holm’s correction. TabNet is the weakest model in this group, losing to SVMs on 26 datasets, to LGBMs on 26 datasets, and to cLTCN on 27 datasets, with large effect sizes in all three comparisons. SVMs and LGBMs are not significantly different from each other, though SVMs hold a modest W/T/L advantage of 17/2/11. The comparisons of cLTCN against SVMs and against LGBMs are the closest in this group, yet both remain significant after correction, with cLTCN winning on 18 and 22 datasets, respectively.
5.3. Case Study Concerning Churn Prediction
To assess the practical relevance of our correction methods in a real-world scenario, we consider a customer churn prediction case study using the Orange Telecom dataset. This dataset has been widely adopted in the churn modeling literature [59] and represents a typical binary classification task with class imbalance. Churn prediction is a key problem in electronic commerce and subscription-based services, since the accurate identification of customers at risk of attrition helps with retention strategies [60].
The Orange Telecom dataset contains 3333 customer records, each described by 68 features, including service plans, usage statistics, and customer interaction variables. The target variable indicates whether a customer churned within the considered period. The dataset does not contain missing values. The observed churn rate is 14.49%, so 85.51% of customers were retained, which gives an imbalance ratio of 5.9:1 (see Figure 3a). Figure 3b shows the top 10 features correlated with the target. The total number of customer service calls and the features that signal high usage are positively correlated with churning, while voicemail usage is negatively correlated.
Exploratory analysis reveals clear behavioral differences between churned and retained customers. The box plot in Figure 3c indicates that the number of customer service calls is strongly associated with churn. Retained customers make on average 1.45 service calls, whereas churned customers average 2.23 calls. Moreover, customers with four or more service calls exhibit an average churn rate of 51.7%, compared to 11.3% among those with fewer than four calls. Figure 3d decomposes the churn rate by the number of customer service calls. This plot shows that the cut-off point of four or more customer service calls is a predictor of churning. These differences in the number of customer service calls are a clear signal of dissatisfaction among churning customers.
Figure 3e,f show that voicemail usage has the opposite effect. Customers with voicemail activated, representing 27.7% of the sample, have a churn rate of 8.68%, compared to 16.72% among those without voicemail. Another interesting observation is that a high number of voicemail messages is also associated with retention. In contrast, total usage variables such as aggregate call minutes show only minor differences between groups, with churned customers exhibiting slightly higher average usage (see Figure 3g). Finally, Figure 3h shows that geographic variation across U.S. states is also limited, with a modest churn rate variance of 5.76%.
Using the same settings as in the previous section, we measure the performance improvements after adding correction layers to the rLTCN classifier. Table 10 reports the simulation results after learning and post-optimization.
The results show that the Kappa gain for every correction strategy lies in the range of +4.34% to +4.49%. The macro-averaged F1 gains are nearly identical across correction methods, which suggests that all strategies recover minority class recall to a comparable degree. The balanced accuracy and recall gains follow a similar pattern, with Shifting producing the largest of both at the cost of the steepest precision drop of −2.96%. Temperature and Platt scaling obtain identical results across all measures. Beta correction is the most balanced option, combining the highest Kappa gain of +4.49% and the largest PR-AUC improvement of +3.51% with the smallest precision penalty of −0.1%. However, the similarity of the results suggests that all proposed correction strategies redistribute the decision boundary in a similar way for this dataset.