Output Correction of Recurrence-Aware Long-Term Cognitive Network Classifiers

Nápoles, Gonzalo; Grau, Isel; Salgueiro, Yamisleydi

doi:10.3390/bdcc10060178

by Gonzalo Nápoles ¹, Isel Grau ² and Yamisleydi Salgueiro ³,*

¹ Department of Intelligent Systems, Tilburg University, 5037 AB Tilburg, The Netherlands

² Information Systems Group, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands

³ Department of Industrial Engineering, Faculty of Engineering, Universidad de Talca, Campus Curicó, Curicó 3340000, Chile

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2026, 10(6), 178; https://doi.org/10.3390/bdcc10060178

Submission received: 8 March 2026 / Revised: 12 May 2026 / Accepted: 19 May 2026 / Published: 1 June 2026

(This article belongs to the Section Cognitive System)

Abstract

Recurrence-Aware Long-Term Cognitive Network (rLTCN) classifiers have reported comparable performance to mainstream black-box models, including tree ensembles and support vector machines, in tabular pattern classification tasks. These classifiers use a two-step learning algorithm to address issues that arise during the training of recurrent neural networks. While the weights in the recurrent block are computed using unsupervised learning, recurrence-aware weights are determined using a one-step learning rule based on the Moore-Penrose inverse. However, the related least-squares learning problem tends to favor easy instances and common patterns, particularly those associated with the majority class in imbalanced datasets. In such scenarios, a loss function that directly optimizes a robust metric, such as the F1 score, would lead to models with stronger generalization capabilities. Unfortunately, incorporating such a metric into the Moore-Penrose inverse learning procedure presents challenges from a mathematical viewpoint. In this paper, we propose four gradient-based correction methods that modify the output logits of rLTCN classifiers once the two-step training process is done. Inspired by procedures such as Platt or Beta scaling, the proposed post-optimization correction methods seek to maximize the F1 score rather than produce calibrated probabilities. The simulations using real-world datasets show that adding a correction layer to rLTCNs improves their performance significantly at the expense of occasional reductions in the precision metric.

Keywords: long-term cognitive networks; recurrent neural networks; quasi-nonlinear reasoning; post-hoc optimization; cost-sensitive learning

1. Introduction

Fuzzy Cognitive Maps (FCMs) [1] are graph-based recurrent neural networks originally developed to simulate causal reasoning in complex systems. They have since been extended to machine learning tasks such as time series forecasting and pattern classification. FCM-based classifiers employ neural concepts that represent problem features and decision classes, while the weighted edges capture the relationships among them that produce the class labels [2]. During reasoning, the network state in each iteration is determined from the neurons’ activation values, the weights connecting the neural concepts, the activation function, and the reasoning rule. In pattern classification tasks, the final activation values of designated output concepts are used to determine the decision class for each instance, while the temporal network states are usually ignored. Due to their transparent structure, FCM-based classifiers are a suitable alternative to black-box models, especially in domains where interpretability and expert knowledge integration are valued [3]. However, developing FCM-based classifiers that perform as accurately as state-of-the-art black-box classifiers is challenging due to their limited network architecture, their recurrent nature, and the presence of unique fixed-point attractors [4,5].

To overcome these challenges, Nápoles et al. [6] proposed the Recurrence-Aware Long-Term Cognitive Networks (rLTCNs), which remove the weight constraint of classic FCM models. The rLTCNs also introduce a recurrence layer that connects temporal network states to the classifier’s output, which makes them a more expressive and trainable architecture for structured classification problems. Nonetheless, FCM-rooted classifiers often suffer from convergence to unique fixed-point attractors that render the network incapable of recognizing several decision classes. To address this issue, rLTCNs use a quasi-nonlinear reasoning rule [7]. This rule incorporates the initial activation vector into the computation of neurons’ activation values, rather than relying solely on the previous network state. The contribution of each term to the current network state is controlled using a nonlinearity coefficient. As shown in [8], there is a trade-off between the network expressiveness and its ability to approximate nonlinear patterns.

The rLTCN classifier uses a two-step learning algorithm that computes the inner-block weights via unsupervised learning and the outer-layer weights via the Moore-Penrose inverse. The Moore-Penrose inverse provides a closed-form solution to linear systems and yields the weight matrix that minimizes the mean squared error between predicted and target outputs. It is highly efficient but favors frequent or easily classified examples, as their collective error dominates the loss. As a result, the model may underperform on rare or difficult instances, particularly in imbalanced datasets where the majority class skews the learning process. A loss function that directly optimizes the F1 score would lead to models with stronger generalization capabilities under class imbalance. However, incorporating the F1 score directly into inverse-based learning is challenging since the F1 score is non-differentiable and defined in terms of discrete prediction outcomes (i.e., true positives, false positives, and false negatives). These quantities depend on hard thresholds rather than continuous-valued outputs, which prevents direct optimization through gradient-free closed-form solutions like the Moore-Penrose inverse.

In this paper, we propose four gradient-based correction methods that modify the output logits of rLTCN classifiers once the two-step training process is complete. These methods apply class-specific transformations to the model outputs, including additive shifts, temperature factors, and nonlinear mappings. Each transformation is learned to maximize a soft variant of the F1 score, which is continuous and differentiable under reasonable assumptions. The corrections are intended to help in two situations. First, under class imbalance, the proposed methods help recover recall for minority classes without degrading overall accuracy. Second, under overlapping or nonlinear class boundaries, they allow the model to refine its outputs in regions that matter most for the F1 metric. While inspired by techniques such as Platt or Beta scaling, our post-hoc output correction methods focus on classification performance rather than on probability calibration. It should be noted that calibration methods aim to align model scores with correctness likelihoods, which is useful in uncertainty scenarios. However, they do not necessarily improve classification performance, particularly under class imbalance or complex decision boundaries. In that regard, our methods are closer to the approach of Froelich [9], who introduced a correction method to adjust decision thresholds for FCM classifiers using a single output neuron. The numerical simulations using real-world datasets show that rLTCN classifiers equipped with a correction layer outperform the baseline model, the decision threshold correction approach, as well as multiple FCM-based and traditional machine learning classifiers.

The rest of the paper is organized as follows. Section 2 covers relevant efforts to build FCM-based classifiers, while Section 3 defines the theoretical background of the rLTCN classifier. Section 4 elaborates on the four correction methods proposed in this paper. Section 5 presents the simulation results across benchmarks and Section 6 provides concluding remarks and further research directions.

2. Related Work

Early efforts to adapt FCMs for classification introduced modifications to the traditional FCM structure and learning mechanisms [10]. However, the issues caused by unique fixed-point attractors, the lack of solid learning algorithms, and the limited network architectures [4] soon revealed how difficult it is to build accurate FCM classifiers.

Subsequently, research on FCM classifiers shifted toward architectures that enhance expressiveness and stability. For example, Rough Cognitive Networks [11,12] are FCM-rooted classifiers where concepts represent information granules derived from rough set theory. Later works by Concepción et al. [13] and Harmati [14] examined the convergence properties of these models, leading to simpler yet more robust classifiers. Following a different path, Szwed [15] put forth a class-per-output classifier that detected early signs of convergence to unique fixed points while employing backpropagation to fine-tune the weight matrix. Other approaches have resorted to relaxing the causality constraint of weights [16], where weights are no longer confined to the $[-1, 1]$ interval. More recently, Quesada et al. [17] proposed a class-per-output FCM classifier using relative activation values, the quasi-nonlinear reasoning rule, and a backpropagation learning algorithm to fine-tune synaptic and nonsynaptic parameters. The resulting quasi-nonlinear FCM classifier outperformed the classic approach on real-world datasets.

Recent advances have also explored hybrid architectures and theory-driven learning strategies to improve model performance. For instance, Karatzinis et al. [18] explored the concept of functional weights by using polynomial approximators of varying degrees derived from fuzzy inference systems. Aligned with deep learning models, Tianming et al. [19] proposed a class-per-output FCM classifier that incorporated a capsule network architecture into the inference mechanism to capture spatial relationships in the data. On a different path, Tyrovolas et al. [20] addressed spurious correlations in FCMs using Liang-Kleeman Information Flow, achieving second-best accuracy on a classification benchmark, only surpassed by the rLTCN classifier. This model has also been used for interpretable power grid overload detection [21] and fault diagnosis in industrial robotics [22]. In parallel, Yin [23] developed a fuzzy model for imbalanced data that combined FCMs with hypersphere information granules, which reduces distribution complexities such as small clusters and irregular boundaries.

The literature also includes advances in the context of time series classification. For instance, Homenda & Jastrzębska [24] put forth a time series classifier based on the similarity of weight matrices across instances. In their approach, concepts are extracted via clustering and evaluated with membership functions. Wu et al. [25] proposed Broad FCM systems, which integrate sparse autoencoder-based feature extraction, high-order FCM spatiotemporal aggregation, and a multilayer perceptron prediction layer. Meanwhile, Wesołowski et al. [26] introduced an ensemble approach that decomposes multiclass classification into binary subproblems using methods like one-vs-one and one-vs-all, and aggregates predicted memberships through voting to improve generalization. In [27], the authors compared FCM classifiers and Hidden Markov Models using both one-model-per-class and one-model-per-series approaches. In their empirical studies, they found that instance-specific models generally outperform class-level ones.

FCMs continue to be applied across diverse classification domains. For example, Hilal et al. [28] used FCMs for remote sensing image classification in combination with features extracted via RetinaNet and a swarm intelligence optimizer. Sovatzidi et al. [29] introduced an interpretable FCM classifier where concepts represent semantic granules formed from clusters of similar images. In addition, in [30], the authors developed a multi-label image classification method combining transformers and FCMs to enhance interpretability, and in [31], they proposed an automatic FCM model for explainable pneumonia detection. Karaköse [32] integrated FCMs with convolutional and transformer-based deep learning models, combined their outputs using an FCM layer, and tested multiple loss functions for efficient satellite image classification. FCMs have also been extended to prescriptive analysis in business intelligence [33] by creating and adjusting the weights of prescriptive concepts using metaheuristics. In the healthcare domain, Hoyos et al. [34] proposed federated FCM models for clinical decision-making in the management of dengue disease. Salmeron et al. introduced FCM-based solutions for vertical and horizontal federated learning [35], and a blind training framework that operates without a predefined global model [36]. More recently, Gagnon-Dufresne et al. [37] analyzed participation in global health research, while Dhir et al. [38] focused on participation in research on aging, modeled with FCMs. These contributions highlight the practical usability of FCMs and motivate the research towards solving their structural limitations in classification tasks.

3. The rLTCN Classifier

The rLTCN classifier is a recurrent neural model that integrates temporal states generated during the reasoning process. Let $X$ be a $K \times N$ input matrix, where $K$ represents the number of instances and $N$ denotes the number of features. Similarly, let $Y$ be a $K \times M$ output matrix, where $M$ gives the number of decision classes.

The model consists of an inner block that performs recurrent reasoning and an outer block that maps temporal representations to class outputs. The inner block is an LTCN module [39] using the quasi-nonlinear reasoning rule. The process begins with an initial activation matrix $A^{(0)} = X$, representing the input data about the problem variables. At each iteration $t \in \{1, \dots, T\}$, where $T$ is the predefined number of reasoning steps, the network state is updated using the following rule:

$$A^{(t)} = \phi \cdot f\left(A^{(t-1)} W\right) + (1 - \phi) \cdot A^{(0)}, \tag{1}$$

where $W$ is a fixed $N \times N$ weight matrix encoding the relationships among input features and $f(\cdot)$ is a nonlinear activation function such as the sigmoid or hyperbolic tangent, applied element-wise. The scalar $\phi > 0$ is the hyperparameter that regulates the contributions of the nonlinear and linear components. This iterative procedure produces a sequence of states $A^{(0)}, A^{(1)}, \dots, A^{(T)}$, which are concatenated into a single matrix $H^{(T)} = [A^{(0)} \mid A^{(1)} \mid \dots \mid A^{(T)}]$, referred to as the global network state, since it captures the full trajectory of the network’s reasoning over $T$ iterations.
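The reasoning rule in Equation (1) and the concatenation into the global network state can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `sigmoid` and `global_state`, the default values of `phi` and `T`, and the random placeholder weights are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, one possible choice for f."""
    return 1.0 / (1.0 + np.exp(-x))


def global_state(X, W, phi=0.8, T=5, f=sigmoid):
    """Apply Eq. (1) for T steps and return H^(T) = [A^(0) | ... | A^(T)]."""
    A0 = np.asarray(X, dtype=float)
    A, states = A0, [A0]
    for _ in range(T):
        # Nonlinear update mixed with the initial activation matrix
        A = phi * f(A @ W) + (1.0 - phi) * A0
        states.append(A)
    return np.hstack(states)  # shape K x N*(T+1)


rng = np.random.default_rng(0)
X = rng.random((6, 3))                # K = 6 instances, N = 3 features
W = rng.uniform(-1, 1, size=(3, 3))   # placeholder weights for illustration only
H = global_state(X, W, phi=0.8, T=4)
print(H.shape)  # (6, 15)
```

The first $N$ columns of the result are the inputs themselves, so the output layer always has direct access to $A^{(0)}$ in addition to the later states.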

The outer block uses $H^{(T)}$ to generate the model’s outputs $Y$, from which class labels are obtained. This is done through a recurrence-aware output layer defined by

$$Y = H^{(T)} R \oplus B, \tag{2}$$

where $R$ is an $N \cdot (T + 1) \times M$ matrix of learnable weights that connect the temporal states to the output neurons, $B$ is a $1 \times M$ bias vector, and $\oplus$ denotes broadcasting addition between the matrix and the bias vector.

The rLTCN classifier operates in batch mode, processing all K input instances simultaneously. For each instance, the predicted class corresponds to the output neuron with the highest activation. Unlike other FCM-based classifiers that rely only on the final state for the output activation calculations, this model uses the entire sequence of temporal states generated during reasoning. Figure 1 shows how the rLTCN classifier exploits the relationships between the problem features.

In this model, the number of learnable parameters scales with $N \times (T + 1) \times M$, which represents the expanded capacity gained by using all temporal states. In addition, the quasi-nonlinear reasoning ensures that the network will never converge to a unique fixed-point attractor as long as $\phi < 1$, which allows for recognizing multiple decision classes.

Concerning the unsupervised learning step, the weight matrix $W$ in the inner block is computed using the right singular vectors of the input data matrix $X$:

$$W = \frac{V}{\max(|V|)}, \tag{3}$$

where $X = U \Sigma V^{\top}$ is the compact singular value decomposition of $X$ and $V$ contains the right singular vectors. This initialization has the following mathematical properties: the columns of $V$ are orthonormal, span the principal directions of maximum variance in the data, and form a stable, data-driven basis for the recurrent connections. The element-wise scaling by $\max(|V|)$ guarantees that all weights satisfy $w_{ji} \in [-1, 1]$. While not required in LTCN-based models, bounded weights in a fixed interval allow for interpretability while injecting meaningful linear feature interactions.
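As a rough sketch of Equation (3), the inner weights can be obtained from NumPy's SVD routine. The helper name `inner_weights` is hypothetical, and the example assumes $K \geq N$ so that the compact decomposition yields a square $N \times N$ matrix $V$.

```python
import numpy as np


def inner_weights(X):
    """Eq. (3): right singular vectors of X scaled by their largest absolute entry."""
    # full_matrices=False gives the compact SVD; Vt has shape min(K, N) x N
    _, _, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    V = Vt.T
    return V / np.max(np.abs(V))


rng = np.random.default_rng(0)
X = rng.random((10, 4))      # K = 10 instances, N = 4 features (assumes K >= N)
W = inner_weights(X)
print(W.shape)               # (4, 4)
```

After scaling, the largest absolute weight equals one and the columns remain mutually orthogonal, although they are no longer unit length.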

As for the supervised learning step, Equation (4) formalizes how to compute the weight matrix $R$ and the bias vector $B$ using a regularized pseudoinverse learning rule:

$$\begin{bmatrix} R \\ B \end{bmatrix} = \left[\left(H^{(T)} \mid \mathbf{1}\right)^{\top} \left(H^{(T)} \mid \mathbf{1}\right) + \lambda I\right]^{\dagger} \left(H^{(T)} \mid \mathbf{1}\right)^{\top} Y, \tag{4}$$

where $\mathbf{1}$ denotes a $K \times 1$ column vector of ones, $I$ is the identity matrix with compatible dimensions, and $\lambda > 0$ is a regularization parameter. Here the superscript $\top$ denotes matrix transposition, not the number of reasoning steps $T$. This formulation corresponds to a least-squares problem with Tikhonov regularization, which reduces overfitting and improves numerical stability compared to the unregularized pseudoinverse.
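Equations (2) and (4) translate almost directly into NumPy. The sketch below is illustrative only: `fit_output_layer` and `predict` are hypothetical names, the value of `lam` is arbitrary, and, following Equation (4) literally, the bias row is regularized together with the weights.

```python
import numpy as np


def fit_output_layer(H, Y, lam=1e-2):
    """Eq. (4): regularized pseudoinverse solution for R and B."""
    K = H.shape[0]
    Ha = np.hstack([H, np.ones((K, 1))])            # augmented matrix (H | 1)
    G = Ha.T @ Ha + lam * np.eye(Ha.shape[1])       # Gram matrix plus lambda * I
    RB = np.linalg.pinv(G) @ Ha.T @ Y               # stacked solution [R; B]
    return RB[:-1], RB[-1:]                         # R: N(T+1) x M, B: 1 x M


def predict(H, R, B):
    """Eq. (2) followed by the argmax decision rule."""
    Z = H @ R + B                                   # NumPy broadcasting adds B to every row
    return Z, Z.argmax(axis=1)


rng = np.random.default_rng(0)
H = rng.random((20, 5))                             # stand-in for the global network state
Y = np.eye(3)[rng.integers(0, 3, size=20)]          # one-hot targets, M = 3
R, B = fit_output_layer(H, Y, lam=0.1)
Z, labels = predict(H, R, B)
```

The matrix `Z` holds the raw logits $z_k^{(i)}$ that the correction methods of Section 4 take as input.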

Beyond its predictive capabilities, the rLTCN classifier offers a degree of intrinsic interpretability since it does not include hidden neurons. Instead, every neural concept in the network has a well-defined meaning for the problem domain being modeled. This architecture enables the derivation of a feature relevance score [6] that quantifies the contribution of each input feature to the classifier prediction. The feature relevance score aggregates the absolute values of the inner and outer weights computed during the unsupervised and supervised learning steps, respectively. The intuition behind this score is that features represented by neural concepts with large outgoing weights contribute more to the prediction [40]. Notice that rLTCN’s interpretability is not equivalent to being able to visualize the model as a whole, in the same way that a large decision tree becomes impractical to inspect despite being formally interpretable. Rather, it provides a mechanism to identify which problem features drive the classification.

4. Theoretical Contributions

In this section, we introduce a family of correction methods that modify the outputs produced by rLTCN classifiers to improve their performance once training is complete. These correction methods adjust the logits produced by the network to maximize the F1 score, which is a more robust performance metric than accuracy under class imbalance. Conceptually, the proposed methods are closer to the class-specific thresholding strategies studied in neural systems for addressing class imbalance [41,42] than to probability calibration methods such as Platt or Beta scaling. While calibration methods aim to align output scores with correctness likelihoods, the proposed corrections instead reshape the logit space to directly maximize a performance metric. This makes them decision-boundary adjustment procedures rather than probability estimation tools.

It should be mentioned that direct optimization of the F1 score is challenging in standard gradient-based learning, since F1 is non-differentiable and defined in terms of discrete prediction thresholds. To address this, the methods proposed in this section use a differentiable surrogate that approximates the F1 score using soft model outputs [43,44,45]. This allows gradient-based optimization of correction parameters while preserving the metric’s sensitivity to class-wise performance. In this paper, we use a proxy for the F1 score associated with the k-th decision class, which is given by

$$\mathrm{F1}_{k}(a_{k}, c_{k}) = \frac{2 \cdot \mathrm{TP}_{k}}{2 \cdot \mathrm{TP}_{k} + \mathrm{FP}_{k} + \mathrm{FN}_{k} + \epsilon}, \tag{5}$$

where

$$\mathrm{TP}_{k} = \sum_{i} y_{k}^{(i)} f_{k}(z_{k}^{(i)}), \tag{6}$$

$$\mathrm{FP}_{k} = \sum_{i} (1 - y_{k}^{(i)}) f_{k}(z_{k}^{(i)}), \tag{7}$$

$$\mathrm{FN}_{k} = \sum_{i} y_{k}^{(i)} (1 - f_{k}(z_{k}^{(i)})), \tag{8}$$

such that $y_{k}^{(i)} \in \{0, 1\}$ is the binary ground-truth indicator that equals one if instance $i$ belongs to class $k$ and zero otherwise. The variable $z_{k}^{(i)} \in \mathbb{R}$ gives the raw model output associated with class $k$ before any correction is applied, while $f_{k}(z_{k}^{(i)}) \in [0, 1]$ is the score assigned to class $k$ after applying a post-hoc correction method to the original output. The constant $\epsilon > 0$ is added for numerical stability.
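For concreteness, the soft counts in Equations (6)-(8) and the proxy in Equation (5) can be computed for all classes at once. The snippet is a sketch with a hypothetical function name `soft_f1`. It assumes one-hot targets `Y` and corrected scores `F` stored as $K \times M$ arrays.

```python
import numpy as np


def soft_f1(Y, F, eps=1e-8):
    """Per-class soft F1 (Eq. 5) from one-hot targets Y and scores F in [0, 1]."""
    tp = (Y * F).sum(axis=0)           # Eq. (6): soft true positives
    fp = ((1 - Y) * F).sum(axis=0)     # Eq. (7): soft false positives
    fn = (Y * (1 - F)).sum(axis=0)     # Eq. (8): soft false negatives
    return 2 * tp / (2 * tp + fp + fn + eps)


Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
F = np.array([[1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)  # hard 0/1 scores
print(soft_f1(Y, F))  # approximately [0.6667, 0.8]
```

When the scores are hard 0/1 decisions, as in this toy input, the proxy coincides with the usual F1 score up to the $\epsilon$ term, which is why it can serve as a smooth stand-in during optimization.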

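The overall multiclass training objective aggregates the class-wise soft F1 scores into a single scalar by computing their macro-average across all $M$ decision classes:

$$L(\theta) = \frac{1}{M} \sum_{k=1}^{M} \mathrm{F1}_{k}(\theta_{k}), \tag{9}$$

where $\theta_{k}$ collects the correction parameters specific to class $k$. When each $\mathrm{F1}_{k}$ depends only on $\theta_{k}$, as in the sigmoid-based Platt and Beta corrections, the gradient of $L$ with respect to any class-specific parameter $\theta_{k}$ reduces to $\frac{1}{M} \frac{\partial \mathrm{F1}_{k}}{\partial \theta_{k}}$, and the parameters for different classes are updated independently. This decoupling does not carry over to the softmax-based Shifting and Temperature corrections: their shared denominator makes every $\mathrm{F1}_{k}$ depend on the parameters of all classes, so those parameters are optimized jointly through the same macro-averaged objective.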

The two correction methods that operate through a coupled softmax function, namely Shifting and Temperature, produce a proper probability vector for each instance, so the predicted class at test time is the one with the highest corrected softmax score, exactly as in the uncorrected rLTCN. The two correction methods that operate through independent per-class sigmoid functions, namely Platt and Beta, do not produce a coupled output. Each class score $f_{k}(z_{k}^{(i)})$ is computed independently, and the resulting scores do not sum to one. At test time, the predicted class is the one with the highest corrected sigmoid score, that is, $\hat{y}^{(i)} = \arg\max_{k} f_{k}(z_{k}^{(i)})$. In the rare event of a tie between two or more classes, the tie is broken by selecting the class with the highest raw logit $z_{k}^{(i)}$ before correction. This rule is applied uniformly across all four methods when ties occur. It should be noted that ties are uncommon in practice because the sigmoid function is strictly monotone and the correction parameters differ across classes, so exact equality of corrected scores across two or more classes is unlikely when the raw logits are distinct.
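The decision rule with raw-logit tie-breaking can be written compactly. The following is one possible reading of the rule, with a hypothetical helper name; it treats two corrected scores as tied only when they are exactly equal as floating-point numbers, which is an assumption of this sketch.

```python
import numpy as np


def predict_with_tiebreak(F, Z):
    """Pick the class with the highest corrected score; break ties by raw logit."""
    best = F.max(axis=1, keepdims=True)
    tied = F == best                          # classes sharing the top corrected score
    Z_masked = np.where(tied, Z, -np.inf)     # only tied classes stay eligible
    return Z_masked.argmax(axis=1)


F = np.array([[0.5, 0.5, 0.1],                # classes 0 and 1 tie on the corrected score
              [0.2, 0.9, 0.3]])
Z = np.array([[1.0, 2.0, 5.0],
              [0.0, 0.0, 0.0]])
print(predict_with_tiebreak(F, Z))            # [1 1]
```

In the first row, class 2 has the largest raw logit but is not among the tied classes, so it is never selected.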

4.1. Shifting Correction

In the shifting correction method, we use class-specific bias terms $\beta_{k}$ to shift the output logits before applying the softmax activation function. This linear transformation independently adjusts each class’s contribution to the final prediction, which allows the model to emphasize or de-emphasize classes based on observed performance. From a geometric perspective, shifting the logits translates the class scores in the output space, modifying the decision boundaries between classes. This is particularly beneficial in imbalanced classification scenarios, as it helps reduce the dominance of frequent classes and enhances the representation of underrepresented ones. After this linear transformation, the corrected softmax score for class $k$ is computed as follows:

$$f_{k}(z^{(i)}) = \frac{\exp(z_{k}^{(i)} - \beta_{k})}{\sum_{j} \exp(z_{j}^{(i)} - \beta_{j})}. \tag{10}$$

To optimize this expression with respect to $\beta \in \mathbb{R}^{M}$ (one shift per decision class), we compute gradients of the softmax output $f_{k}$ with respect to the bias terms as done below:

$$\frac{\partial f_{k}(z^{(i)})}{\partial \beta_{j}} = \begin{cases} - f_{k}(z^{(i)}) \left(1 - f_{k}(z^{(i)})\right) & \text{if } j = k \\ f_{k}(z^{(i)}) f_{j}(z^{(i)}) & \text{if } j \neq k. \end{cases} \tag{11}$$

Next, let $D_{k} = 2 \cdot \mathrm{TP}_{k} + \mathrm{FP}_{k} + \mathrm{FN}_{k} + \epsilon$. Then, the partial derivative of the soft F1 score with respect to each $\beta_{j}$ parameter is given by

$$\frac{\partial \mathrm{F1}_{k}}{\partial \beta_{j}} = \frac{2}{D_{k}^{2}} \left[D_{k} \cdot \frac{\partial \mathrm{TP}_{k}}{\partial \beta_{j}} - \mathrm{TP}_{k} \cdot P\right], \tag{12}$$

such that

$$P = 2 \frac{\partial \mathrm{TP}_{k}}{\partial \beta_{j}} + \frac{\partial \mathrm{FP}_{k}}{\partial \beta_{j}} + \frac{\partial \mathrm{FN}_{k}}{\partial \beta_{j}}, \tag{13}$$

and

$$\frac{\partial \mathrm{TP}_{k}}{\partial \beta_{j}} = \sum_{i} y_{k}^{(i)} \frac{\partial f_{k}(z^{(i)})}{\partial \beta_{j}}, \tag{14}$$

$$\frac{\partial \mathrm{FP}_{k}}{\partial \beta_{j}} = \sum_{i} (1 - y_{k}^{(i)}) \frac{\partial f_{k}(z^{(i)})}{\partial \beta_{j}}, \tag{15}$$

$$\frac{\partial \mathrm{FN}_{k}}{\partial \beta_{j}} = - \sum_{i} y_{k}^{(i)} \frac{\partial f_{k}(z^{(i)})}{\partial \beta_{j}}. \tag{16}$$

Because the softmax couples all class scores through a shared denominator, the gradient $\frac{\partial \mathrm{F1}_{k}}{\partial \beta_{j}}$ for $j \neq k$ is nonzero, and the parameters $\beta$ for all classes are therefore optimized jointly through the macro-averaged objective in Equation (9).
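The joint gradient can be assembled from Equations (10)-(16) and the macro-average in Equation (9). The code below is a sketch under stated assumptions: the function names, learning rate, and number of ascent steps are illustrative choices, and the logits `Z` are random stand-ins rather than outputs of a trained rLTCN.

```python
import numpy as np


def shifted_softmax(Z, beta):
    """Eq. (10): softmax of the shifted logits z_k - beta_k."""
    S = Z - beta
    E = np.exp(S - S.max(axis=1, keepdims=True))  # max-subtraction for stability
    return E / E.sum(axis=1, keepdims=True)


def macro_soft_f1(Y, F, eps=1e-8):
    """Eq. (9): mean of the class-wise soft F1 scores of Eq. (5)."""
    tp = (Y * F).sum(axis=0)
    fp = ((1 - Y) * F).sum(axis=0)
    fn = (Y * (1 - F)).sum(axis=0)
    return np.mean(2 * tp / (2 * tp + fp + fn + eps))


def grad_beta(Z, Y, beta, eps=1e-8):
    """Gradient of the macro soft F1 with respect to every beta_j."""
    F = shifted_softmax(Z, beta)
    tp = (Y * F).sum(axis=0)
    D = 2 * tp + ((1 - Y) * F).sum(axis=0) + (Y * (1 - F)).sum(axis=0) + eps
    # dF1_k / df_k^(i): Eq. (12) with the sums over i left unexpanded
    G = 2 * (Y * D - tp) / D**2
    # Eq. (11) in one expression: df_k/dbeta_j = f_k f_j - [j == k] f_k
    GF = G * F
    grad = (GF.sum(axis=1, keepdims=True) * F - GF).sum(axis=0)
    return grad / Z.shape[1]


rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 3))                       # stand-in raw logits, M = 3
Y = np.eye(3)[rng.integers(0, 3, size=30)]         # one-hot labels
beta = np.zeros(3)
for _ in range(50):
    beta += 0.5 * grad_beta(Z, Y, beta)            # plain gradient ascent
```

Since adding the same constant to every $\beta_{k}$ leaves the softmax unchanged, the components of this gradient sum to zero, which is a convenient sanity check when implementing the method.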

4.2. Temperature Correction

The traditional temperature-scaling method is a simple post-processing procedure designed for probability calibration [46,47]. In its standard form, a single scalar temperature $T > 0$ is used to rescale all logits before applying the softmax function. In this paper, we modify this method for correcting the output logits of rLTCN classifiers by introducing a separate temperature parameter $T_{k} > 0$ for each decision class $k$. The corrected softmax output can be rewritten as shown below:

$$f_{k}(z^{(i)}) = \frac{\exp(z_{k}^{(i)} / T_{k})}{\sum_{j=1}^{M} \exp(z_{j}^{(i)} / T_{j})}. \tag{17}$$

The gradient of the corrected output $f_{k}$ with respect to its own temperature $T_{k}$ is derived as follows. First, let us define $s_{j}^{(i)}$ as follows:

$$s_{j}^{(i)} = \exp(z_{j}^{(i)} / T_{j}), \tag{18}$$

so that

$$f_{k}(z^{(i)}) = \frac{s_{k}^{(i)}}{\sum_{j=1}^{M} s_{j}^{(i)}}. \tag{19}$$

Then, the partial derivative is given by

$$\frac{\partial f_{k}(z^{(i)})}{\partial T_{k}} = \frac{\left(\frac{\partial s_{k}^{(i)}}{\partial T_{k}}\right) \left(\sum_{j=1}^{M} s_{j}^{(i)}\right) - s_{k}^{(i)} \left(\frac{\partial s_{k}^{(i)}}{\partial T_{k}}\right)}{\left(\sum_{j=1}^{M} s_{j}^{(i)}\right)^{2}}, \tag{20}$$

where

$$\frac{\partial s_{k}^{(i)}}{\partial T_{k}} = s_{k}^{(i)} \cdot \left(- \frac{z_{k}^{(i)}}{T_{k}^{2}}\right). \tag{21}$$

Simplifying this, we obtain

$$\frac{\partial f_{k}(z^{(i)})}{\partial T_{k}} = - \frac{z_{k}^{(i)}}{T_{k}^{2}} \cdot f_{k}(z^{(i)}) \cdot \left(1 - f_{k}(z^{(i)})\right). \tag{22}$$

Similarly, the gradient with respect to another class’s temperature $T_{j}$ for $j \neq k$ is derived by noting that only $s_{j}^{(i)}$ in the denominator depends on $T_{j}$:

$$\frac{\partial f_{k}(z^{(i)})}{\partial T_{j}} = \frac{- s_{k}^{(i)} \left(\frac{\partial s_{j}^{(i)}}{\partial T_{j}}\right)}{\left(\sum_{l=1}^{M} s_{l}^{(i)}\right)^{2}}, \tag{23}$$

where

$$\frac{\partial s_{j}^{(i)}}{\partial T_{j}} = s_{j}^{(i)} \cdot \left(- \frac{z_{j}^{(i)}}{T_{j}^{2}}\right). \tag{24}$$

Substituting and simplifying this, we obtain

$$\frac{\partial f_{k}(z^{(i)})}{\partial T_{j}} = f_{k}(z^{(i)}) f_{j}(z^{(i)}) \frac{z_{j}^{(i)}}{T_{j}^{2}}. \tag{25}$$

These gradients allow us to update the vector of per-class temperatures $T = [T_{1}, T_{2}, \dots, T_{M}]$ using gradient ascent to maximize class-specific F1 scores. To ensure numerical stability and positivity, each temperature parameter is parameterized as $T_{k} = \exp(\tau_{k})$, where $\tau_{k} \in \mathbb{R}$. The gradient with respect to $\tau_{k}$ is then obtained via the chain rule:

$$\frac{\partial \mathrm{F1}_{k}}{\partial \tau_{k}} = \frac{\partial \mathrm{F1}_{k}}{\partial T_{k}} \cdot \frac{\partial T_{k}}{\partial \tau_{k}} = \frac{\partial \mathrm{F1}_{k}}{\partial T_{k}} \cdot \exp(\tau_{k}). \tag{26}$$
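The own-temperature and cross-temperature derivatives in Equations (22) and (25) can be combined into one vectorized expression and passed through the chain rule of Equation (26). This is a sketch rather than the reference implementation: the names are hypothetical, the gradient shown is that of the macro-averaged objective in Equation (9) (so it sums the contributions of all classes to each temperature), and the logits are random placeholders.

```python
import numpy as np


def temp_softmax(Z, tau):
    """Eq. (17) with T_k = exp(tau_k), which keeps every temperature positive."""
    S = Z / np.exp(tau)
    E = np.exp(S - S.max(axis=1, keepdims=True))  # max-subtraction for stability
    return E / E.sum(axis=1, keepdims=True)


def macro_soft_f1(Y, F, eps=1e-8):
    """Eq. (9): mean of the class-wise soft F1 scores of Eq. (5)."""
    tp = (Y * F).sum(axis=0)
    fp = ((1 - Y) * F).sum(axis=0)
    fn = (Y * (1 - F)).sum(axis=0)
    return np.mean(2 * tp / (2 * tp + fp + fn + eps))


def grad_tau(Z, Y, tau, eps=1e-8):
    """Gradient of the macro soft F1 with respect to every tau_j."""
    T = np.exp(tau)
    F = temp_softmax(Z, tau)
    tp = (Y * F).sum(axis=0)
    D = 2 * tp + ((1 - Y) * F).sum(axis=0) + (Y * (1 - F)).sum(axis=0) + eps
    G = 2 * (Y * D - tp) / D**2                   # dF1_k / df_k^(i)
    # Eqs. (22) and (25) together: df_k/dT_j = (z_j / T_j^2) (f_k f_j - [j == k] f_k)
    GF = G * F
    dT = ((Z / T**2) * (GF.sum(axis=1, keepdims=True) * F - GF)).sum(axis=0)
    return dT * T / Z.shape[1]                    # chain rule of Eq. (26)


rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 3))                      # stand-in raw logits, M = 3
Y = np.eye(3)[rng.integers(0, 3, size=30)]
tau = np.zeros(3)                                 # T_k = 1 leaves the softmax unchanged
for _ in range(50):
    tau += 0.5 * grad_tau(Z, Y, tau)              # plain gradient ascent
```

Starting from $\tau_{k} = 0$ means the first iterate reproduces the uncorrected softmax, so any change in the objective is attributable to the learned temperatures.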

4.3. Platt Correction

The traditional Platt scaling is a binary post-processing method that maps raw logits to calibrated probabilities using a sigmoid function, with parameters learned to minimize log-loss [48,49]. The method described in this section extends Platt scaling from probability calibration to output correction by treating the sigmoid transformation as a flexible mechanism to reshape output scores, not necessarily to interpret them as probabilities. Moreover, it generalizes the approach to multiclass classification by fitting class-specific correction parameters and directly optimizing the F1 score proxy.

For a given logit $z_{k} \in \mathbb{R}$, the corrected scores are given by

$$f_{k}(z_{k}) = \sigma(a_{k} z_{k} + c_{k}) = \frac{1}{1 + \exp(- a_{k} z_{k} - c_{k})}, \tag{27}$$

where $a_{k}$ and $c_{k}$ are class-specific correction parameters and $\sigma$ denotes the sigmoid function. Since the sigmoid is applied independently to each class logit, the corrected scores $f_{k}(z_{k}^{(i)})$ across classes are decoupled and do not sum to one. Each class $k$ is treated as a one-vs-rest binary problem during optimization, with $y_{k}^{(i)} \in \{0, 1\}$ serving as the binary indicator. The parameters $(a_{k}, c_{k})$ for class $k$ are optimized by ascending the gradient of $\mathrm{F1}_{k}$ with respect to those parameters alone, and the macro-averaged objective in Equation (9) is maximized by independently solving $M$ such binary subproblems. At test time, the predicted class is the one with the highest corrected sigmoid score, with ties broken by the highest raw logit, as described in the preamble of this section.

Our variant improves the model performance by adjusting the corrected output through $a_{k}$ and $c_{k}$. To optimize these parameters with respect to the F1 score, we compute their gradients via the chain rule as described below.

To start the derivation of this method, the partial derivatives of the corrected output with respect to the parameters are given as shown below:

$$\frac{\partial f_{k}(z_{k}^{(i)})}{\partial a_{k}} = f_{k}(z_{k}^{(i)}) \left(1 - f_{k}(z_{k}^{(i)})\right) z_{k}^{(i)}, \tag{28}$$

$$\frac{\partial f_{k}(z_{k}^{(i)})}{\partial c_{k}} = f_{k}(z_{k}^{(i)}) \left(1 - f_{k}(z_{k}^{(i)})\right). \tag{29}$$

Let $N_{k}$ and $D_{k}$ denote the numerator and denominator of the approximate F1 measure in Equation (5), that is, $N_{k} = 2 \cdot \mathrm{TP}_{k}$ and $D_{k} = 2 \cdot \mathrm{TP}_{k} + \mathrm{FP}_{k} + \mathrm{FN}_{k} + \epsilon$. We compute the partial derivatives of the components with respect to a single corrected output $f_{k}(z_{k}^{(j)})$ as formalized in the expressions below:

$$\frac{\partial \mathrm{TP}_{k}}{\partial f_{k}(z_{k}^{(j)})} = y_{k}^{(j)}, \tag{30}$$

$$\frac{\partial \mathrm{FP}_{k}}{\partial f_{k}(z_{k}^{(j)})} = 1 - y_{k}^{(j)}, \tag{31}$$

$$\frac{\partial \mathrm{FN}_{k}}{\partial f_{k}(z_{k}^{(j)})} = - y_{k}^{(j)}. \tag{32}$$

From this, the derivative of $D_{k}$ becomes

$$\frac{\partial D_{k}}{\partial f_{k}(z_{k}^{(j)})} = 2 y_{k}^{(j)} + (1 - y_{k}^{(j)}) - y_{k}^{(j)} = 1. \tag{33}$$

Now, using the quotient rule, the derivative of $\mathrm{F1}_{k}$ with respect to $f_{k}(z_{k}^{(j)})$ is given by

$$\frac{\partial \mathrm{F1}_{k}}{\partial f_{k}(z_{k}^{(j)})} = \frac{(2 y_{k}^{(j)}) D_{k} - N_{k}}{D_{k}^{2}}. \tag{34}$$

Finally, we obtain the gradients of the F1 score with respect to the output correction parameters as formalized below:

$$\frac{\partial \mathrm{F1}_{k}}{\partial a_{k}} = \sum_{i} \frac{\partial \mathrm{F1}_{k}}{\partial f_{k}(z_{k}^{(i)})} \cdot \frac{\partial f_{k}(z_{k}^{(i)})}{\partial a_{k}}, \tag{35}$$

$$\frac{\partial \mathrm{F1}_{k}}{\partial c_{k}} = \sum_{i} \frac{\partial \mathrm{F1}_{k}}{\partial f_{k}(z_{k}^{(i)})} \cdot \frac{\partial f_{k}(z_{k}^{(i)})}{\partial c_{k}}. \tag{36}$$
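Equations (27)-(36) describe a single one-vs-rest subproblem, which can be sketched as follows. The function names, the initial values $a_{k} = 1$ and $c_{k} = 0$, the learning rate, and the number of steps are assumptions made for this illustration; the source does not specify them at this point.

```python
import numpy as np


def platt_score(z, a, c):
    """Eq. (27): sigmoid of an affine transformation of the raw logit."""
    return 1.0 / (1.0 + np.exp(-(a * z + c)))


def platt_soft_f1(z, y, a, c, eps=1e-8):
    """Soft F1 of one class (Eq. 5) under the Platt correction."""
    f = platt_score(z, a, c)
    tp = (y * f).sum()
    return 2 * tp / (2 * tp + ((1 - y) * f).sum() + (y * (1 - f)).sum() + eps)


def platt_grad(z, y, a, c, eps=1e-8):
    """Eqs. (35) and (36): gradient of the soft F1 with respect to a_k and c_k."""
    f = platt_score(z, a, c)
    tp = (y * f).sum()
    D = 2 * tp + ((1 - y) * f).sum() + (y * (1 - f)).sum() + eps
    dF1_df = (2 * y * D - 2 * tp) / D**2      # Eq. (34) with N_k = 2 TP_k
    s = f * (1 - f)                           # sigmoid derivative, Eqs. (28)-(29)
    return (dF1_df * s * z).sum(), (dF1_df * s).sum()


def fit_platt(z, y, lr=0.1, steps=100):
    """Gradient ascent on (a_k, c_k) for a single one-vs-rest subproblem."""
    a, c = 1.0, 0.0                           # assumed starting point
    for _ in range(steps):
        ga, gc = platt_grad(z, y, a, c)
        a, c = a + lr * ga, c + lr * gc
    return a, c


rng = np.random.default_rng(0)
z = rng.normal(size=40)                                   # stand-in logits of class k
y = (z + rng.normal(scale=0.5, size=40) > 0.5).astype(float)
a_k, c_k = fit_platt(z, y)
```

Because the subproblems are decoupled, fitting all $M$ classes amounts to calling `fit_platt` once per column of the logit matrix with the matching column of one-hot labels.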

4.4. Beta Correction

Beta correction enhances the flexibility of output transformation by introducing an additional nonlinear term compared to Platt correction [50]. While Platt correction uses a simple affine transformation followed by a sigmoid, Beta correction incorporates a logarithmic curvature term $b_{k} \log \sqrt{z_{k}^{2} + \delta}$ that allows the transformation to adapt more precisely to the shape of the raw score distribution [51]. Like Platt correction, Beta correction applies independent per-class sigmoid functions and is optimized through $M$ decoupled one-vs-rest subproblems. The macro-averaged objective in Equation (9) is maximized by independently ascending the gradient of $\mathrm{F1}_{k}$ with respect to the class-specific parameters $(a_{k}, b_{k}, c_{k})$, and the same argmax decision rule with raw-logit tie-breaking applies at test time. In short, the corrected score for class $k$ is defined as formalized below:

$$f_{k}(z_{k}) = \sigma\left(a_{k} z_{k} + b_{k} \log \sqrt{z_{k}^{2} + \delta} + c_{k}\right), \tag{37}$$

where

a_{k}

,

b_{k}

,

c_{k} \in R

are the correction parameters and

δ

is a small constant to ensure differentiability. Aiming to compute the gradients for

a_{k}

,

b_{k}

, and

c_{k}

, we first define the argument of the sigmoid as follows:

g_{k}^{(i)} = a_{k} z_{k}^{(i)} + b_{k} log \sqrt{{(z_{k}^{(i)})}^{2} + δ} + c_{k},

(38)

so that

f_{k} (z_{k}^{(i)}) = σ (g_{k}^{(i)}) = \frac{1}{1 + e^{- g_{k}^{(i)}}} .

(39)

The partial derivative of

f_{k} (z_{k}^{(i)})

with respect to each parameter is

\frac{\partial f_{k} (z_{k}^{(i)})}{\partial θ} = \frac{\partial σ (g_{k}^{(i)})}{\partial g_{k}^{(i)}} \cdot \frac{\partial g_{k}^{(i)}}{\partial θ},

(40)

where

θ \in {a_{k}, b_{k}, c_{k}}

and the derivative of the sigmoid is given by

\frac{\partial σ (g)}{\partial g} = σ (g) (1 - σ (g)),

(41)

so

\frac{\partial σ (g_{k}^{(i)})}{\partial g_{k}^{(i)}} = f_{k} (z_{k}^{(i)}) (1 - f_{k} (z_{k}^{(i)})) .

(42)

We have the partial derivatives of

g_{k}^{(i)}

with respect to the parameters

\frac{\partial g_{k}^{(i)}}{\partial a_{k}} = z_{k}^{(i)},

(43)

\frac{\partial g_{k}^{(i)}}{\partial b_{k}} = \frac{1}{2} log ({(z_{k}^{(i)})}^{2} + δ),

(44)

\frac{\partial g_{k}^{(i)}}{\partial c_{k}} = 1 .

(45)

Combining these results, we get

\frac{\partial f_{k} (z_{k}^{(i)})}{\partial a_{k}} = f_{k} (z_{k}^{(i)}) (1 - f_{k} (z_{k}^{(i)})) z_{k}^{(i)},

(46)

\frac{\partial f_{k} (z_{k}^{(i)})}{\partial b_{k}} = f_{k} (z_{k}^{(i)}) (1 - f_{k} (z_{k}^{(i)})) \cdot \frac{1}{2} log ({(z_{k}^{(i)})}^{2} + δ),

(47)

\frac{\partial f_{k} (z_{k}^{(i)})}{\partial c_{k}} = f_{k} (z_{k}^{(i)}) (1 - f_{k} (z_{k}^{(i)})) .

(48)

Finally, the gradient of

{F 1}_{k}

with respect to

θ \in {a_{k}, b_{k}, c_{k}}

is as follows:

\frac{\partial {F 1}_{k}}{\partial θ} = \frac{2 (D_{k} - 2 {TP}_{k}) \frac{\partial {TP}_{k}}{\partial θ} - 2 {TP}_{k} (\frac{\partial {FP}_{k}}{\partial θ} + \frac{\partial {FN}_{k}}{\partial θ})}{D_{k}^{2}} .

(49)

Substituting

\frac{\partial {FN}_{k}}{\partial θ} = - \frac{\partial {TP}_{k}}{\partial θ}

and combining terms gives

\frac{\partial {F 1}_{k}}{\partial θ} = \frac{2}{D_{k}^{2}} [(D_{k} - {TP}_{k}) \frac{\partial {TP}_{k}}{\partial θ} - {TP}_{k} \frac{\partial {FP}_{k}}{\partial θ}],

(50)

where

\frac{\partial {TP}_{k}}{\partial θ} = \sum_{i} y_{k}^{(i)} f_{k}^{(i)} (1 - f_{k}^{(i)}) \frac{\partial g_{k}^{(i)}}{\partial θ},

(51)

\frac{\partial {FP}_{k}}{\partial θ} = \sum_{i} (1 - y_{k}^{(i)}) f_{k}^{(i)} (1 - f_{k}^{(i)}) \frac{\partial g_{k}^{(i)}}{\partial θ} .

(52)

The additional degree of nonlinearity in the Beta correction method enables the model to better fit decision boundaries in cases where the relationship between the logits and the ideal decision scores is more distorted.
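The Beta gradient in Equations (50)–(52) can be illustrated in the same way. The sketch below is a minimal example for one class, assuming the same soft counts as in the Platt case and an arbitrary value δ = 10⁻⁶ (the text only states that δ is small). The name `soft_f1_beta`, the synthetic data, and the parameter values are hypothetical.

```python
import numpy as np

def soft_f1_beta(theta, z, y, delta=1e-6):
    """Return the soft F1 of one class and its gradient w.r.t. (a_k, b_k, c_k)."""
    a, b, c = theta
    log_term = 0.5 * np.log(z**2 + delta)            # log sqrt(z^2 + delta), Eq. (44)
    g = a * z + b * log_term + c                     # Eq. (38)
    f = 1.0 / (1.0 + np.exp(-g))                     # Eq. (39)
    tp = np.sum(y * f)                               # soft true positives
    fp = np.sum((1.0 - y) * f)                       # soft false positives
    fn = np.sum(y * (1.0 - f))                       # soft false negatives
    d = 2.0 * tp + fp + fn                           # denominator D_k
    s = f * (1.0 - f)                                # Eq. (42)
    dg = np.stack([z, log_term, np.ones_like(z)])    # Eqs. (43)-(45), shape (3, K)
    dtp = dg @ (y * s)                               # Eq. (51)
    dfp = dg @ ((1.0 - y) * s)                       # Eq. (52)
    grad = 2.0 / d**2 * ((d - tp) * dtp - tp * dfp)  # Eq. (50)
    return 2.0 * tp / d, grad

# Synthetic logits and one-vs-rest labels (illustrative only).
rng = np.random.default_rng(1)
z = rng.normal(size=60)
y = (rng.random(60) < 0.25).astype(float)
theta = np.array([0.8, 0.5, -0.2])

f1, grad = soft_f1_beta(theta, z, y)

# Central finite differences for comparison with the analytic gradient.
eps = 1e-6
numeric = np.array([
    (soft_f1_beta(theta + eps * e, z, y)[0]
     - soft_f1_beta(theta - eps * e, z, y)[0]) / (2.0 * eps)
    for e in np.eye(3)
])
print(f1, grad, numeric)
```

Stacking the three partial derivatives of g into one array keeps Equations (51) and (52) as two matrix-vector products, which is one convenient way to reuse the same code for all parameters.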

5. Empirical Studies

In this section, we evaluate the results of the proposed correction methods when applied to trained rLTCN classifiers. We first compare the performance of the four correction strategies (Temperature, Platt, Beta, and Shifting) on a benchmark of imbalanced datasets. Then, we extend this performance comparison to contrast the rLTCN corrected with the best strategy against state-of-the-art classifiers. Finally, we illustrate the practical value of the correction methods in a real case study concerning churn prediction.

5.1. Comparing Correction Strategies

We adopt the 30 classification problems (see Table 1) from [6] to assess the performance differences of the proposed correction methods. These problems involve different statistical properties and varying complexities, as indicated by class imbalance scores (the higher the score, the greater the imbalance) and the separability indices (the smaller the index, the more challenging the problem). The latter is determined as the proportion of presumably correct instances [52] relative to the total instances. In these problems, the number of features ranges from 2 to 240, the number of instances ranges from 846 to 10,992, while the number of decision classes ranges from 2 to 100.

Next, we measure the performance improvements when adding the correction layer to the rLTCN architecture after performing 5-fold nested cross-validation. The nested cross-validation procedure uses a stratified 5-fold outer loop and a stratified 5-fold inner loop. Stratification ensures that the class proportions of the original dataset are approximately preserved in every training, validation, and test partition. Both loops are controlled by a fixed random seed of 42, which was applied to Python 3.13.9’s built-in random module, NumPy 2.3.5, and TensorFlow 2.20.0 before any split construction.

The 30 benchmark datasets were loaded from CSV files in which the last column contains the class label and all preceding columns contain the input features. Each dataset was preprocessed to retain only decision classes with at least five instances. This lower bound matches the number of outer folds and guarantees that every class appears in each fold during stratified splitting. Class labels were encoded as consecutive integers using a label encoder fitted on the full dataset before any fold was constructed. The rLTCN classifier supports this integer encoding since it internally converts class labels to one-hot representations, where each output neuron maps to a single decision class.

The outer and inner folds were built using scikit-learn’s StratifiedKFold with shuffle enabled, so random variation in fold assignment is governed by the fixed seed. In each outer fold, the held-out partition served as the test set, and the remaining instances formed the training partition. The training partition was passed to the inner grid search cross-validator, which explored the hyperparameter grid using macro-averaged F1 as the internal scoring metric. The combination yielding the highest mean macro-averaged F1 across the inner validation partitions was selected, and the winning model was retrained on the full outer training partition and evaluated on the corresponding outer test fold. The performance estimates reported in the paper correspond to the mean performance metric computed across the five outer test folds.
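The label preprocessing described above (dropping classes with fewer than five instances, encoding the remaining labels as consecutive integers, and expanding them to one-hot targets) can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: the function name `preprocess_labels` and the toy labels are hypothetical, and the actual pipeline relies on a label encoder and on the rLTCN's internal one-hot conversion.

```python
import numpy as np

def preprocess_labels(labels, min_count=5):
    """Filter rare classes, encode labels as 0..M-1, and build one-hot targets."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    keep = classes[counts >= min_count]          # classes with enough instances
    mask = np.isin(labels, keep)                 # rows that survive the filter
    kept_classes, encoded = np.unique(labels[mask], return_inverse=True)
    one_hot = np.eye(len(kept_classes))[encoded] # one column per decision class
    return mask, encoded, one_hot, kept_classes

# Toy labels: class "c" has only two instances and is therefore removed.
toy_labels = np.array(["a"] * 6 + ["b"] * 5 + ["c"] * 2)
mask, encoded, one_hot, kept_classes = preprocess_labels(toy_labels)
print(kept_classes, encoded, one_hot.shape)
```

The boolean `mask` would be applied to the feature matrix as well, so that features and labels stay aligned before the stratified folds are built.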

Since the proposed methods are post-optimization procedures, we first need to train the rLTCN classifiers. Concerning hyperparameter tuning, we optimized the following hyperparameters using grid search: the number of iterations during reasoning

T \in {5, 10, 15}

, the activation function

f \in {sigmoid, \tanh}

, the nonlinearity coefficient

ϕ \in {0.8, 0.9, 1.0}

, and the regularization penalty

λ \in {0, 1.0 \times 10^{- 4}, 1.0 \times 10^{- 2}}

. The correction layer is optimized using the analytically derived partial derivatives and the Limited-memory Broyden–Fletcher–Goldfarb–Shanno with Box constraints (L-BFGS-B) optimizer [53] from scipy.optimize.minimize. We choose L-BFGS-B over standard gradient descent for two reasons. First, the problem has a moderate number of parameters, for which L-BFGS-B achieves superlinear convergence, which is much faster than the linear convergence of gradient descent. Second, L-BFGS-B eliminates the need to manually tune the learning rate, as it selects the step size through a line search at each iteration. For this optimizer, we use a maximum of 1000 iterations and tight tolerances (≤10⁻¹²) to reduce premature stopping.
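The optimizer call described above might look as follows for a single class under Platt correction. This is a hedged sketch, not the experimental code: the objective `neg_soft_f1`, the synthetic logits, the starting point (a = 1, c = 0), and the mapping of the tolerance to the `ftol` and `gtol` options are assumptions. Since `scipy.optimize.minimize` minimizes, the soft F1 and its gradient are negated.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_soft_f1(params, z, y):
    """Negative soft F1 and its gradient for f(z) = sigmoid(a*z + c)."""
    a, c = params
    f = expit(a * z + c)                       # numerically stable sigmoid
    tp = np.sum(y * f)                         # soft true positives
    fp = np.sum((1.0 - y) * f)                 # soft false positives
    fn = np.sum(y * (1.0 - f))                 # soft false negatives
    d = 2.0 * tp + fp + fn
    df1_df = (2.0 * y * d - 2.0 * tp) / d**2   # derivative of F1 w.r.t. each output
    s = f * (1.0 - f)                          # derivative of the sigmoid
    grad = np.array([np.sum(df1_df * s * z), np.sum(df1_df * s)])
    return -2.0 * tp / d, -grad                # negate: minimize() minimizes

# Synthetic imbalanced problem with logits biased toward the negative class.
rng = np.random.default_rng(42)
y = (rng.random(400) < 0.1).astype(float)
z = rng.normal(loc=np.where(y == 1.0, -0.5, -2.5), scale=1.0)

x0 = np.array([1.0, 0.0])                      # assumed start: a = 1, c = 0
baseline_f1 = -neg_soft_f1(x0, z, y)[0]

result = minimize(neg_soft_f1, x0, args=(z, y), jac=True, method="L-BFGS-B",
                  options={"maxiter": 1000, "ftol": 1e-12, "gtol": 1e-12})
corrected_f1 = -result.fun
print(baseline_f1, corrected_f1, result.x)
```

Passing `jac=True` tells the optimizer that the objective returns both the value and the gradient, which avoids the finite-difference approximation it would otherwise use.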

In our experiments, we use seven performance measures, each capturing a distinct aspect of predictive quality. Cohen’s Kappa (

κ

) quantifies the agreement between predicted and true class labels beyond what would be expected by chance alone. It takes values in

[- 1, 1]

, where

κ = 0

corresponds to random agreement,

κ = 1

to perfect agreement, and

κ = - 1

to perfect systematic disagreement (agreement worse than chance). Unlike raw accuracy, Kappa penalizes models that exploit class prevalence and thus provides a stronger indicator of discriminative power in imbalanced settings. Macro-averaged F1 (

F_{1}^{m}

) computes the harmonic mean of precision and recall independently for each class and then averages across classes without weighting, which gives equal importance to rare and frequent classes. Weighted F1 (

F_{1}^{w}

) follows the same logic but weights each class contribution by its support, rendering it sensitive to the overall distribution of examples. Balanced accuracy (BA) averages recall across all classes and serves as a complement to Kappa when examining per-class sensitivity. Macro-averaged precision (P) and recall (R) separately measure the proportion of correctly identified positives among all predicted positives and among all actual positives. The area under the precision-recall curve (PR-AUC) summarizes the trade-off between precision and recall at multiple thresholds. It is especially informative under class imbalance, as it is not inflated by the large number of true negatives that tends to distort the ROC curve.
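To make the difference between these measures concrete, the sketch below computes Kappa, macro-averaged F1, weighted F1, and balanced accuracy from a confusion matrix. The example is a hypothetical binary problem with 90 majority and 10 minority instances, in which a classifier predicts the majority class for every instance. The helper name `summary_metrics` is an assumption, and the code assumes that every class has at least one true instance.

```python
import numpy as np

def summary_metrics(conf):
    """Metrics from a confusion matrix (rows: true classes, columns: predicted)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    tp = np.diag(conf)
    support = conf.sum(axis=1)                    # instances per true class
    predicted = conf.sum(axis=0)                  # instances per predicted class
    p_obs = tp.sum() / total                      # observed agreement (accuracy)
    p_exp = np.sum(support * predicted) / total**2  # agreement expected by chance
    f1 = 2.0 * tp / (support + predicted)         # equals 2TP / (2TP + FP + FN)
    recall = tp / support
    return {
        "accuracy": p_obs,
        "kappa": (p_obs - p_exp) / (1.0 - p_exp),
        "macro_f1": f1.mean(),
        "weighted_f1": np.sum(f1 * support) / total,
        "balanced_accuracy": recall.mean(),
    }

# Majority-only predictor on a 90/10 problem: accuracy is 0.9 but Kappa is 0.
metrics = summary_metrics([[90, 0], [10, 0]])
print(metrics)
```

In this toy case the weighted F1 remains high because the majority class dominates the weighting, whereas Kappa, macro-averaged F1, and balanced accuracy expose the failure on the minority class.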

Table 2, Table 3, Table 4 and Table 5 report the improvement expressed in percentage points for each dataset and correction method. Across all four methods, the largest Kappa gains concentrate in datasets with severe class imbalance, with

Δ κ

reaching

+ 24.25 %

under all methods in D2 (imb

= 0.87

) and ranging from

+ 20.85 %

to

+ 22.61 %

in D13 (imb

= 0.99

). Other imbalanced datasets also show consistent gains, with

Δ κ

generally increasing alongside the degree of imbalance and reaching up to

+ 7.46 %

in datasets with imb ≥ 0.59 such as D28–D30. This pattern aligns with the motivation for the correction approach, given that the Moore-Penrose inverse learning step is governed by the mean squared error, which is dominated by the majority class in imbalanced problems. A post-hoc adjustment of the output logits can recover substantial recall for minority classes, as confirmed by the large

Δ BA

and

Δ R

values in the same high-imbalance datasets. These recall gains occasionally come at the cost of precision, since repositioning decision boundaries to favor minority classes increases true positives while also admitting additional false positives. This trade-off is most visible under Platt and Beta corrections, where

Δ P

reaches

- 2.77 %

in D2, with smaller precision losses appearing in D11, D26, D29, and D30 under those same methods. Datasets D5, D7, and D9 (imb

= 0.00

, sep

\geq 0.90

) show gains no larger than

+ 0.56 %

in

Δ κ

under any method, while D3 (sep

= 1.00

) and D21 (imb

= 0.69

, sep

= 0.83

) yield zero improvement across every metric. In all these cases, the correction layer finds no logit distortion to address because the baseline rLTCN already produces good predictions.

Among the four methods, Beta and Platt corrections consistently produce the largest improvements. Beta correction achieves the highest mean

Δ κ

of

4.63 %

across the 30 datasets, followed by Platt at

4.48 %

, while Temperature and Shifting both reach

4.34 %

. In datasets combining high imbalance with low separability, the gains in

Δ κ

and

Δ BA

are substantial across all four methods, and the relative advantage of Beta and Platt over the softmax-based corrections (Temperature and Shifting) is also more pronounced in these datasets. This advantage is consistent with the sigmoid-based parameterizations of Platt and Beta being better suited to settings where the relationship between raw logits and ideal decision scores is nonlinear. The additional

b_{k}

parameter in Beta correction introduces a logarithmic curvature term that provides an extra degree of freedom unavailable to the other methods, which is particularly useful when the logit distribution is asymmetric. In large multi-class problems with low separability (sep

\leq 0.54

, 100 classes), all four corrections produce substantial gains, with

Δ κ

ranging from

+ 5.70 %

to

+ 12.66 %

. In these settings, Shifting occasionally delivers the largest single

Δ κ

, possibly because a uniform additive translation of the score space suffices to re-balance a large number of equally represented classes. However,

Δ F_{1}^{m}

does not always track

Δ κ

in these problems, with Platt leading on macro-averaged F1 for some datasets while Shifting leads on Kappa, which suggests that the relative ranking of methods can depend on the specific metric under consideration. The broader picture across all datasets is that

Δ BA

and

Δ R

generally follow the Kappa ranking, while

Δ P

occasionally diverges under high-imbalance conditions.

To further explore the observed performance differences, we apply the Friedman test [54], a nonparametric procedure that evaluates whether at least one method produces systematically different

Δ κ

improvements across the 30 datasets. The test yields a statistic of

14.95

with

p = 0.0019

, which confirms that the four methods are not statistically equivalent. To identify which pairs drive this difference, we apply the Wilcoxon signed-rank test [55] for all six method pairs, followed by Holm’s correction [56] to control the family-wise error rate. The Wilcoxon test does not assume normality and accounts for both the direction and magnitude of pairwise differences, which makes it appropriate for comparing classifiers over a finite benchmark. Holm’s correction adjusts the rejection threshold sequentially from the most to the least significant pair, which is less conservative than Bonferroni correction while still controlling the family-wise error rate.
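The sequential adjustment performed by Holm's correction can be sketched in a few lines. In the example below, the raw p-values are hypothetical (they are not taken from Table 6), the helper name `holm_adjust` is an assumption, and the 0.05 significance level is assumed for illustration only.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # most significant first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The multiplier shrinks from m down to 1 as we move to larger p-values.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted p-values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

raw = [0.004, 0.03, 0.02, 0.5]          # hypothetical raw p-values
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

Only the smallest p-value is multiplied by the full number of comparisons, which is why the procedure is less conservative than multiplying every p-value by m as Bonferroni does.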

Table 6 reports the raw and corrected p-values, the test outcome, the win/tie/loss (W/T/L) counts, and Cohen’s d for each pair. The W/T/L counts record how many times Method 1 produces a higher, equal, or lower

Δ κ

than Method 2 across the 30 datasets. This count is a direct measure of dominance that is independent of the p-value. Cohen’s d gives the standardized mean paired difference between Method 1 and Method 2, where values near

0.2

,

0.5

, and

0.8

indicate small, medium, and large effects. A positive d means Method 1 improves more than Method 2 on average, while a negative d means Method 2 holds the advantage. Only one pair reaches significance after Holm’s correction, namely Beta against Shifting (

p = 0.0290

,

d = 0.384

, W/T/L

= 17 / 11 / 2

). Therefore, Beta correction produces larger

Δ κ

improvements than Shifting across the benchmark. The comparisons of Beta against Temperature and Platt do not reach significance after correction (

p = 0.2774

in both cases). Nonetheless, the raw p-values are borderline (

p \approx 0.06

) and the W/T/L counts strongly favor Beta (13/14/3 and 11/16/3), with moderate effect sizes (

d \approx 0.35

–

0.37

). This pattern points to a practical advantage of Beta over Temperature and Platt, although the correction penalty prevents it from reaching formal significance, given the 30 datasets used in our study. The remaining pairs do not approach significance, and their W/T/L counts and effect sizes confirm the absence of any meaningful difference.
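The W/T/L counts and the paired effect size can be computed as in the following sketch. It assumes that Cohen's d for paired samples is the mean of the paired differences divided by their sample standard deviation, which is one common convention and may differ from the exact formula behind Table 6. The function name `paired_summary` and the toy gains are hypothetical.

```python
import numpy as np

def paired_summary(gains_1, gains_2):
    """Return (wins, ties, losses) of method 1 over method 2 and the paired Cohen's d."""
    diff = np.asarray(gains_1, dtype=float) - np.asarray(gains_2, dtype=float)
    wtl = (int(np.sum(diff > 0)), int(np.sum(diff == 0)), int(np.sum(diff < 0)))
    d = diff.mean() / diff.std(ddof=1)   # assumed convention: mean / sample std of differences
    return wtl, d

# Toy Kappa gains of two methods on four datasets (illustrative values only).
method_1 = [1.0, 2.0, 3.0, 4.0]
method_2 = [1.0, 1.5, 2.0, 4.5]
wtl, d = paired_summary(method_1, method_2)
print(wtl, round(d, 3))
```

With real-valued gains, an exact equality test for ties may be too strict, so a small tolerance could be used instead.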

Figure 2 illustrates, as an example, the inner workings of the proposed logit correction methods used to adjust the outputs produced by the rLTCN model. Each plot compares the original output scores (dashed gray line) with the adjusted scores after correction (solid blue line) as a function of the input logit values. As a reminder, the four class-specific correction methods operate as follows: (a) shifting correction applies additive shifts to the logits, (b) temperature correction softens or sharpens the logit distribution using a scaling factor, (c) Platt correction applies a parametric adjustment to the sigmoid function, and (d) Beta correction applies a more flexible nonlinear transformation that incorporates logit magnitude. It is worth noting that the first two methods use the softmax activation, while the latter two operate with the sigmoid activation function.
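The two softmax-based corrections can be illustrated with a short sketch. The exact parameterizations are assumptions made for this example: shifting is taken as softmax(z + s) with one additive shift per class, and temperature correction as softmax(z / t) with one positive temperature per class. The function names and the numeric values are hypothetical.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a vector of logits."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def shifting_correction(z, shift):
    """Assumed form: add a class-specific shift before the softmax."""
    return softmax(z + shift)

def temperature_correction(z, temp):
    """Assumed form: divide each logit by a class-specific temperature."""
    return softmax(z / temp)

z = np.array([2.0, 1.0, 0.5])                                    # raw logits of one instance
raw = softmax(z)
shifted = shifting_correction(z, np.array([0.0, 1.5, 0.0]))      # favor class 1
tempered = temperature_correction(z, np.array([4.0, 1.0, 1.0]))  # damp class 0
print(raw.argmax(), shifted.argmax(), tempered.argmax())
```

In both cases the predicted class moves from class 0 to class 1, which shows how a small class-specific adjustment can relocate the decision boundary without retraining the underlying model.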

The computational overhead of the correction layer is negligible relative to the two-step training cost of the rLTCN classifier. For a dataset with K instances and M decision classes, Shifting and Temperature corrections each optimize M parameters, Platt correction optimizes

2 M

parameters, and Beta correction optimizes

3 M

parameters. Let

P \in {M, 2 M, 3 M}

denote the number of correction parameters for a given method. Each L-BFGS-B iteration requires one evaluation of the macro-averaged soft F1 objective at cost

O (K M)

and one gradient evaluation at the same cost, since per-class gradients accumulate instance-wise terms over K instances. The L-BFGS-B algorithm maintains a limited-memory approximation of the inverse Hessian using the m most recent curvature pairs, where m is typically a small constant such as

m = 10

. The Hessian update and direction computation at each iteration cost

O (m P)

, which is dominated by

O (K M)

whenever

K ≫ m

. The total cost over I iterations is

O (I \cdot K M)

, where I is bounded by the maximum of 1000 iterations set in our experiments. Since I is capped at a constant and m and P do not depend on K, the correction layer adds a cost that scales linearly in K and M. This is substantially lower than the

O (K \cdot N^{2})

cost of the pseudoinverse computation in the supervised learning step of the rLTCN classifier, where N denotes the number of input features.

5.2. Comparing Against State-of-the-Art Classifiers

In the second part of our empirical study, we compare the performance of the cLTCN classifier (rLTCN using only Beta correction for simplicity) against state-of-the-art classifiers. The selected methods include the FCM with threshold correction (FCM-A) [9], the FCM Multiclass Classifier (FCMMC) [15], the uncorrected rLTCN classifier [6], Logistic Regression (LR), Decision Trees (DTs), Gaussian Naive Bayes (GNB), Support Vector Machines (SVMs), Light Gradient-Boosting Machines (LGBMs) [57], and Attentive Interpretable Tabular Learning (TabNet) [58]. Each classifier undergoes hyperparameter tuning through nested 5-fold cross-validation and grid search, using the same 30 pattern classification datasets from the previous subsection.

For DTs, the tuned parameters are criterion ∈ {gini, entropy}, splitter ∈ {best, random}, and max_features ∈ {sqrt, log2}. For LR, the tuned parameters are solver ∈ {lbfgs, saga}, C ∈ {0.01, 0.1, 1, 10, 100}, and penalty ∈ {l2, none} for lbfgs and {l1, l2, none} for saga. For these solvers, none means that no regularization is enforced. For SVMs, the tuned parameters are kernel ∈ {linear, poly, rbf, sigmoid}, C ∈ {0.01, 0.1, 1, 10, 100}, and gamma ∈ {scale, auto}. For LGBMs, the tuned parameters are n_estimators ∈ {100, 300, 500}, max_depth ∈ {10, 20, 30}, and learning_rate ∈ {0.01, 0.05, 0.1}. For TabNet, the tuned parameters are N_d = N_a ∈ {8, 16, 32}, n_steps ∈ {3, 5, 7}, and gamma ∈ {1.0, 1.3, 1.5}. For FCMMC, the parameters are f ∈ {sigmoid}, training_loss ∈ {softmax}, optimizer ∈ {rmsprop}, depth ∈ {2, 3, 5}, epochs ∈ {50, 100}, learning_rate ∈ {0.001, 0.01, 0.05, 0.1, 0.5}, and batch_size ∈ {16, 32, 64}. For rLTCN, the hyperparameters are the same as those defined in the previous experiment. Note that GNB and FCM-A do not involve relevant hyperparameters to be tuned during grid search.

Table 7, Table 8 and Table 9 report the results of the Friedman test and the Wilcoxon signed-rank test with Holm’s correction for all pairs of classifiers within each group. In this study, we use

κ

as the primary performance metric. The Friedman test is significant in all three groups, with statistics of

75.04

,

52.44

, and

50.96

for FCM-based, white-box, and black-box classifiers, respectively, all with

p < 0.0001

. The W/T/L counts and Cohen’s d values supplement the p-values by providing directional evidence and effect size estimates, following the same conventions described earlier in this section.

Table 7 shows that every pair among the FCM-based classifiers is significantly different after Holm’s correction. FCM-A is the weakest method by a large margin, losing against all three competitors on all 30 datasets and producing large negative effect sizes (

d = - 2.255

against rLTCN and

d = - 2.651

against cLTCN). This outcome is expected due to the structural limitation of FCM-A, which relies on a single output neuron and M-1 decision thresholds. FCMMC improves substantially over FCM-A but still loses to both rLTCN and cLTCN on 25 out of 30 datasets. The comparison between rLTCN and cLTCN is the closest in this group, with cLTCN winning on 24 datasets against only 2 for rLTCN, with 4 ties (

d = - 0.825

,

p = 0.0001

). This confirms that the Beta correction layer produces a reliable and statistically supported gain over the uncorrected baseline.

Table 8 shows that cLTCN is significantly better than LR, DTs, and GNB after Holm’s correction, with large effect sizes (

d \leq - 0.939

) and W/T/L counts heavily favoring cLTCN in all three comparisons. LR and DTs are not significantly different from each other (

p = 0.9099

,

d = - 0.068

), and the near-zero effect size confirms that their mean

κ

scores are practically indistinguishable. DTs and GNB also fail to reach significance after correction (

p = 0.0715

), though the W/T/L count (20/0/10) and moderate effect size (

d = + 0.449

) point to a practical advantage for DTs. LR is significantly better than GNB (

p = 0.0098

,

d = + 0.620

), with LR winning on 23 out of 30 datasets.

Table 9 shows that cLTCN is better than all three black-box classifiers after Holm’s correction. TabNet is the weakest model in this group, losing to SVMs on 26 datasets, to LGBMs on 26 datasets, and to cLTCN on 27 datasets, with large effect sizes in all three comparisons. SVMs and LGBMs are not significantly different from each other (

p = 0.3373

,

d = + 0.258

), though SVMs hold a modest W/T/L advantage of 17/2/11. The comparisons of cLTCN against SVMs (

p = 0.0092

,

d = - 0.524

) and against LGBMs (

p = 0.0041

,

d = - 0.617

) are the closest in this group, yet both remain significant after correction, with cLTCN winning on 18 and 22 datasets, respectively.

5.3. Case Study Concerning Churn Prediction

To assess the practical relevance of our correction methods in a real-world scenario, we consider a customer churn prediction case study using the Orange Telecom dataset. This dataset has been widely adopted in the churn modeling literature [59] and represents a typical binary classification task with class imbalance. Churn prediction is a key problem in electronic commerce and subscription-based services, since the accurate identification of customers at risk of attrition helps with retention strategies [60].

The Orange Telecom dataset contains 3333 customer records, each described by 68 features, including service plans, usage statistics, and customer interaction variables. The target variable indicates whether a customer churned within the considered period. The dataset does not contain missing values. The observed churn rate is 14.49%, which means that 85.51% of customers were retained, which gives an imbalance ratio of 5.9:1 (see Figure 3a). Figure 3b shows the top 10 features correlated with the target. The total number of customer service calls and features that signal high usage are positively correlated with churning, while the voicemail usage is negatively correlated.
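The imbalance ratio quoted above follows directly from the reported churn rate. The short sketch below reproduces the arithmetic. The class counts are not stated in the text and are derived here by rounding, so they should be read as an inference rather than as reported figures.

```python
n_total = 3333                 # customer records reported for the dataset
churn_rate = 0.1449            # reported churn rate (14.49%)

n_churn = round(n_total * churn_rate)   # inferred number of churned customers
n_retained = n_total - n_churn          # inferred number of retained customers
imbalance_ratio = n_retained / n_churn  # retained customers per churned customer

print(n_churn, n_retained, round(imbalance_ratio, 1))
```

The resulting ratio of about 5.9 retained customers per churned customer matches the 5.9:1 figure given in the text.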

Exploratory analysis reveals clear behavioral differences between churned and retained customers. The box plot in Figure 3c indicates that the number of customer service calls is strongly associated with churn. Retained customers make on average 1.45 service calls, whereas churned customers average 2.23 calls. Moreover, customers with four or more service calls exhibit an average churn rate of 51.7%, compared to 11.3% among those with fewer than four calls. Figure 3d decomposes the distribution of churn rate per number of customer service calls. This plot makes clear that reaching four or more customer service calls is a predictor of churn. These differences in the number of customer service calls are a clear signal of dissatisfaction among churning customers.

Figure 3e,f show that voicemail usage has the opposite effect. Customers with voicemail activated, representing 27.7% of the sample, have a churn rate of 8.68%, compared to 16.72% among those without voicemail. Another interesting observation is that a high number of voicemail messages is also associated with a lower tendency to churn. In contrast, total usage variables such as aggregate call minutes show only minor differences between groups, with churned customers exhibiting slightly higher average usage (see Figure 3g). Finally, Figure 3h shows that geographic variation across U.S. states is also limited, with a modest churn rate variance of 5.76%.

Using the same settings as in the previous section, we measure the performance improvements after adding correction layers to the rLTCN classifier. Table 10 reports the simulation results after learning and post-optimization.

The results show that the Kappa gain for every correction strategy lies in the range of +4.34% to +4.49%. The macro-averaged F1 gains are nearly identical across correction methods, which suggests that all strategies recover minority class recall to a comparable degree. The balanced accuracy and recall gains follow a similar pattern, with Shifting producing the largest

Δ BA

and

Δ R

at the cost of the steepest precision drop of −2.96%. Temperature and Platt corrections obtain identical results across all measures. Beta correction is the most balanced option, combining the highest Kappa gain of +4.49% and the largest PR-AUC improvement of +3.51% with the smallest precision penalty of −0.1%. However, the similarity of the results suggests that all proposed correction strategies shift the decision boundary in much the same way for this dataset.

6. Conclusions

This paper presented four post-hoc correction methods that adjust the output logits of rLTCN classifiers after the training phase is done. These methods apply class-specific transformations such as additive shifts, temperature scaling, and nonlinear mappings. Each correction maximizes a differentiable surrogate of the F1 score, which allows the model to improve performance in settings affected by class imbalance or complex decision boundaries. Unlike calibration techniques that align output scores with correctness probabilities, our methods aim to directly enhance classification performance.

The empirical evaluation covered 30 tabular datasets with varying levels of imbalance and class separability. The largest improvements appeared in datasets with severe class imbalance, while datasets with near-perfect separability or balanced classes showed little or no improvement. This pattern aligns with the motivation for the correction approach, given that the Moore-Penrose inverse learning step is governed by the mean squared error, which the majority class dominates in imbalanced problems. The recall gains produced by all four methods occasionally came at the cost of precision, since repositioning decision boundaries to favor minority classes increases true positives but also introduces false positives. Among the four methods, Beta correction achieved the highest mean Kappa improvement, and the statistical analysis confirmed it as the only method significantly better than Shifting. The corrected rLTCN model outperformed all FCM-based classifiers, all white-box models, and all black-box competitors in the benchmark study. The customer churn prediction case study further supported these findings, where all four corrections produced performance gains, with Beta correction achieving the best trade-off between discriminative improvement and precision retention.

The proposed methods have some limitations worth acknowledging. The non-convex nature of the soft F1 surrogate means that L-BFGS-B may converge to a local optimum, so the correction gains depend on the quality of the logits produced by the base rLTCN. The macro-averaged F1 objective also treats all classes equally, which may not suit problems where specific classes carry unequal misclassification costs. Additionally, the correction layer modifies output scores without providing direct insight into which features or boundaries were adjusted, which limits interpretability at the correction stage. Future research will explore the optimization dynamics of the correction methods, particularly their convergence behavior under the non-convex soft F1 loss. Investigating whether gradient norms decrease monotonically and identifying the assumptions needed for stability could clarify why certain corrections perform better under specific data conditions. Developing techniques to interpret the corrections would further enhance transparency by revealing how and why output logits are adjusted for different classes or instances.

Author Contributions

Conceptualization, G.N.; methodology, G.N.; software, G.N. and I.G.; validation, G.N., I.G. and Y.S.; formal analysis, G.N. and I.G.; resources, Y.S.; data curation, G.N. and I.G.; writing—original draft preparation, G.N.; writing—review and editing, G.N., I.G. and Y.S.; visualization, G.N. and I.G.; supervision, G.N.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

Y. Salgueiro would like to acknowledge the support provided by ANID Fondecyt Regular 1240293 and the National Center for Artificial Intelligence CENIA FB210017, Basal ANID.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Kosko, B. Fuzzy cognitive maps. Int. J. Man-Mach. Stud. 1986, 24, 65–75.
Giabbanelli, P.J.; Nápoles, G. Fuzzy Cognitive Maps: Best Practices and Modern Methods; Springer: Cham, Switzerland, 2024.
Apostolopoulos, I.D.; Groumpos, P.P. Fuzzy cognitive maps: Their role in explainable artificial intelligence. Appl. Sci. 2023, 13, 3412.
Karatzinis, G.D.; Boutalis, Y.S. A Review Study of Fuzzy Cognitive Maps in Engineering: Applications, Insights, and Future Directions. Eng 2025, 6, 37.
Baggio, G.; Bassoli, R.; Granelli, F. Cognitive Software-Defined Networking Using Fuzzy Cognitive Maps. IEEE Trans. Cogn. Commun. Netw. 2019, 5, 517–539.
Nápoles, G.; Salgueiro, Y.; Grau, I.; Espinosa, M.L. Recurrence-Aware Long-Term Cognitive Network for Explainable Pattern Classification. IEEE Trans. Cybern. 2021, 53, 6083–6094.
Nápoles, G.; Grau, I.; Concepción, L.; Koutsoviti Koumeri, L.; Papa, J.P. Modeling implicit bias with fuzzy cognitive maps. Neurocomputing 2022, 481, 33–45.
Concepción, L.; Nápoles, G.; Jastrzębska, A.; Grau, I.; Salgueiro, Y. Estimating the limit state space of quasi-nonlinear Fuzzy Cognitive Maps. Appl. Soft Comput. 2025, 169, 112604.
Froelich, W. Towards improving the efficiency of the fuzzy cognitive map classifier. Neurocomputing 2017, 232, 83–93.
Papakostas, G.A.; Koulouriotis, D.E.; Polydoros, A.S.; Tourassis, V.D. Towards Hebbian learning of fuzzy cognitive maps in pattern classification problems. Expert Syst. Appl. 2012, 39, 10620–10629.
Nápoles, G.; Falcon, R.; Papageorgiou, E.; Bello, R.; Vanhoof, K. Rough cognitive ensembles. Int. J. Approx. Reason. 2017, 85, 79–96.
Li, X.; Luo, C. Neighborhood rough cognitive networks. Appl. Soft Comput. 2022, 131, 109796.
Concepción, L.; Nápoles, G.; Grau, I.; Pedrycz, W. Fuzzy-rough cognitive networks: Theoretical analysis and simpler models. IEEE Trans. Cybern. 2020, 52, 2994–3005.
Harmati, I.A. Dynamics of Fuzzy-Rough Cognitive Networks. Symmetry 2021, 13, 881.
Szwed, P. Classification and feature transformation with Fuzzy Cognitive Maps. Appl. Soft Comput. 2021, 105, 107271.
Nápoles, G.; Bello, M.; Salgueiro, Y. Long-term Cognitive Network-based architecture for multi-label classification. Neural Netw. 2021, 140, 39–48.
Quesada, M.; Concepción, L.; Bello, R.; Vanhoof, K. Classification with Low-Level Fuzzy Cognitive Maps. In Computational Intelligence Applied to Decision-Making in Uncertain Environments; Springer: Berlin/Heidelberg, Germany, 2025.
Karatzinis, G.D.; Boutalis, Y.S. Fuzzy cognitive networks with functional weights for time series and pattern recognition applications. Appl. Soft Comput. 2021, 106, 107415.
Yu, T.; Gan, Q.; Feng, G.; Han, G. A new fuzzy cognitive maps classifier based on capsule network. Knowl.-Based Syst. 2022, 250, 108950.
Tyrovolas, M.; Liang, X.; Stylios, C. Information flow-based fuzzy cognitive maps with enhanced interpretability. Granul. Comput. 2023, 8, 2021–2038.
Parginos, K.; Tyrovolas, M.; Bessa, R.J.; San Liang, X.; Stylios, C.; Camal, S.; Kariniotakis, G. Interpretable power grid overload detection with information flow-based fuzzy cognitive maps. CSEE J. Power Energy Syst. 2025, early access.
Tyrovolas, M.; Stylios, C.; Aliev, K.; Antonelli, D. Leveraging Information Flow-Based Fuzzy Cognitive Maps for Interpretable Fault Diagnosis in Industrial Robotics. In Proceedings of the Doctoral Conference on Computing, Electrical and Industrial Systems, Caparica, Portugal, 3–5 July 2024; Springer: Cham, Switzerland, 2024; pp. 98–110.
Yin, R.; Lu, W.; Yang, J. A hypersphere information granule-based fuzzy classifier embedded with fuzzy cognitive maps for classification of imbalanced data. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 8, 175–190.
Homenda, W.; Jastrzebska, A. Time-series classification using fuzzy cognitive maps. IEEE Trans. Fuzzy Syst. 2019, 28, 1383–1394.
Wu, K.; Yuan, K.; Teng, Y.; Liu, J.; Jiao, L. Broad fuzzy cognitive map systems for time series classification. Appl. Soft Comput. 2022, 128, 109458.
Wesołowski, P.; Walasek, K.; Homenda, W.; Ouyang, C.; Yu, F. Time series classification based on fuzzy cognitive maps and multi-class decomposition with ensembling. In Proceedings of the 2023 IEEE International Conference on Fuzzy Systems (FUZZ), Incheon, Republic of Korea, 13–17 August 2023; IEEE: New York, NY, USA, 2023; pp. 1–8.
Bilski, J.M.; Jastrzebska, A. Fuzzy Cognitive Maps and Hidden Markov Models: Comparative Analysis of Efficiency within the Confines of the Time Series Classification Task. In Proceedings of the 2023 IEEE International Conference on Fuzzy Systems (FUZZ), Incheon, Republic of Korea, 13–17 August 2023; IEEE: New York, NY, USA, 2023; pp. 1–7.
Hilal, A.M.; Alsolai, H.; Al-Wesabi, F.N.; Nour, M.K.; Motwakel, A.; Kumar, A.; Yaseen, I.; Zamani, A.S. Fuzzy Cognitive Maps with Bird Swarm Intelligence Optimization-Based Remote Sensing Image Classification. Comput. Intell. Neurosci. 2022, 2022, 4063354.
Sovatzidi, G.; Vasilakakis, M.D.; Iakovidis, D.K. Fuzzy cognitive maps for interpretable image-based classification. In Proceedings of the 2022 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Padua, Italy, 18–23 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–6.
Sovatzidi, G.; Vasilakakis, M.D.; Iakovidis, D.K. Towards the interpretation of multi-label image classification using transformers and fuzzy cognitive maps. In Proceedings of the 2023 IEEE International Conference on Fuzzy Systems (FUZZ), Incheon, Republic of Korea, 13–17 August 2023; IEEE: New York, NY, USA, 2023; pp. 1–7.
Sovatzidi, G.; Vasilakakis, M.; Iakovidis, D. Automatic Fuzzy Cognitive Maps for Explainable Image-based Pneumonia Detection. In Proceedings of the 27th Pan-Hellenic Conference on Progress in Computing and Informatics, Lamia, Greece, 24–26 November 2023; ACM: New York, NY, USA, 2023; pp. 74–78.
Karaköse, E. An Efficient Satellite Images Classification Approach Based on Fuzzy Cognitive Map Integration with Deep Learning Models Using Improved Loss Function. IEEE Access 2024, 12, 141361–141379.
Hoyos, W.; Aguilar, J.; Toro, M. PRV-FCM: An extension of fuzzy cognitive maps for prescriptive modeling. Expert Syst. Appl. 2023, 231, 120729.
Hoyos, W.; Aguilar, J.; Toro, M. Federated learning approaches for fuzzy cognitive maps to support clinical decision-making in dengue. Eng. Appl. Artif. Intell. 2023, 123, 106371.
Salmeron, J.L.; Arévalo, I. Concurrent vertical and horizontal federated learning with fuzzy cognitive maps. Future Gener. Comput. Syst. 2025, 162, 107482.
Salmeron, J.; Arévalo, I. Blind Federated Learning without initial model. J. Big Data 2024, 11, 56.
Gagnon-Dufresne, M.C.; Sarmiento, I.; Cooper, S.; Rahman, M.M.; Ghosh, P.; Andersson, N.; Zinszer, K. Why urban communities in low-and middle-income countries participate in global health research: A scoping review and fuzzy cognitive mapping summary. Soc. Sci. Med. 2026, 393, 119039.
Dhir, V.; Sarmiento, I.; McDonald, I.; Faucher, M.G.; Tremblay, S.A.; Yaffe, M.J.; Andersson, N.; Geddes, M.R. Gender-Related Facilitators and Barriers to Participation in Research on Aging using Fuzzy Cognitive Mapping. Neurobiol. Aging 2026, 162, 1–13.
Nápoles, G.; Vanhoenshoven, F.; Falcon, R.; Vanhoof, K. Nonsynaptic error backpropagation in long-term cognitive networks. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 865–875.
Nápoles, G.; Grau, I.; Salgueiro, Y. Sparseness-optimized feature importance with prior knowledge and reinforcement learning-powered optimization. Neurocomputing 2026, 674, 132925.
Sheng, V.S.; Ling, C.X. Thresholding for Making Classifiers Cost-sensitive. In Proceedings of the 21st National Conference on Artificial Intelligence, Boston, MA, USA, 16–20 July 2006; AAAI Press: Washington, DC, USA, 2006; Volume 1, pp. 476–481.
Lipton, Z.C.; Elkan, C.; Narayanaswamy, B. Optimal Thresholding of Classifiers to Maximize F1 Measure. In Proceedings of the Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014, Nancy, France, 15–19 September 2014; Calders, T., Esposito, F., Hüllermeier, E., Meo, R., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8725, pp. 225–239.
Bénédict, G.; Koops, V.; Odijk, D.; de Rijke, M. sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv 2022, arXiv:2108.10566.
Sanyal, A.; Kumar, P.; Kar, P.; Chawla, S.; Sebastiani, F. Optimizing non-decomposable measures with deep networks. Mach. Learn. 2018, 107, 1597–1620.
Eban, E.; Schain, M.; Mackey, A.; Gordon, A.; Rifkin, R.; Elidan, G. Scalable Learning of Non-Decomposable Objectives. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; Singh, A., Zhu, J., Eds.; Proceedings of Machine Learning Research; PMLR: Fort Lauderdale, FL, USA, 2017; Volume 54, pp. 832–840.
Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Proceedings of Machine Learning Research; PMLR: Sydney, Australia, 2017; Volume 70, pp. 1321–1330.
Minderer, M.; Djolonga, J.; Romijnders, R.; Hubis, F.; Zhai, X.; Houlsby, N.; Tran, D.; Lucic, M. Revisiting the Calibration of Modern Neural Networks. Adv. Neural Inf. Process. Syst. 2021, 34, 15682–15694.
Platt, J. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Adv. Large Margin Classif. 1999, 10, 61–74.
Zadrozny, B.; Elkan, C. Transforming Classifier Scores into Accurate Multiclass Probability Estimates. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; ACM: New York, NY, USA, 2002; pp. 694–699.
Kull, M.; Silva Filho, T.; Flach, P. Beta Calibration: A Well-Founded and Easily Implemented Improvement on Logistic Calibration for Binary Classifiers. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; Proceedings of Machine Learning Research; PMLR: Fort Lauderdale, FL, USA, 2017; Volume 54, pp. 623–631.
Kull, M.; Silva Filho, T.M.; Flach, P. Beyond Sigmoids: How to Obtain Well-Calibrated Probabilities from Binary Classifiers with Beta Calibration. Electron. J. Stat. 2017, 11, 5052–5080.
Nápoles, G.; Grau, I.; Jastrzębska, A.; Salgueiro, Y. Presumably correct decision sets. Pattern Recognit. 2023, 141, 109640.
Byrd, R.H.; Lu, P.; Nocedal, J.; Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 1995, 16, 1190–1208.
Friedman, M. A Comparison of Alternative Tests of Significance for the Problem of m Rankings. Ann. Math. Stat. 1940, 11, 86–92.
Wilcoxon, F. Individual Comparisons by Ranking Methods. Biom. Bull. 1945, 1, 80–83.
Holm, S. A Simple Sequentially Rejective Multiple Test Procedure. Scand. J. Stat. 1979, 6, 65–70.
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
Arık, S.Ö.; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; AAAI Press: Washington, DC, USA, 2021; Volume 35, pp. 6679–6687.
Ullah, I.; Raza, B.; Malik, A.K.; Imran, M.; Islam, S.U.; Kim, S.W. A churn prediction model using random forest: Analysis of machine learning techniques for churn prediction and factor identification in telecom sector. IEEE Access 2019, 7, 60134–60149.
Bhattacharjee, B.; Madhu, U.; Guha, S.K.; Bhadra, S.; Das, P.K.; Samantaray, S.P.; Zubairuddin, M.; Tamboli, S. Neural network approach enhancing churn prediction with categorical encoding and standard scaling. Sci. Rep. 2026, 16, 6274.

Figure 1. Architecture of the rLTCN classifier.

Figure 2. Illustration of the inner workings of the logit correction methods. Each subplot shows the effect of a method on the output for a given decision class as a function of the input logit. The dashed gray line represents the uncorrected output, while the solid blue line shows the adjusted output after applying the corresponding correction method.

Figure 3. Exploratory data analysis of the Orange Telecom churn dataset.

Table 1. Summary of the datasets used in the experiments, reporting the number of features (N), instances (K), and classes (M), together with the class imbalance (Imb) and class separability (Sep).

ID	Dataset	N	K	M	Imb	Sep
D1	banana	2	5300	2	0.19	0.76
D2	bank	16	4521	2	0.87	0.74
D3	cardiotocography-10	35	2126	10	0.91	1.00
D4	cardiotocography-3	35	2126	3	0.89	0.98
D5	mfeat-factors	216	2000	10	0.00	0.92
D6	mfeat-fourier	76	2000	10	0.00	0.62
D7	mfeat-karhunen	64	2000	10	0.00	0.90
D8	mfeat-morphological	6	2000	10	0.00	0.49
D9	mfeat-pixel	240	2000	10	0.00	0.95
D10	mfeat-zernike	47	2000	10	0.00	0.68
D11	musk2	166	6598	2	0.82	0.89
D12	optdigits	62	5620	10	0.03	0.97
D13	page-blocks	10	5473	5	0.99	0.91
D14	pendigits	16	10,992	10	0.08	0.98
D15	plant-margin	64	1600	100	0.00	0.49
D16	plant-shape	64	1600	100	0.00	0.35
D17	plant-texture	64	1599	100	0.06	0.54
D18	segment	18	2310	7	0.00	0.92
D19	spambase	57	4601	2	0.35	0.78
D20	vehicle	18	846	4	0.09	0.43
D21	vehicle0	18	846	2	0.69	0.83
D22	vehicle1	18	846	2	0.65	0.52
D23	vehicle2	18	846	2	0.65	0.86
D24	vehicle3	18	846	2	0.66	0.54
D25	waveform	40	5000	3	0.02	0.46
D26	wine-quality-white	11	4898	7	1.00	0.27
D27	wine-quality-red	11	1599	6	0.98	0.28
D28	yeast	8	1484	10	0.99	0.22
D29	yeast1	8	1484	2	0.59	0.47
D30	yeast3	8	1484	2	0.88	0.87

Table 2. Δ metrics (%) after Temperature correction for each dataset.

Dataset	$\Delta\kappa$	$\Delta F_{1}^{m}$	$\Delta F_{1}^{w}$	$\Delta BA$	$\Delta P$	$\Delta R$	$\Delta$PR-AUC
D1	+0.89	+0.45	+0.41	+0.60	+0.29	+0.60	+0.11
D2	+24.25	+12.96	+2.93	+16.50	+0.00	+16.50	+14.78
D3	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D4	+0.67	+0.36	+0.24	+0.57	+0.11	+0.57	+0.59
D5	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D6	+0.84	+0.94	+0.94	+0.76	+1.45	+0.76	+1.63
D7	+0.28	+0.27	+0.26	+0.26	+0.29	+0.26	+0.49
D8	+1.67	+1.88	+1.87	+1.50	+2.08	+1.50	+2.08
D9	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D10	+2.23	+1.35	+1.35	+2.01	+3.40	+2.01	+2.90
D11	+2.95	+1.48	+0.65	+4.34	+0.00	+4.34	+2.85
D12	+0.49	+0.45	+0.45	+0.45	+0.47	+0.45	+0.83
D13	+22.61	+21.27	+3.55	+30.91	+0.00	+30.91	+19.24
D14	+1.01	+0.91	+0.92	+0.89	+0.88	+0.89	+1.61
D15	+5.70	+6.09	+6.21	+5.42	+5.75	+5.42	+8.21
D16	+10.45	+13.61	+13.35	+10.83	+15.87	+10.83	+15.92
D17	+7.28	+9.35	+9.33	+6.83	+9.23	+6.83	+13.84
D18	+1.27	+1.27	+1.28	+1.08	+1.50	+1.08	+2.03
D19	+0.22	+0.11	+0.11	+0.34	+0.14	+0.34	+0.24
D20	+2.38	+1.45	+1.45	+1.79	+2.11	+1.79	+1.93
D21	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D22	+5.71	+2.86	+1.68	+4.61	+1.08	+4.61	+4.22
D23	+4.84	+2.42	+1.83	+4.14	+0.62	+4.14	+4.58
D24	+5.04	+2.55	+1.27	+5.56	+0.42	+5.56	+3.43
D25	+0.45	+0.30	+0.30	+0.30	+0.27	+0.30	+0.41
D26	+5.84	+6.08	+2.64	+6.50	+0.00	+6.50	+2.83
D27	+5.37	+11.82	+3.03	+10.82	+14.61	+10.82	+8.25
D28	+3.61	+3.00	+2.82	+3.43	+0.34	+3.43	+2.28
D29	+7.46	+4.08	+2.15	+6.50	+0.00	+6.50	+3.75
D30	+6.77	+3.40	+1.01	+7.01	+0.00	+7.01	+7.34

Table 3. Δ metrics (%) after Platt correction for each dataset.

Dataset	$\Delta\kappa$	$\Delta F_{1}^{m}$	$\Delta F_{1}^{w}$	$\Delta BA$	$\Delta P$	$\Delta R$	$\Delta$PR-AUC
D1	+0.89	+0.45	+0.41	+0.60	+0.51	+0.60	+0.81
D2	+24.25	+12.96	+2.93	+16.50	−2.77	+16.50	+14.78
D3	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D4	+0.67	+0.36	+0.24	+0.57	+0.11	+0.57	+0.59
D5	+0.28	+0.25	+0.25	+0.25	+0.23	+0.25	+0.46
D6	+1.39	+1.28	+1.28	+1.26	+1.89	+1.26	+2.01
D7	+0.28	+0.26	+0.26	+0.26	+0.26	+0.26	+0.47
D8	+1.67	+1.88	+1.87	+1.50	+2.10	+1.50	+2.08
D9	+0.28	+0.25	+0.25	+0.26	+0.24	+0.26	+0.47
D10	+1.67	+1.97	+1.96	+1.51	+2.54	+1.51	+2.88
D11	+3.17	+1.59	+0.68	+5.35	−1.09	+5.35	+2.97
D12	+0.79	+0.72	+0.72	+0.71	+0.76	+0.71	+1.33
D13	+21.24	+20.59	+3.43	+25.10	+6.24	+25.10	+18.66
D14	+1.01	+0.92	+0.92	+0.89	+0.90	+0.89	+1.64
D15	+7.92	+8.83	+8.78	+7.83	+7.87	+7.83	+12.88
D16	+11.09	+16.53	+16.30	+11.33	+21.92	+11.33	+19.95
D17	+6.97	+9.46	+9.31	+6.83	+10.56	+6.83	+14.89
D18	+1.01	+1.07	+1.07	+0.86	+1.34	+0.86	+1.68
D19	+0.22	+0.11	+0.11	+0.34	+0.14	+0.34	+0.24
D20	+1.58	+1.36	+1.34	+1.19	+2.56	+1.19	+1.85
D21	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D22	+6.94	+3.48	+2.22	+5.01	+1.86	+5.01	+5.32
D23	+4.84	+2.42	+1.83	+4.14	+0.62	+4.14	+4.58
D24	+5.04	+2.55	+1.27	+4.37	+0.42	+4.37	+3.43
D25	+0.45	+0.30	+0.30	+0.30	+0.27	+0.30	+0.41
D26	+5.78	+6.16	+2.71	+6.23	+0.27	+6.23	+3.17
D27	+6.44	+12.89	+3.55	+10.36	+26.02	+10.36	+11.13
D28	+4.42	+4.02	+3.61	+4.27	+2.56	+4.27	+4.16
D29	+7.46	+4.08	+2.15	+6.31	−1.01	+6.31	+3.75
D30	+6.77	+3.40	+1.01	+7.01	−0.30	+7.01	+7.34

Table 4. Δ metrics (%) after Beta correction for each dataset.

Dataset	$\Delta\kappa$	$\Delta F_{1}^{m}$	$\Delta F_{1}^{w}$	$\Delta BA$	$\Delta P$	$\Delta R$	$\Delta$PR-AUC
D1	+0.89	+0.45	+0.41	+0.60	+0.51	+0.60	+0.81
D2	+24.25	+12.96	+2.93	+16.50	−2.77	+16.50	+14.78
D3	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D4	+0.67	+0.36	+0.24	+0.57	+0.11	+0.57	+0.59
D5	+0.28	+0.25	+0.25	+0.25	+0.23	+0.25	+0.46
D6	+1.39	+1.22	+1.22	+1.26	+1.94	+1.26	+2.01
D7	+0.56	+0.51	+0.51	+0.51	+0.53	+0.51	+0.95
D8	+3.06	+3.19	+3.18	+2.76	+3.91	+2.76	+3.96
D9	+0.56	+0.51	+0.50	+0.51	+0.50	+0.51	+0.95
D10	+1.67	+1.50	+1.50	+1.51	+2.03	+1.51	+2.37
D11	+3.17	+1.59	+0.68	+5.35	−1.86	+5.35	+2.97
D12	+0.59	+0.54	+0.54	+0.53	+0.56	+0.53	+0.99
D13	+21.29	+21.45	+3.45	+29.90	+7.14	+29.90	+19.74
D14	+1.16	+1.04	+1.05	+1.02	+1.02	+1.02	+1.87
D15	+8.87	+9.59	+9.63	+8.67	+8.28	+8.67	+13.82
D16	+10.45	+15.39	+15.13	+10.75	+20.64	+10.75	+18.57
D17	+6.65	+9.24	+9.07	+6.50	+10.42	+6.50	+14.42
D18	+1.27	+1.25	+1.25	+1.08	+1.44	+1.08	+1.94
D19	+0.22	+0.11	+0.11	+0.34	+0.14	+0.34	+0.24
D20	+2.38	+1.45	+1.45	+1.79	+2.11	+1.79	+1.93
D21	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D22	+6.94	+3.48	+2.22	+5.01	+1.86	+5.01	+5.32
D23	+4.84	+2.42	+1.83	+4.14	+0.62	+4.14	+4.58
D24	+5.04	+2.55	+1.27	+4.37	+0.42	+4.37	+3.43
D25	+0.45	+0.30	+0.30	+0.30	+0.27	+0.30	+0.41
D26	+5.97	+6.11	+2.57	+6.30	−0.92	+6.30	+3.06
D27	+7.10	+12.73	+4.12	+10.33	+22.06	+10.33	+10.14
D28	+4.85	+4.12	+3.92	+4.37	+2.66	+4.37	+4.26
D29	+7.46	+4.08	+2.15	+6.31	−1.01	+6.31	+3.75
D30	+6.77	+3.40	+1.01	+10.22	−0.86	+10.22	+7.34

Table 5. Δ metrics (%) after Shifting correction for each dataset.

Dataset	$\Delta\kappa$	$\Delta F_{1}^{m}$	$\Delta F_{1}^{w}$	$\Delta BA$	$\Delta P$	$\Delta R$	$\Delta$PR-AUC
D1	+0.89	+0.45	+0.41	+0.60	+0.29	+0.60	+0.11
D2	+24.25	+12.96	+2.93	+16.50	+0.00	+16.50	+14.78
D3	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D4	+0.67	+0.36	+0.24	+0.57	+0.11	+0.57	+0.59
D5	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D6	+1.67	+1.43	+1.44	+1.51	+2.33	+1.51	+2.44
D7	+0.28	+0.26	+0.26	+0.26	+0.28	+0.26	+0.47
D8	+1.67	+1.88	+1.87	+1.50	+2.08	+1.50	+2.08
D9	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D10	+0.56	+1.10	+1.09	+0.51	+1.42	+0.51	+1.43
D11	+3.17	+1.59	+0.68	+5.35	+0.00	+5.35	+2.97
D12	+0.30	+0.28	+0.28	+0.27	+0.34	+0.27	+0.53
D13	+20.85	+20.08	+3.35	+25.08	+0.00	+25.08	+17.75
D14	+1.01	+0.91	+0.91	+0.89	+0.90	+0.89	+1.63
D15	+8.23	+8.99	+9.03	+8.17	+7.35	+8.17	+12.94
D16	+12.66	+16.01	+15.73	+12.75	+18.03	+12.75	+19.50
D17	+6.01	+7.27	+7.15	+5.83	+6.35	+5.83	+10.82
D18	+1.01	+1.11	+1.11	+0.86	+1.34	+0.86	+1.80
D19	+0.22	+0.11	+0.11	+0.09	+0.14	+0.09	+0.24
D20	+1.58	+1.36	+1.34	+1.19	+2.56	+1.19	+1.85
D21	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00	+0.00
D22	+5.71	+2.86	+1.68	+4.98	+1.08	+4.98	+4.22
D23	+4.84	+2.42	+1.83	+4.14	+0.62	+4.14	+4.58
D24	+5.04	+2.55	+1.27	+6.35	+0.42	+6.35	+3.43
D25	+0.15	+0.10	+0.10	+0.10	+0.09	+0.10	+0.13
D26	+5.87	+5.35	+2.58	+5.46	+0.32	+5.46	+2.64
D27	+7.07	+6.56	+4.39	+4.82	+11.10	+4.82	+6.78
D28	+2.39	+3.02	+2.28	+2.57	+0.24	+2.57	+2.58
D29	+7.46	+4.08	+2.15	+6.31	+0.00	+6.31	+3.75
D30	+6.77	+3.40	+1.01	+7.01	+0.00	+7.01	+7.34
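
The per-dataset deltas in Tables 2–5 can be compared pairwise between correction methods by counting wins, ties, and losses. The sketch below is an illustration rather than the authors' code: the helper name `win_tie_loss` is hypothetical, and it assumes that the comparison is made on the macro-averaged F1 column ($\Delta F_{1}^{m}$). The values are copied from Tables 4 (Beta) and 5 (Shifting). Because both deltas are measured against the same uncorrected baseline, comparing deltas is equivalent to comparing the corrected scores.

```python
# Delta macro-F1 (%) per dataset D1..D30, copied from Table 4 (Beta).
beta_f1m = [
    0.45, 12.96, 0.00, 0.36, 0.25, 1.22, 0.51, 3.19, 0.51, 1.50,
    1.59, 0.54, 21.45, 1.04, 9.59, 15.39, 9.24, 1.25, 0.11, 1.45,
    0.00, 3.48, 2.42, 2.55, 0.30, 6.11, 12.73, 4.12, 4.08, 3.40,
]
# Delta macro-F1 (%) per dataset D1..D30, copied from Table 5 (Shifting).
shifting_f1m = [
    0.45, 12.96, 0.00, 0.36, 0.00, 1.43, 0.26, 1.88, 0.00, 1.10,
    1.59, 0.28, 20.08, 0.91, 8.99, 16.01, 7.27, 1.11, 0.11, 1.36,
    0.00, 2.86, 2.42, 2.55, 0.10, 5.35, 6.56, 3.02, 4.08, 3.40,
]


def win_tie_loss(first, second, tol=1e-9):
    """Count datasets where `first` beats, ties with, or loses to `second`."""
    wins = ties = losses = 0
    for a, b in zip(first, second):
        if abs(a - b) <= tol:   # equal up to rounding of the reported values
            ties += 1
        elif a > b:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


wtl = win_tie_loss(beta_f1m, shifting_f1m)
print("Beta vs. Shifting (W/T/L):", wtl)
```

Under the stated assumption, this tally gives 17 wins, 11 ties, and 2 losses for Beta over Shifting, with the two losses occurring on D6 and D16.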

Table 6. Wilcoxon signed-rank test with Holm’s correction for all pairs of correction methods. The null hypothesis $H_{0}$ states that the compared methods perform equally well. W/T/L counts wins, ties, and losses of Method 1 over Method 2. Cohen’s d measures the effect size, where a positive value favors Method 1 and a negative value favors Method 2.

Method 1	Method 2	p-Value	Holm	$H_{0}$	W/T/L	d
Temperature	Platt	0.2552	0.6624	Not rejected	6/14/10	−0.223
Temperature	Beta	0.0626	0.2774	Not rejected	3/14/13	−0.352
Temperature	Shifting	0.8261	0.8261	Not rejected	8/16/6	−0.002
Platt	Beta	0.0555	0.2774	Not rejected	3/16/11	−0.370
Platt	Shifting	0.2208	0.6624	Not rejected	9/16/5	+0.228
Beta	Shifting	0.0048	0.0290	Rejected	17/11/2	+0.384
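
To make the Holm column easier to read, the following sketch applies the standard Holm step-down adjustment to the raw p-values listed in Table 6. It is an illustration under stated assumptions, not the authors' implementation: the function name `holm_adjust` is hypothetical, and the inputs are the p-values as rounded in the table, so the output can differ from the reported values in the fourth decimal place.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the hypotheses sorted by ascending raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Raw p-values as printed in Table 6 (rounded to four decimals).
pairs = [
    ("Temperature", "Platt", 0.2552),
    ("Temperature", "Beta", 0.0626),
    ("Temperature", "Shifting", 0.8261),
    ("Platt", "Beta", 0.0555),
    ("Platt", "Shifting", 0.2208),
    ("Beta", "Shifting", 0.0048),
]
holm = holm_adjust([p for _, _, p in pairs])

for (first, second, p), p_adj in zip(pairs, holm):
    decision = "Rejected" if p_adj < 0.05 else "Not rejected"
    print(f"{first} vs. {second}: p={p:.4f}, Holm={p_adj:.4f}, H0 {decision}")
```

With a significance level of 0.05 (an assumption of this sketch), only the Beta versus Shifting comparison remains significant after adjustment. The Temperature versus Beta pair inherits the adjusted value of the Platt versus Beta pair because of the monotonicity step.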

Table 7. Wilcoxon signed-rank test with Holm’s correction for FCM-based classifiers. The null hypothesis

H_{0}