1. Introduction
A customized score function is an evaluation function defined specifically for a problem when the default success or performance criteria used in machine learning or statistical models do not fully meet its needs. It is particularly useful when the problem is asymmetric, for example when false negatives have more serious consequences than false positives, when the defined target differs from the standard one, or when the optimization problem is set up with a profit target that is to be maximized. A score function can generally be written as $S(y, \hat{y})$, where $y$ is the real value and $\hat{y}$ is the estimated value. This function may be maximized or minimized according to the purpose.
A customized score function should not be regarded only as a loss function, because such functions are used widely and in many forms in the literature. Ref. [1] enriched logistic regression classifiers by adding “nonlinear decision-tree effects.” In addition to default accuracy measures, they evaluated performance using problem-oriented metrics such as AUC and KS statistics. Ref. [2] proposed a dynamic ensemble method based on flexible probability estimations and preferred risk-oriented combinations such as penalized AUC or log-loss as customized score functions. Ref. [3] defined a noise-adaptive two-layer ensemble model and used a penalty mechanism that emphasizes misclassifications, especially false negatives, as a customized score function during model training. Ref. [4] used logistic regression, SVM, and tree-based algorithms together with criterion-based evaluation metrics such as Information Gain and Gain Ratio, and employed different score functions such as accuracy, AUC, KS, and explainability.
The importance of these studies in the literature lies in the use of problem-focused metrics, the support of decision mechanisms with customized scoring functions targeting incorrect classes, and the preference for domain-specific evaluation metrics instead of classical metrics in model training and evaluation. These studies provide important tools that offer flexibility in machine learning.
Feature selection has long been recognized as a critical step in high-dimensional learning problems, particularly for improving model interpretability, generalization, and computational efficiency. One of the earliest contributions to performance evaluation in classification tasks is the Matthews correlation [5], which set a foundation for robust statistical assessment. Over time, various feature selection strategies have been proposed, many of which focus on the statistical dependency between predictors and the target. Ref. [6] pioneered the use of mutual information as a criterion to quantify this relevance, framing the selection task in information-theoretic terms. This was further refined by [7], who introduced the widely used minimum redundancy maximum relevance (mRMR) principle, formalizing the trade-off between capturing informative features and avoiding redundancy. These approaches laid the groundwork for the relevance/redundancy paradigm that underpins many contemporary selection frameworks. However, most of them rely on static heuristics or greedy inclusion rules that are sensitive to noise and local optima. This study seeks to overcome these limitations by introducing a stochastic optimization-based selection mechanism.
Correlation-based feature selection filters features according to high correlation thresholds between financial ratios, and model performance can be evaluated using a correlation-based metric such as the Matthews correlation coefficient (MCC) [8]. That study compares four feature selection methods in the context of credit scoring: ReliefF, correlation-based, consistency-based, and wrapper algorithms. Redundant variables are eliminated by detecting high correlations between features, which are identified by examining the input variables using Pearson correlation. The simplicity of the model is evaluated along with its effects on training speed and accuracy. Based on correlation analysis, a custom score function for problems such as credit risk should rest on two main principles: first, the relationship between the feature and the target variable should be strong; second, the features should be independent of each other or have low correlations.
Correlation-based feature selection methods have played a central role in numerous applications, particularly in bioinformatics, where data dimensionality and noise levels are typically high. Ref. [9] provides a comprehensive review of such methods, highlighting their strengths and weaknesses in biological data analysis. Despite their widespread use, these approaches often face challenges related to the stability of selected feature subsets, especially under small sample sizes or data perturbations, as discussed by [10]. These developments illustrate the ongoing interest in refining correlation-based selection strategies, yet most lack mechanisms to jointly address redundancy and relevance under stochastic dynamics, an aspect that this study aims to advance.
While existing feature selection techniques, such as correlation thresholding, mutual information filtering, greedy inclusion, or $\ell_1$-regularized models, have shown partial success in identifying informative variables, they often fail to jointly optimize relevance and redundancy in a balanced and data-adaptive manner. Most rely on deterministic and local strategies, which may converge to suboptimal subsets, especially in high-dimensional and noisy settings. The approach proposed in this study addresses these limitations by integrating a customized correlation-based scoring function with a stochastic differential equation (SDE) framework that combines gradient ascent with controlled exploration. This dual mechanism not only enforces sparsity but also facilitates global search, allowing the method to escape local optima and capture more informative variable configurations. To the best of our knowledge, this is the first attempt to embed stochastic optimization dynamics directly into the correlation-based feature selection process, offering a methodological bridge between statistical dependence measures and dynamic selection pathways.
2. Model Structure and Details
Before discussing the mathematical background, we first address two important principles used in this field: maximum relevance and minimum redundancy. Ref. [11] presents an information theory-based method for selecting the most informative and least redundant variables from a high-dimensional dataset. The maximum relevance principle can be expressed mathematically as follows:
$$\mathrm{Rel}(S) = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c)$$
Here $S$ is the set of selected variables, $I(x_i; c)$ is the mutual information between variable $x_i$ and the class label, and $c$ is the class label. In practical terms, $\mathrm{Rel}(S)$ answers the question of how much information the selected variables provide about the class. The maximum relevance principle is therefore entropy-based in practice. The minimum redundancy principle can be expressed mathematically as follows:
$$\mathrm{Red}(S) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)$$
The aim here is to eliminate variables that convey the same information. Accordingly, the minimum redundancy maximum relevance (mRMR) criterion is defined by the following maximization problem [11]:
$$\max_{S} \left[ \mathrm{Rel}(S) - \mathrm{Red}(S) \right]$$
Here, mRMR analysis is performed based on entropy. This method has been successfully applied in the variable selection of credit risk estimation models.
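As a minimal sketch of the entropy-based criterion, the snippet below estimates mutual information from empirical frequencies of discrete (integer-coded) variables and evaluates relevance minus redundancy for a candidate subset. The function names `mutual_information` and `mrmr_score`, the plug-in frequency estimator, and the averaging over all pairs (including a variable with itself) are assumptions made for illustration; continuous variables would first need to be discretized.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in estimate of I(a; b) in nats for two discrete 1-D arrays."""
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1.0)          # contingency table of co-occurrences
    joint /= joint.sum()                      # empirical joint distribution
    pa = joint.sum(axis=1, keepdims=True)     # marginal of a, shape (A, 1)
    pb = joint.sum(axis=0, keepdims=True)     # marginal of b, shape (1, B)
    nz = joint > 0                            # skip empty cells (0 * log 0 = 0)
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def mrmr_score(Xd, c, subset):
    """Rel(S) - Red(S) for discrete features Xd (n x p) and class labels c."""
    subset = list(subset)
    rel = np.mean([mutual_information(Xd[:, j], c) for j in subset])
    red = np.mean([mutual_information(Xd[:, i], Xd[:, j])
                   for i in subset for j in subset])
    return rel - red
```

With two informative features that are exact copies of each other, the redundancy term equals their entropy, so a subset that pairs one of them with a distinct informative feature scores higher.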
2.1. Correlation-Based Customized Score Function
Our goal at this point is to maximize the correlation. Why do we need such a goal? Consider briefly the structure of a linear model. Mathematically, we want the linear combination of explanatory variables, that is, the factors, to represent the response variable clearly. In this case, the correlation between the linear combination of factors and the response variable should be maximized. This idea can be expressed in the following mathematical form:
$$\mathrm{Rel}(S) = \sum_{j \in S} \left| \rho(X_j, Y) \right|$$
Here, $X_j$ is the j-th variable vector, and $Y$ is the response variable; $\rho(X_j, Y)$ is the Pearson correlation coefficient between the j-th variable and the response variable. At the same time, if the selected variables are highly correlated and there is duplication of information, we want to penalize this situation:
$$\mathrm{Red}(S) = \sum_{i, j \in S,\ i < j} \left| \rho(X_i, X_j) \right|$$
As a result, the goal is to maximize the following:
$$\mathrm{Score}(S) = \mathrm{Rel}(S) - \lambda\, \mathrm{Red}(S)$$
Here $\lambda \ge 0$ is the penalty coefficient and determines how strongly the dependency between variables is penalized. Alternatively, the following convex combination can be written:
$$\mathrm{Score}_\alpha(S) = \alpha\, \mathrm{Rel}(S) - (1 - \alpha)\, \mathrm{Red}(S), \quad \alpha \in [0, 1]$$
This information is particularly important in variable selection algorithms, in explaining post-model analyses, and in evaluating financial indicators. An important question here is how the set $S$ is determined. Should it be considered as an input or as an output? If $S$ is a set of pre-selected variables, the customized score function is calculated only for these selected variables. When it is taken as an output, the maximization problem must be solved. In this case, the solution can be implemented in different ways: all combinations of variables can be tried; the variable that increases the score the most can be added at each step; or heuristic methods such as genetic algorithms can be used.
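The step-wise option mentioned above (adding, at each step, the variable that increases the score the most) can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure of the study: relevance is taken as the sum of absolute Pearson correlations with the response, redundancy as the sum of absolute pairwise correlations among the selected variables, and the names `corr_score` and `greedy_select` are hypothetical.

```python
import numpy as np

def corr_score(X, y, subset, lam=0.5):
    """Rel(S) - lam * Red(S) with absolute Pearson correlations (assumed form)."""
    subset = list(subset)
    if not subset:
        return 0.0
    k = len(subset)
    # Correlation matrix of the selected columns plus the response (last column).
    C = np.corrcoef(np.column_stack([X[:, subset], y]), rowvar=False)
    rel = np.abs(C[:-1, -1]).sum()
    iu = np.triu_indices(k, k=1)              # each feature pair counted once
    red = np.abs(C[:k, :k][iu]).sum()
    return rel - lam * red

def greedy_select(X, y, lam=0.5):
    """Forward selection: stop when no remaining variable improves the score."""
    remaining = set(range(X.shape[1]))
    selected, best = [], 0.0
    while remaining:
        cand = {j: corr_score(X, y, selected + [j], lam) for j in remaining}
        j_star = max(cand, key=cand.get)
        if cand[j_star] <= best:
            break
        selected.append(j_star)
        best = cand[j_star]
        remaining.remove(j_star)
    return selected, best
```

A larger `lam` makes the procedure reject near-duplicate variables more aggressively; with `lam = 0` it simply ranks variables by absolute correlation with the response and keeps all of them.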
2.2. t-Distributed Stochastic Neighbor Embedding
t-SNE is a powerful dimensionality reduction technique used to visualize high-dimensional data by reducing it to low dimensions such as 2D or 3D [12]. This method operates on the observations, not the variables. Let the data matrix be $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of observations and $p$ is the number of variables. t-SNE defines similarity between the rows of the data matrix. It takes each observation $x_i$, that is, each row of the data matrix, and places the observations in the low-dimensional space in a way that preserves the similarities between them. This procedure is quite different from classical discriminant analysis. The difference arises from the fact that the analysis uses an entropy-based criterion, perplexity. The relative similarity of each data point to the other data points is measured as follows:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
It is not easy to interpret this measurement. The authors of [12] have left its meaning to the reader’s intuition due to difficulties in providing an explanation. Here, the intuition is expected to be guided by the Gaussian distribution, because the fraction appears to represent a conditional probability derived from it. Our aim is to answer the question: given a selected observation $x_i$ as the mean, which observations fall within a fixed standard deviation of the Gaussian distribution?
It is clear that when the standard deviation is small, fewer observations will fall within this range, and when it is large, more observations will. We should also consider the concept of entropy, which measures the amount of information, when deciding which observations fall around the mean. After determining the number of observations that will fall around each cluster center, we begin with a fixed standard deviation. Then, perplexity is calculated based on the selection of the cluster center:
$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}$$
If the $\mathrm{Perp}(P_i)$ value calculated according to the selected standard deviation is greater than the target value, more observations are clustered than desired. In this case, the standard deviation is reduced. In the opposite case, the standard deviation is increased. In this way, $p_{ij}$ scores are obtained with the help of the calculated $p_{j|i}$ values:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
As a result, the decision criterion is as follows:
$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \qquad q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
Here, $y_i$ represents the i-th observation after t-SNE. The value of $C$ is the Kullback–Leibler (KL) divergence between the two distributions. A small value of this function indicates that the similarities in the low-dimensional representation are very close to those in the high-dimensional representation.
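The adjustment of the standard deviation toward a target perplexity can be sketched as a bisection search for a single observation. The helper names, the search bracket, and the tolerance below are assumptions chosen for illustration; production implementations typically search over the precision instead and vectorize across observations.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """Gaussian conditional similarities p_{j|i} from squared distances to point i."""
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits -= logits.max()                    # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """2 ** H(p), with the entropy H measured in bits."""
    nz = p[p > 0]
    return 2.0 ** (-(nz * np.log2(nz)).sum())

def calibrate_sigma(sq_dists_i, target, tol=1e-4, max_iter=100):
    """Bisection on sigma: shrink it if perplexity is too high, grow it otherwise."""
    lo, hi = 1e-10, 1e4                       # assumed search bracket
    sigma = 0.5 * (lo + hi)
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        perp = perplexity(conditional_probs(sq_dists_i, sigma))
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma                        # too many effective neighbours
        else:
            lo = sigma                        # too few effective neighbours
    return sigma
```

The search works because perplexity increases monotonically with the standard deviation, from roughly one effective neighbour to all available neighbours.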
An important question here is why the t-distribution with one degree of freedom is chosen as the second distribution. Why are Gaussian and t-distributions compared using the KL divergence? This is because the core of the t-SNE algorithm lies in comparing the similarities defined by the Gaussian distribution in the high-dimensional space with those defined by the t-distribution in the low-dimensional space using the KL divergence.
The Gaussian distribution has narrow tails, meaning that distant points are considered unimportant. In contrast, the t-distribution has heavy tails, so even distant points retain some interaction in the low-dimensional space. This property helps solve the crowding problem: in a low-dimensional space, not all distant points are forced into the center as they can remain separated.
As a result, this analysis is not based on classification; instead, it transforms a p-dimensional dataset into a visually interpretable dataset with two or three components. The essence of the process is to create a new 2- or 3-component dataset that minimizes the KL divergence, denoted by C.
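To make the criterion concrete, the following sketch computes the heavy-tailed low-dimensional similarities from an embedding and evaluates the KL divergence against a given high-dimensional similarity matrix. The function names are hypothetical, and the matrix `P` is assumed to be symmetric, to have a zero diagonal, and to sum to one.

```python
import numpy as np

def student_t_similarities(Y):
    """Low-dimensional similarities q_ij from a Student-t kernel (1 degree of freedom)."""
    sq = (Y ** 2).sum(axis=1)
    D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    num = 1.0 / (1.0 + D)
    np.fill_diagonal(num, 0.0)                # q_ii is defined as zero
    return num / num.sum()

def kl_cost(P, Q, eps=1e-12):
    """KL(P || Q) summed over the pairs with p_ij > 0."""
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))).sum())
```

The cost is zero only when the two similarity matrices coincide, which is the sense in which a small value indicates a faithful embedding.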
Different versions of the classical t-SNE algorithm also exist in the literature. In one improved version, weighted proximity is introduced according to cluster centers by adding a weighted distance function to the structure. In this way, an embedding that preserves both local and cluster structures more accurately can be achieved [13]. This idea is particularly interesting. In the remainder of this article, we implement a similar structure using a customized score function.
2.3. Weighted t-SNE Based on Feature Scores: A Rel/Red-Based Embedding Approach
High-dimensional datasets are among the most frequently encountered challenges in modern data analysis. Such data structures increase computational complexity and reduce interpretability in terms of both visualization and learning algorithms. Dimensionality reduction methods are used as a fundamental tool to reduce this complexity and reveal meaningful structures. In particular, the t-distributed Stochastic Neighbor Embedding (t-SNE) method has emerged as a successful technique that aims to preserve local neighborhood relationships between high-dimensional observations by projecting them onto two- or three-dimensional spaces. The success of t-SNE lies in its ability to keep similar data points close together in a low-dimensional space. However, the classical t-SNE algorithm treats each feature as equally important and does not account for information density or multiple correlations between features. In many real-world datasets, some variables have strong correlations with the target classes, while others merely duplicate the information contained in other variables. This situation can lead to artificial clusters or information loss in the embedding space produced by t-SNE. To address this, we propose a weighted t-SNE approach that considers the relative importance of features. A customized score function is defined by considering the relevance (Rel) of each feature with respect to the class label and the redundancy (Red) of each feature with other features. The distance computation in t-SNE is then weighted according to this score. Thus, informative features have a greater influence on the embedding process, while features with high redundancy and low information density have reduced influence. The proposed approach contributes to both feature selection and meaningful data visualization, thanks to the Rel/Red-based weighted distance function integrated into the classical t-SNE algorithm. 
Experimental results show that the proposed method provides better separability and interpretability than classical t-SNE and other dimensionality reduction techniques, for both artificial and real datasets. Let us now briefly present the mathematical formulation of this structure. Let the weights in the distance computation be defined as follows:
$$w_j = \frac{\mathrm{Rel}_j}{\mathrm{Red}_j} = \frac{|\rho(X_j, c)|}{\frac{1}{p-1} \sum_{k \neq j} |\rho(X_j, X_k)|}, \qquad d_w(x_i, x_l)^2 = \sum_{j=1}^{p} w_j \left(x_{ij} - x_{lj}\right)^2$$
Here, $c$ denotes the class label. All other operations are carried out in accordance with the standard t-SNE algorithm. In this way, meaningful structures are better preserved in 2D and 3D visualizations, the effect of noisy variables is reduced, and variable scoring and visualization are performed simultaneously, facilitating interpretation in complex and high-dimensional datasets. In short, the method provides a more meaningful representation capability.
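A minimal sketch of the weighting step is given below, assuming that each feature weight is the ratio of its absolute correlation with the class label to its mean absolute correlation with the other features, normalized to sum to one. The names `rel_red_weights` and `weighted_sq_dists`, the normalization, and the small constant `eps` are illustrative assumptions. The resulting distance matrix would replace the Euclidean distances in the Gaussian similarities of a standard t-SNE implementation.

```python
import numpy as np

def rel_red_weights(X, c, eps=1e-12):
    """Per-feature weights Rel_j / Red_j, normalized to sum to one (assumed form)."""
    p = X.shape[1]
    C = np.abs(np.corrcoef(np.column_stack([X, c]), rowvar=False))
    rel = C[:p, p]                                     # |corr(X_j, c)|
    red = (C[:p, :p].sum(axis=1) - 1.0) / (p - 1)      # mean |corr| with other features
    w = rel / (red + eps)
    return w / w.sum()

def weighted_sq_dists(X, w):
    """Pairwise weighted squared distances sum_j w_j (x_ij - x_lj)^2."""
    Z = X * np.sqrt(w)                                 # scaling columns is equivalent
    sq = (Z ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)
```

Because the weights enter only through column scaling, the same effect can be obtained by multiplying each standardized feature by the square root of its weight before calling an off-the-shelf t-SNE routine.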
2.4. Gain Function Instead of Loss Function in Linear Model
In high-dimensional regression problems, determining the optimal subset of explanatory variables that can establish significant statistical relationships with the target variable is of central importance for both model performance and interpretability. In this context, we propose a framework that goes beyond classical linear modeling techniques and directly optimizes the relationship between the target variable and the linear combination of explanatory variables using the Pearson correlation coefficient. The proposed method creates a linear combination of explanatory variables via the weight vector, that is, the coefficient vector of the linear model, $w \in \mathbb{R}^p$, and aims to maximize the correlation of this combination with the target variable $Y$. At the same time, a penalty term based on the $\ell_1$-norm of the weight vector is added in order to promote sparsity of the model. Thus, the variable selection process is not performed explicitly via a binary selection vector, but directly via the elements of the weight vector:
$$\max_{w \in \mathbb{R}^p} J(w) = \rho(Xw, Y) - \lambda \|w\|_1$$
Here, $\rho(Xw, Y)$ represents the Pearson correlation coefficient between the estimated values and the target variable, and the term $\lambda \|w\|_1$ represents a sparsity penalty applied to the complexity of the model. This structure enables the construction of a linear representation guided by a correlation-based criterion while simultaneously optimizing variable selection. The developed model aims to build a low-dimensional explanatory subspace with high linear representation power, particularly in high-dimensional data structures where correlations between variables are significant. Moreover, because the $\ell_1$-norm is differentiable everywhere except at zero and admits a subgradient there, this structure can be solved using (sub)gradient-based continuous optimization algorithms.
In high-dimensional statistical modeling and feature selection problems, objective functions that directly maximize the relationship between the target variable and linear combinations of explanatory variables are of great interest, especially in domains where interpretability is crucial. Such objective functions are typically nonlinear and differentiable but may exhibit multiple local extrema. In these cases, classical optimization techniques—such as deterministic gradient descent or coordinate-based methods—may yield suboptimal solutions by becoming trapped in local minima or plateau regions.
In this context, stochastic optimization approaches, particularly those based on stochastic differential equations (SDEs), provide effective optimization on high-dimensional, complex surfaces by simultaneously moving in the direction of the gradient of the objective function and exploring a wider region of the solution space [14]. These methods incorporate a random term, such as Brownian motion, into the deterministic dynamics, enabling the process to escape potential traps (local extrema). Furthermore, this approach has been widely applied in fields such as energy-based learning, Bayesian inference, and stochastic gradient Langevin dynamics [14,15]. The objective function $J(w)$ is defined to maximize the Pearson correlation coefficient between the target variable and the linear combination of explanatory variables:
$$J(w) = \rho(Xw, Y) - \lambda \|w\|_1$$
At the same time, sparse modeling is achieved by applying an $\ell_1$-norm penalty to the weights of the explanatory variables. Thus, the objective function includes both a fitness term that measures the linear correlation strength and a regularization term that limits the number of variables. Such a structure offers a meaningful feature selection strategy, especially in datasets that contain a large number of variables of which only some are significant. In the proposed model, the weight vector $w_t$ is considered as a stochastic process that evolves over time, and this process is modeled with the following general SDE form:
$$dw_t = \nabla_w J(w_t)\, dt + \sigma\, dB_t$$
In this formulation, the deterministic component, the gradient term, ensures that the process moves in the direction of the steepest increase of the objective function, while the diffusion term, $\sigma\, dB_t$, allows the system to explore a wider solution space. The distinction between ensemble-level expectations and single-trajectory outcomes in stochastic selection dynamics may reflect nonergodic behavior, especially when optimization relies on noisy gradient flows with random perturbations. A closely related phenomenon is studied by [16], who show that geometric Brownian motion with Poissonian resetting exhibits a clear mismatch between time-averaged and ensemble-averaged observables. Their findings highlight how intermittent resets in stochastic processes can fundamentally alter long-term statistical properties, including the emergence of non-stationarity and ergodicity breaking, features that are also relevant when modeling sparse feature selection under stochastic differential equation dynamics. The diffusion term increases the probability of reaching a global maximum, especially in multi-modal functions. For the numerical solution of the equation, the Euler–Maruyama method is employed to simulate the process through iterative updates over time. In this way, the correlation maximization and variable selection problems can be solved holistically within a differentiable stochastic optimization framework. The gradient of the objective function consists of two parts:
$$\nabla_w J(w) = \nabla_w \rho(Xw, Y) - \lambda\, \partial \|w\|_1$$
This derivative can be derived analytically under differentiability conditions:
$$\nabla_w \rho(Xw, Y) = \tilde{X}^\top \left( \frac{\tilde{Y}}{\|u\|\, \|\tilde{Y}\|} - \rho(Xw, Y)\, \frac{u}{\|u\|^2} \right), \qquad u = \tilde{X} w$$
where $\tilde{X}$ and $\tilde{Y}$ denote the column-centered data matrix and the centered target. $\mathrm{sign}(w)$ is a subgradient of the $\ell_1$-norm; so, in this case, the subgradient method or a smoothed $\ell_1$ term can be used. Since this stochastic differential equation cannot be solved directly, an approximate solution is obtained by the Euler–Maruyama method:
$$w_{k+1} = w_k + \nabla_w J(w_k)\, \Delta t + \sigma \sqrt{\Delta t}\, \xi_k, \qquad \xi_k \sim \mathcal{N}(0, I)$$
Here $\xi_k$ is the Gaussian noise that is resampled at each iteration. This iterative structure contains both deterministic gradient ascent and random exploration components. In the following section, we modify this structure so that the coefficient vector consists of zeros and ones.
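The iterative scheme above can be sketched in a few lines. The code below is an illustration under assumptions, not the implementation used in the study: it uses the analytical gradient of the Pearson correlation with respect to the weights, takes `sign(w)` as the subgradient of the $\ell_1$ term, and, because the correlation does not change when the weights are rescaled, renormalizes the weight vector to unit length after each step. The renormalization, the function names, and all default parameter values are choices made for this sketch.

```python
import numpy as np

def pearson_and_grad(w, Xc, yc):
    """Pearson correlation of Xc @ w with yc and its gradient in w (centered inputs)."""
    u = Xc @ w
    nu, ny = np.linalg.norm(u), np.linalg.norm(yc)
    rho = (u @ yc) / (nu * ny)
    grad = Xc.T @ (yc / (nu * ny) - rho * u / nu ** 2)
    return rho, grad

def sde_correlation_ascent(X, y, lam=0.05, sigma=0.05, dt=0.05,
                           n_steps=2000, seed=0):
    """Euler-Maruyama steps for dw = grad J(w) dt + sigma dB, J = rho - lam * ||w||_1."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    w = np.ones(p) / np.sqrt(p)               # assumed starting point
    for _ in range(n_steps):
        _, g_rho = pearson_and_grad(w, Xc, yc)
        drift = g_rho - lam * np.sign(w)      # gradient ascent with l1 subgradient
        noise = sigma * np.sqrt(dt) * rng.standard_normal(p)
        w = w + drift * dt + noise
        w /= np.linalg.norm(w)                # sketch-only: fix the free scale of w
    return w
```

Smaller values of `sigma` make the trajectory behave like plain gradient ascent, while larger values widen the exploration at the cost of noisier final weights.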
2.5. Correlation-Based Selection with Binary Weights: SDE Approach
In high-dimensional data analysis, selecting only the variables that have statistically significant relationships with the target variable is critical to increase the interpretability and generalization performance of the model. In this context, feature selection methods play a fundamental role in preventing overfitting, especially in cases with limited numbers of observations. In this study, a stochastic approach is proposed to maximize the Pearson correlation coefficient between the response variable and a selected subset of explanatory variables. The proposed method works on the weight vector $w \in \{0, 1\}^p$, which represents in binary form whether each feature is selected. However, this discrete structure prevents the direct application of classical stochastic differential equation techniques. To overcome this problem, a continuous auxiliary variable vector $z \in \mathbb{R}^p$ is defined, and the evolution of $w$ is indirectly controlled by the stochastic flow on this vector. At each time step, $w_t$ is defined by the thresholding process applied over $z_t$, and thus the system remains in the binary choice space:
$$w_t = \mathbb{1}(z_t > \tau)$$
Here, $\mathbb{1}(\cdot)$ is the indicator function, applied elementwise with the threshold $\tau$. With this approach, a bridge is established between continuously differentiable stochastic processes and the discrete (combinatorial) decision space. In particular, the proposed structure has theoretical parallels with parametric relaxation techniques such as Gumbel–Softmax [17,18] and stochastic gradient Langevin dynamics [19]. In addition, the correlation-based objective function $J(w) = \rho(Xw, Y) - \lambda \|w\|_1$ provides a trade-off that controls both explanatory power and the number of variables. As a result, a novel feature selection method can be established that performs correlation maximization in the binary variable space via continuously defined stochastic differential equations. The process equation, a gradient-based stochastic flow, is as follows:
$$dz_t = \nabla_w J(w_t)\, dt + \sigma\, dB_t$$
Although $\nabla_w J(w_t)$ provides gradient directional information, since $w_t$ is binary, this direction is applied only in $z$ space. At each step, $z_t$ is updated, and $w_t$ is reprojected onto the following vector:
$$w_{t+1} = \mathbb{1}(z_{t+1} > \tau)$$
In this way, the decision vector always remains binary while maintaining a continuous stochastic flow.
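The thresholded flow can be sketched as follows. Several details are assumptions introduced only to make the sketch runnable: the gradient of the correlation is evaluated at the current binary vector, the process starts with every feature selected, an empty selection falls back to each feature's own correlation with the target, and the best subset visited along the trajectory is returned. The function names and default values are hypothetical.

```python
import numpy as np

def objective_and_grad(w, Xc, yc, lam):
    """J(w) = rho(Xc w, yc) - lam * sum(w) and its gradient, for binary w."""
    u = Xc @ w
    nu, ny = np.linalg.norm(u), np.linalg.norm(yc)
    if nu == 0.0:
        # Empty selection: correlation is undefined. Assumed fallback: score 0
        # and a drift given by each feature's own correlation with the target.
        grad_rho = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * ny)
        return 0.0, grad_rho - lam
    rho = (u @ yc) / (nu * ny)
    grad_rho = Xc.T @ (yc / (nu * ny) - rho * u / nu ** 2)
    return rho - lam * w.sum(), grad_rho - lam

def binary_sde_select(X, y, lam=0.05, tau=0.0, sigma=0.1, dt=0.05,
                      n_steps=400, seed=0):
    """Euler-Maruyama flow on z with w = 1(z > tau); returns the best subset seen."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    z = np.full(p, tau + 0.5)                 # assumed start: all features selected
    best_w, best_J = None, -np.inf
    for _ in range(n_steps):
        w = (z > tau).astype(float)           # reprojection onto the binary space
        J, grad = objective_and_grad(w, Xc, yc, lam)
        if J > best_J:
            best_J, best_w = J, w.copy()
        z = z + grad * dt + sigma * np.sqrt(dt) * rng.standard_normal(p)
    return best_w.astype(int), best_J
```

Tracking the best visited subset is a practical safeguard: components of `z` near the threshold can switch on and off between consecutive steps, so the final state alone may not be the best configuration encountered.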
3. A Simulation on Default Risk with Correlation-Based Feature Selection and Rel/Red-Weighted Embedding Approach
This section presents a synthetic data simulation developed to evaluate the effectiveness of correlation-based customized scoring functions. The aim is to optimize variable selection, modeling, and visualization steps on a high-dimensional synthetic dataset by explicitly considering correlational structures. In particular, the focus is on structures that account not only for the relevance of variables to the target but also for the redundancy of information among variables. The proposed method seeks to produce more explainable, less redundant, and statistically significant models by combining classical feature selection and linear modeling approaches with correlation maximization and stochastic differential equation (SDE)-based learning. The simulation scenario is set in a context where a bank observes various financial and behavioral characteristics of applicants to assess default risk in loan applications. The generated synthetic dataset contains 500 observations and 10 variables, including monthly income, requested loan amount, debt-to-income ratio, credit score, payment delays, and years of employment. The target variable, default, indicates whether an individual has failed to make loan repayments. Variables with high correlation to the target and low correlation among themselves were identified through correlation analysis, after which a customized score function was applied to these variables. Feature selection was carried out using both deterministic (greedy) and stochastic (SDE) methods. Rel/Red scores were calculated for the selected variables, and a weighted t-SNE visualization was generated using these ratios. This approach emphasized not only informative variables related to the target but also unique variables that do not duplicate each other’s information, leading to more meaningful separations in the embedding space. 
The results were evaluated by comparing classical logistic regression models with correlation-based score functions, demonstrating the effectiveness of the SDE-supported framework in constructing simple yet statistically robust subspaces. Overall, the proposed simulation aims not only to validate the methodological framework but also to illustrate how correlation-based learning systems can be optimized in terms of explainability, consistency, and information density.
Table 1 summarizes the probability distribution and parameter values used to generate each variable in the synthetic dataset used in the study. The variable definitions were selected in accordance with the application scenario and were structured to reflect realistic financial and demographic characteristics and to establish meaningful relationships with the target variable, default status.
The mean values of each explanatory variable in the default and non-default groups, along with the corresponding p-values from independent samples t-tests, are presented in Table 1. Notably, variables such as debt_ratio, credit_score, loan_amount, and payment_delay exhibit statistically significant differences between the two groups (p < 0.001), indicating their strong discriminative power in relation to credit default behavior. Conversely, several variables, including num_credit_lines and employment_years, show no statistically significant differences, suggesting that they may have limited direct influence on default outcomes within this synthetic context. Overall, these findings reinforce the effectiveness of the correlation-based variable selection strategy, particularly the SDE approach, in identifying features that not only demonstrate statistical relevance but also capture meaningful behavioral patterns associated with default risk.
Table 2 examines whether variables selected using the SDE and greedy selection methods exhibit statistically significant differences between the default groups (default = 0 and default = 1). The p-values obtained from the independent samples t-test applied to each variable demonstrate the differential effect of that variable on default decisions. The results show that all variables selected using both selection methods are significant at the p < 0.05 level. Variables such as payment_delay and debt_ratio exhibited particularly strong differences (p < 0.001), while significant but weaker differences were also observed for variables such as credit_score. These findings indicate that variables selected using both the SDE and conventional methods are correlated with the target variable and provide meaningful information in distinguishing the default group. However, the following analyses demonstrate more clearly that the SDE approach provides a more consistent and explanatory set of variables.
The significance levels between groups were evaluated using a t-test, and the results are summarized in Table 3.
The variables selected through the greedy procedure are reported in Table 4.
When examining the correlations between the variables in the dataset and the target variable (default), it is observed that debt_ratio (+0.36), payment_delay (+0.32), and loan_amount (+0.28) have positive and significant relationships with default risk. In contrast, credit_score exhibits a correlation of −0.42 with the target, indicating a strong negative association and highlighting its high explanatory power in predicting lower default risk.
Figure 1 displays the Pearson correlation coefficients among all numerical variables in the dataset on a color scale from −1 to +1, allowing for quick identification of both extremely high and insignificantly weak relationships before model construction. This visualization allows early identification of variable pairs at risk of multicollinearity so that unnecessary duplicate variables can be eliminated, which reduces model complexity and yields more stable coefficient estimates. It also helps the reader grasp the data structure at a glance.
Figure 2 presents a bar chart (or table) that ranks the correlation of each candidate variable with the target variable (e.g., the “default” case) in order of magnitude. This visualization quantifies the explanatory power of the variables, allowing for objective identification of candidates for pre-selection. This narrows the model’s input pool, highlighting the features most strongly correlated with the target variable, both reducing training time and improving model interpretability. Furthermore, negatively correlated variables displayed on the same scale allow for the easy detection of strong, yet oppositely directed, effects.
Figure 3 tests the principle of minimum redundancy by presenting pairwise correlations between candidate variables in detail (often highlighting only cells in the upper triangle of the matrix or those whose absolute correlation exceeds a threshold). It identifies pairs of variables with high association, so that only one of the variables containing the same information is retained in the selection phase. The result is a set of variables with high Rel/Red ratios, consisting of features strongly related to the target but weakly related to each other, which reduces the risk of overfitting and improves the interpretability of model coefficients.
Additionally, the correlations between the variables and the target are visually presented in the bar chart below. This chart displays the explanatory power of the variables for the default variable in descending order and serves as a guide for subsequent variable selection processes.
In parallel, the correlations among the variables were also examined, revealing a particularly high correlation between loan_amount and debt_ratio (+0.74), which is noteworthy in terms of redundancy.
Based on these findings, Rel/Red scores were calculated in the next stage to identify variables that not only have high explanatory power for the target but also exhibit minimal overlap with other variables.
3.1. Variable Selection with Customized Score Function
Redundancy analysis was conducted to identify highly correlated variable pairs and to reduce the duplication of information between variables. The observation that variables with strong correlations to the target, such as loan_amount and debt_ratio, also exhibit high correlations with each other highlighted the need for a criterion that balances such repetitive structures during the selection process.
In this context, the following ratio was calculated by considering the correlation of each variable with the target (Relevance, Rel) and its average correlation with the other variables (Redundancy, Red): Rel/Red = Relevance / Redundancy.
This score normalizes the contribution of each variable to the target by accounting for its redundancy with other variables. In this way, it becomes possible to select variables that are both meaningful and non-redundant. Furthermore, a customized score function, Rel − λ × Red, where λ is the redundancy penalty coefficient, is defined to enable an advanced correlation-based evaluation.
This function aims to maximize the relationship with the target variable while simultaneously preventing redundancy by penalizing correlations between features. During the simulation process, both this score function and the Rel/Red ratio were taken into account to identify feature combinations with the highest information density, which then formed the basis for the subsequent modeling and visualization steps.
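The two quantities described above can be illustrated with a minimal sketch. It assumes that correlations enter in absolute value and that, for a subset, relevance and redundancy are simple averages; neither detail is confirmed by the text. The data are synthetic and only borrow the variable names used in this section, and the helper names (pearson, relevance, redundancy, rel_red_ratio, custom_score) are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def relevance(features, target, name):
    """Assumed form: absolute correlation of one variable with the target."""
    return abs(pearson(features[name], target))

def redundancy(features, name):
    """Assumed form: mean absolute correlation with all other variables."""
    others = [k for k in features if k != name]
    return sum(abs(pearson(features[name], features[k])) for k in others) / len(others)

def rel_red_ratio(features, target, name, eps=1e-12):
    """Rel/Red ratio for a single variable (eps avoids division by zero)."""
    return relevance(features, target, name) / (redundancy(features, name) + eps)

def custom_score(features, target, subset, lam):
    """Rel - lam * Red for a subset, using subset averages (an assumption)."""
    rel = sum(relevance(features, target, v) for v in subset) / len(subset)
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    red = 0.0
    if pairs:
        red = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return rel - lam * red

# Synthetic toy data, NOT the dataset used in the study.
rng = random.Random(0)
n = 500
loan_amount = [rng.gauss(0, 1) for _ in range(n)]
debt_ratio = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in loan_amount]  # correlated pair
credit_score = [rng.gauss(0, 1) for _ in range(n)]
target = [1 if (-c + 0.5 * d + rng.gauss(0, 1)) > 0.5 else 0
          for c, d in zip(credit_score, debt_ratio)]
features = {"loan_amount": loan_amount, "debt_ratio": debt_ratio,
            "credit_score": credit_score}

for name in features:
    print(name, round(rel_red_ratio(features, target, name), 3))
print(round(custom_score(features, target, ["loan_amount", "debt_ratio"], 0.5), 3))
```

With this construction, a subset containing both loan_amount and debt_ratio is penalized more heavily as λ grows, because the two variables are strongly correlated with each other.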
Figure 4 shows the distribution of the samples in the two-dimensional embedding space according to default status (default = 0 and 1). This approach enables a clearer separation of variables that are both related to the target variable and provide distinct, non-overlapping information.
Table 5 shows how the customized Rel − λ × Red scores change for different values of λ (lambda).
3.2. Correlation Maximization with SDE
To introduce a stochastic dimension into the feature selection process, a stochastic differential equation (SDE) approach targeting correlation maximization was applied. In this method, the parameters were updated randomly, but in a controlled manner, to increase the correlation of each variable with the target variable. The non-deterministic nature of the SDE enabled it to generate more global solutions without becoming trapped in local maxima. The SDE-based optimization process combined a gradient-driven drift term with a diffusion term driven by the Wiener increment dw(t). This formulation allowed for both moving in the direction that increases correlation and exploring various regions of the solution space. The results showed that the variables selected using the SDE method achieved similar correlation performance with a smaller number of variables, producing more interpretable results compared to classical greedy algorithms. In the next stage, these variables were weighted using their Rel/Red ratios and incorporated into the t-SNE visualization.
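A generic way to simulate such a process is the Euler–Maruyama scheme. The sketch below is an illustration under stated assumptions, not the authors' implementation: the objective J(w) is taken to be the Pearson correlation between a linear combination of the features and the target, the drift is a finite-difference gradient of J, and the step size dt, noise level sigma, and all function names are hypothetical choices.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def combo_corr(X, y, w):
    """Objective J(w): correlation between the linear combination Xw and y."""
    z = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    return pearson(z, y)

def num_grad(f, w, h=1e-5):
    """Central finite-difference gradient of f at w."""
    grad = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + h] + w[j + 1:]
        dn = w[:j] + [w[j] - h] + w[j + 1:]
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

def sde_maximize(X, y, steps=300, dt=0.05, sigma=0.1, seed=0):
    """Euler-Maruyama steps: w <- w + grad J(w) dt + sigma sqrt(dt) N(0, 1)."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 0.1) for _ in range(len(X[0]))]
    f = lambda v: combo_corr(X, y, v)
    best_w, best_j = list(w), f(w)
    history = [best_j]  # best objective value seen so far
    for _ in range(steps):
        g = num_grad(f, w)
        w = [wj + gj * dt + sigma * math.sqrt(dt) * rng.gauss(0, 1)
             for wj, gj in zip(w, g)]
        j_val = f(w)
        if j_val > best_j:
            best_w, best_j = list(w), j_val
        history.append(best_j)
    return best_w, best_j, history

# Synthetic toy data: the target depends on the first two of three features.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [2 * r[0] - r[1] + rng.gauss(0, 0.5) for r in X]
w, best, hist = sde_maximize(X, y)
print([round(v, 2) for v in w], round(best, 3))
```

The noise term lets the iterate leave regions where the gradient alone would stall, which is the exploration behavior described above; recording the best value seen so far gives a curve that can only stay flat or rise.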
When Table 6 is examined, the credit_score variable is found to have the strongest negative correlation with the target variable while being only weakly correlated with the other variables, resulting in a high Rel/Red score. Additionally, debt_ratio, payment_delay, and loan_amount were selected by the SDE method despite having more moderate correlations, as these variables collectively provide maximum information density and exhibit minimal overlap with each other. The advantage of SDE in this context is that it selects variables by optimizing the information diversity within the group, rather than relying solely on the individual correlation of a variable. This enables the creation of more stable and interpretable feature sets, particularly in datasets with multiple inter-relationships. As a result, the generalizability of the model increases, and dimensionality reduction techniques yield more meaningful structures.
This theoretical framework demonstrates how the SDE method can produce more balanced results in variable selection by considering both relevance and redundancy. We will now evaluate the performance of this approach by applying it step by step to a synthetically generated dataset.
The change in the score function over the course of the SDE iterations is depicted in Figure 5, highlighting the gradual stabilization of the process.
Moreover, Figure 5 demonstrates that the SDE approach not only improves on the initial score but also converges to the true value by escaping local maxima, thanks to its stochastic structure. In this respect, it differs from deterministic algorithms and enables more interpretable and efficient feature selection.
The graph below shows the effect of the parameter λ (Redundancy penalty coefficient) used in correlation-based variable selection on the customized score function (Relevance − λ × Redundancy).
The sensitivity analysis of the customized score with respect to the λ parameter is depicted in Figure 6, showing how performance varies across different values.
At low λ values (close to 0), the score is highest; in this case, only the correlation with the target variable (relevance) is considered, while the correlation between variables (redundancy) is ignored. As λ increases, the correlation between the selected variables is penalized more heavily, and the total score decreases. The decrease becomes more pronounced at larger λ values, indicating that variables that are less similar to each other, but that may have a lower correlation with the target variable, are favored. This analysis reveals how sensitive the selected variables and the score are to this parameter, and highlights the critical importance of choosing λ in the SDE approach. While high λ values favor variables with lower redundancy, they may sacrifice overall relevance. Therefore, the variable clusters observed in the selection results are a direct reflection of these parameter settings. To assess the performance differences between the resulting variable sets, both relevance/redundancy metrics and AUC performance were examined together, so that the practical implications of these theoretical advantages could be demonstrated.
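The effect of λ on the selected subset can be illustrated with a small exhaustive search. In the sketch below, only the loan_amount and debt_ratio correlation (0.74) comes from the text. Every other number is a hypothetical placeholder chosen for illustration, and the subset score uses simple averages, which is an assumption.

```python
from itertools import combinations

# Hypothetical absolute correlations with the target (illustrative only).
rel = {"credit_score": 0.55, "loan_amount": 0.50,
       "debt_ratio": 0.48, "payment_delay": 0.30}

# Pairwise absolute correlations; only the 0.74 value is taken from the text.
pair = {
    frozenset(("loan_amount", "debt_ratio")): 0.74,
    frozenset(("credit_score", "loan_amount")): 0.20,
    frozenset(("credit_score", "debt_ratio")): 0.15,
    frozenset(("credit_score", "payment_delay")): 0.10,
    frozenset(("loan_amount", "payment_delay")): 0.25,
    frozenset(("debt_ratio", "payment_delay")): 0.30,
}

def score(subset, lam):
    """Mean relevance minus lam times mean pairwise redundancy."""
    r = sum(rel[v] for v in subset) / len(subset)
    pairs = list(combinations(subset, 2))
    red = sum(pair[frozenset(p)] for p in pairs) / len(pairs) if pairs else 0.0
    return r - lam * red

def best_subset(k, lam):
    """Exhaustive search over all subsets of size k."""
    return max(combinations(sorted(rel), k), key=lambda s: score(s, lam))

for lam in (0.0, 0.25, 0.5, 1.0):
    chosen = best_subset(3, lam)
    print(lam, chosen, round(score(chosen, lam), 3))
```

With these placeholder values, the best three-variable subset contains both loan_amount and debt_ratio at λ = 0, while at λ = 0.5 and above the highly correlated pair is split and payment_delay enters instead, at the cost of a lower average relevance.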
Table 7 compares the two methods in terms of model performance (AUC with 95% confidence interval) and the relevance and redundancy metrics of the selected variables. The SDE method obtained higher total and average relevance (Rel), a similar level of average correlation between variables (Red), and a significantly higher Rel/Red score. This shows that SDE not only selects more explanatory variables for the target but also minimizes the duplication of information among the selected variables.
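The text does not state how the 95% confidence interval for the AUC was obtained. One common option is a percentile bootstrap, sketched below under that assumption; the labels and scores are made-up placeholders, and the function names are hypothetical.

```python
import random

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as one half. Labels are assumed to be 0 or 1.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed CI method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y_true[i] for i in idx]
        if 0 < sum(yb) < n:  # a resample needs both classes to define AUC
            stats.append(auc(yb, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Placeholder labels and model scores, for illustration only.
y_demo = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
s_demo = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3, 0.45, 0.55]
print(round(auc(y_demo, s_demo), 3), bootstrap_auc_ci(y_demo, s_demo))
```

Running the same procedure on the predictions obtained from each selected variable set would give intervals that can be compared side by side.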
4. Discussion and Conclusions
We want to know not only how well a model fits the data, but also to what extent it captures the underlying structure. Every learning process should not only achieve high accuracy but also reflect the logic of the decision-making process. Therefore, variable or feature selection is not a random process; rather, it resembles a meaningful attention selection concerned less with what the model predicts and more with how it predicts it. Composite objective functions serve precisely this purpose, allowing multiple requirements to be incorporated into a single optimization framework. In this way, the concept of attention selection is transformed into a quantitative, systematic approach. The foundation of any model lies in statistical dependency, which is why we refer to a mathematical equation that describes such a dependency as a statistical model. In the case of correlation, a strong relationship implies a linear connection between the target and the explanatory variables. When correlation is high, the statistical model tends toward linearity. The prediction error, that is, whether the model’s outputs are accurate, is critical for assessing the model’s meaningfulness. However, this alone is not sufficient. What is often overlooked is the stochastic structure present in the data. While the literature often refers to this as “random noise,” stochastic variations are in fact among the most important elements that give statistical meaning to a model. They remind us that feature selection should not be defined only once, but also evaluated in terms of its robustness over time. This makes the model a witness not only to a single moment but to the entire process. Optimizing these three components, correlation, error, and stochastic stability, together through a balance parameter α is akin to balancing a model. The goal is neither to be captivated solely by the strength of correlation nor to focus exclusively on minimizing error, but to weigh both in the presence of stochastic behavior:
Perhaps every decision carries within it a sense of continuity and an underlying hesitation. Therefore, the proposed method aims to contribute both to the advancement of gradient-based stochastic processes in artificial intelligence research and to a deeper characterization of the model selectability problem. This study makes a distinct contribution by reformulating the feature selection problem within a continuous-time stochastic optimization framework. Unlike previous models that treat variable importance as a static ranking or penalized regression coefficient, we conceptualize feature relevance and redundancy as dynamic entities governed by gradient-diffusion flows. The use of SDEs introduces a probabilistic exploration component that enables the algorithm to search beyond deterministic selection paths, a critical advantage in non-convex or multi-modal relevance landscapes. Moreover, the integration of correlation-driven objectives with time-evolving optimization represents a methodological advance not encountered in the prior literature. By comparing our results with classical greedy baselines and visualizing the impact of Rel/Red-weighted embeddings, we demonstrate that the proposed method not only improves selection quality but also enhances interpretability and structural insight—two aspects often neglected in high-dimensional selection tasks. In the future, this structure can be further enriched with different correlation measures, alternative stochastic flow models, or other selection spaces. The success of a model lies not only in the accuracy of its output but also in the consistency of the process by which its decisions are formed. In this study, a new correlation-based perspective has been introduced to the linear feature selection problem. Going beyond classical optimization approaches, an evolutionary selection mechanism defined by stochastic differential equations (SDEs) has been proposed. 
A customized score function aiming to maximize the Pearson correlation coefficient between the target variable and the linear combination of explanatory variables was defined, and feature selection was performed while encouraging sparsity through a norm penalty. The unique aspect of the model is that it accounts not only for correlation maximization but also for the stochastic consistency and dynamic continuity of the selection process.
In this context, the SDE structure defined in continuous space was transferred to the binary selection space via thresholded projection, thereby combining differentiable optimization with discrete selection decisions. Furthermore, by proposing a composite objective function that balances correlation, stochastic process behavior, and deterministic error measurements, the model optimizes not only prediction performance but also selection stability. Within this framework, the proposed approach redefines the variable selection problem not only in terms of data compatibility but also in accordance with the principles of temporal consistency, structural significance, and robustness to randomness. The developed method extends the boundaries of classical deterministic structures and offers alternative optimization strategies, particularly in high-dimensional, multimodal solution spaces.
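The step from continuous weights to a binary selection can be illustrated with a simple thresholding rule. The exact projection used in the study is not given in this section, so the rule below (scale the absolute weights by their maximum, then keep those at or above a threshold tau) is an assumption, and the names threshold_projection and tau are hypothetical.

```python
def threshold_projection(weights, tau=0.5):
    """Map continuous weights to a 0/1 selection mask.

    A weight is selected when its absolute value, divided by the largest
    absolute weight, is at least tau. This is an assumed rule for
    illustration, not the projection defined in the study.
    """
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    return [1 if abs(w) / scale >= tau else 0 for w in weights]

# Example: continuous weights as they might look after an SDE run.
names = ["credit_score", "loan_amount", "debt_ratio", "payment_delay"]
weights = [-0.9, 0.05, 0.6, 0.1]
mask = threshold_projection(weights, tau=0.5)
selected = [n for n, m in zip(names, mask) if m == 1]
print(mask, selected)
```

Because the rule uses absolute values, a variable with a strong negative weight is selected just as readily as one with a strong positive weight.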
Future work may explore the applicability of the proposed structure to different types of correlations (e.g., Spearman, distance correlation), alternative stochastic process definitions (e.g., entropy-based diffusions), and more complex configurations (e.g., selection flows on multilayer networks). In addition, the method can be evaluated on applied datasets, offering new contributions to the explainability and decision security of artificial learning systems.
In this respect, the proposed approach offers a novel synthesis in both the correlation-based variable selection literature [11,20] and in the application of stochastic optimization methods to attribute selection [21]. Consequently, the study contributes to the literature by providing both theoretical generalization and methodological diversification in the field of variable selection.
The proposed method also contributes to the broader family of stochastic optimization strategies commonly used in feature selection and parameter tuning. Techniques such as Simulated Annealing (SA), Stochastic Gradient Descent (SGD), and Bayesian Optimization have been extensively applied in high-dimensional model calibration. While SA excels in global search via temperature-based exploration and SGD enables fast updates through gradient noise, both typically operate in discrete or heuristic frameworks. Bayesian optimization, in contrast, constructs a probabilistic surrogate model to guide exploration, but can be computationally intensive and sensitive to kernel assumptions. Compared to these, the SDE-based approach introduced in this study offers a continuous-time, interpretable dynamic system that balances relevance-driven gradients with random exploration via diffusion terms. This formulation is particularly beneficial in domains such as finance, where the underlying data-generating processes are noisy, highly volatile, and subject to latent structural shifts. The method’s ability to explore feature space adaptively under uncertainty aligns well with the stochastic nature of financial data, making it a promising candidate for robust variable selection in risk-sensitive decision environments.