Mathematics
  • Article
  • Open Access

27 June 2024

A New Alternating Suboptimal Dynamic Programming Algorithm with Applications for Feature Selection

Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška Cesta 46, SI-2000 Maribor, Slovenia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Dynamic Programming

Abstract

Feature selection is predominantly used in machine learning tasks, such as classification, regression, and clustering. It selects a subset of features (relevant attributes of data points) from a larger set that contributes as optimally as possible to the informativeness of the model. There are exponentially many subsets of a given set, and thus, the exhaustive search approach is only practical for problems with at most a few dozen features. In the past, there have been attempts to reduce the search space using dynamic programming. However, models that consider similarity in pairs of features alongside the quality of individual features do not provide the required optimal substructure. As a result, algorithms, which we will call suboptimal dynamic programming algorithms, find a solution that may deviate significantly from the optimal one. In this paper, we propose an iterative dynamic programming algorithm, which inverts the order of feature processing in each iteration. Such an alternating approach allows for improving the optimization function by using the score from the previous iteration to estimate the contribution of unprocessed features. The iterative process is proven to converge and terminates when the solution does not change in three successive iterations or when the number of iterations reaches the threshold. Results in more than 95% of tests align with those of the exhaustive search approach, being competitive and often superior to the reference greedy approach. Validation was carried out by comparing the scores of output feature subsets and examining the accuracy of different classifiers learned on these features across nine real-world applications, considering different scenarios with various numbers of features and samples. In the context of feature selection, the proposed algorithm can be characterized as a robust filter method that can improve machine learning models regardless of dataset size. However, we expect that the idea of alternating suboptimal optimization will soon be generalized to tasks beyond feature selection.

1. Introduction

Nowadays, in the era of the Internet of Things, social media platforms, Earth observation, crowdsourcing, medical imaging equipment, various biomedical signal measurement devices, wearable sensors, digital twins, etc., we are flooded with vast amounts of data. The measurable characteristics used as attributes or input variables to describe an object of interest are called features, while individual data points (objects of interest) represent feature vectors. Each feature thus corresponds to a dimension in the vector. Vast amounts of data allow the creation of a large repertoire of features, but usually, not all features are relevant for further processing objects of interest. They often slow down the model, direct it towards a wrong solution, or even make reaching a solution infeasible. In order to address these challenges, feature selection approaches were introduced that select a subset of the most informative features while discarding irrelevant or redundant ones []. Feature selection plays a vital role in model construction in statistical analysis, dimensionality reduction, signal processing, pattern recognition, data visualization, and, particularly, in various machine learning tasks, such as classification, regression, and clustering. Its aim is to improve the model’s performance, including its accuracy, generalizability, and interpretability, and reduce overfitting and computational cost [].
Feature selection methods can be grouped into three categories []. Filter methods evaluate candidate subsets with independent criteria that exploit essential characteristics of the training data. They are fast, but the solution may deviate significantly from the optimal one. A wrapper approach uses a learning algorithm for subset evaluation, such as a classifier or regressor. Its performance is usually better but also much slower than the filter approach. Embedded methods interact with a learning algorithm but at a lower computational cost than the wrapper approach. They use independent criteria to identify optimal subsets for a known cardinality. The learning algorithm is then used to select the final optimal subset across different cardinalities [].
Regardless of the approach chosen, feature selection can be viewed as an optimization problem as it searches for the best-evaluated feature subset []. Different search strategies can be used, including sequential search (greedy approach), exponential search (exhaustive search, beam search, or branch and bound), and random search []. Conversely, dynamic programming (DP) is not as commonly applied to feature selection as other methods. This popular optimization approach breaks a problem into smaller subproblems and uses their solutions to construct the solution to the larger problem. An optimal solution can be found if the problem exhibits optimal substructure. This means that an optimal solution to the problem contains optimal solutions to subproblems [,]. However, DP is usually computationally demanding, so for reasons of feasibility, acceptable speed, and the ability to handle problems with higher dimensionality, it is also required that the number of subproblems is not too high and that the subproblems overlap, suggesting that it makes sense to record their solutions in a table and reuse them [].
In this paper, we highlight the possibilities of using DP in feature selection, analyze the difficulties of existing (rare) approaches, and propose alternative solutions. An evaluation criterion based on feature quality, correlation, and/or statistics does not generally provide an optimal substructure since, e.g., the union of two optimal subsets is not necessarily optimal due to possible high correlations between pairs of features, one from each subset. It is possible to achieve an optimal solution for specific problems by adapting the evaluation criterion, but this spoils generality (e.g., wrappers or embedded selection methods are tied to specific machine learning models and prone to overfitting []), which is among our primary goals. We thus focused on finding the best possible suboptimal solution. We studied approximate (ADP) [] and iterative [] dynamic programming (IDP) methods and developed a solution that we called alternating suboptimal dynamic programming (ASDP). It inverts the order of feature processing in each iteration and improves the optimization criterion by using the score from the previous iteration to estimate the contribution of unprocessed features. Its contributions are as follows:
  • A better or at least the same evaluation score of the final solution set compared to the score after a single iteration. Furthermore, the solution found in each iteration is never worse than the one found in the previous iteration.
  • An optimal solution according to the evaluation score is found in more than 95% of cases.
  • Polynomial worst-case time complexity (O(n⁴)) allows significantly larger input feature sets to be considered compared to the exhaustive search approach.
  • Comparable and, in some cases, better classification accuracy on the basis of the feature set selected by the new method than when using our previous graph-based greedy feature selection method. In this respect, we have already demonstrated the competitiveness of the latter in our previous work [] compared to state-of-the-art classification approaches and applied feature selection methods.
The rest of the paper is structured as follows. In Section 2, we survey existing solutions in feature evaluation and selection, the use of DP in feature selection, and suboptimal DP algorithms. In the most research-intensive Section 3, we first summarise our preliminary filter method for feature selection based on graph cuts, which can be used alone or as a preprocessing for the new alternating suboptimal DP method presented afterward. In Section 4, we show and analyze the results, and, finally, in Section 5, we discuss the work carried out, its strengths, and some weaknesses that pose challenges for future research.

3. Materials and Methods

In this section, we present a new method for feature selection based on suboptimal DP. There are exponentially many subsets of a given feature set, all of which are candidates for the feature selection solution, so the exhaustive search approach is only practically applicable to problems with a few dozen features. Our method processes, e.g., 200 features in 5 s, but for larger input sets, it makes sense to preprocess the features with some faster filtering. We use our efficient and reliable graph-cut-based feature selection [], summarised in Section 3.1. In Section 3.2, we discuss the idea of using DP and the encountered difficulties and introduce an iterative suboptimal alternating solution, where the order of feature processing is inverted in each iteration. We conclude the section with a proof of convergence and a theoretical analysis of time and space complexity.

3.1. Graph-Cut-Based Feature Selection

While wrapper feature selection methods, like sequential search, nature-inspired algorithms, or binary teaching–learning-based approaches, bypass the need for explicit feature evaluation to yield results that are close to optimal, their effectiveness is tied to the specific classification model being used. Additionally, these methods are highly computationally intensive, which can limit their applicability. Similarly, embedded methods incorporate an iterative cycle of evaluating and selecting features as a part of the model training process, which can also demand significant computational resources. Furthermore, the performance of embedded methods is likewise influenced by the choice of the classification model. As an alternative to the discussed wrapper and embedded feature selection techniques, as well as those filter methods that are unable to deal with correlated features, in this section, we present the graph-cut-based feature selection strategy outlined in our work [] that enables the selection of a subset of high-quality dissimilar features while providing superior results. Depending on the defined feature estimation measurement, it can be used for both classification and regression purposes. Graph vertices represent features with associated weights that define their quality (as proposed in []), while graph edge weights define similarities between them. The method relies on two input parameters, T_Δ and T_p, used for graph definition. The former defines the necessary level of feature quality (i.e., the maximal allowed class overlap) for a feature to be included in the output feature space, and the latter determines the minimal level of dissimilarity between selected features.
Let FS denote an input feature space FS = {f_i}. A feature f_i, referred to by an index i ∈ [1, n], is given as a mapping function f_i : Z → R. An index m ∈ [1, M] refers to a sample, i.e., a feature vector defined as x_m = [f_{i,m}]. An undirected graph used for feature selection is defined as G = (F, E), where the set of vertices F is defined as F = {f_i ∈ FS; Δ(f_i) ≤ T_Δ}, while the unordered set of edges E = {e_{i,j}; P(e_{i,j}) ≥ T_p} is given by e_{i,j} = (f_i, f_j) for all f_i, f_j ∈ F, such that i ≠ j. A vertex-weighting function is given by Δ(f_i), as defined in [], and the edge-weighting function is given by the absolute Pearson correlation coefficient P : E → [0, 1], formally described by Equation (1).
P(e_{i,j}) = | Σ_{m=1}^{M} (f_{i,m} − μ_i) (f_{j,m} − μ_j) / (M σ_i σ_j) |,
where μ_i denotes the mean of the values of feature f_i, while its standard deviation σ_i is defined as σ_i = √((1/M) Σ_{m=1}^{M} (f_{i,m} − μ_i)²). Both functions, Δ and P, are designed such that lower values (closer to 0) are more favorable for selection than higher values (closer to 1).
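For illustration, Equation (1) could be computed as follows; this is a minimal sketch (not the authors' code), assuming each feature is given as a NumPy array of its M sample values:

import numpy as np

def edge_weight(f_i: np.ndarray, f_j: np.ndarray) -> float:
    """Absolute Pearson correlation P(e_ij) in [0, 1], Equation (1)."""
    mu_i, mu_j = f_i.mean(), f_j.mean()
    sigma_i, sigma_j = f_i.std(), f_j.std()   # population std, i.e., 1/M normalisation
    m = len(f_i)
    cov = np.sum((f_i - mu_i) * (f_j - mu_j)) / m
    return abs(cov / (sigma_i * sigma_j))     # constant features (sigma = 0) would need guarding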
According to the theoretical framework introduced in [], we use the following definitions of elementary properties:
  • Vertices f_i ∈ F and f_j ∈ F are adjacent in a graph G if there exists an edge e_{i,j} ∈ E.
  • A path from f_{i_0} to f_{i_N} is an ordered sequence of vertices π_{i_0,i_N} = ⟨f_{i_0}, f_{i_1}, …, f_{i_N}⟩, such that f_{i_j} and f_{i_{j+1}} are adjacent for all j ∈ [0, N − 1].
  • A graph G is connected if for all f_i, f_j ∈ F there exists a path π_{i,j}.
  • A graph G′ = (F′, E′) is a subgraph of G if F′ ⊆ F and E′ ⊆ E.
  • A neighbourhood Z(f_i) of a vertex f_i in graph G is the subset of vertices of F defined by all the adjacent vertices of f_i, namely, Z(f_i) = {f_j}; f_j ∈ F; ∃ e_{i,j}, where i ≠ j.
We say that a set of vertices CUT(F) ⊂ F is a vertex-cut if its removal separates graph G into at least two non-empty, pairwise disconnected components. Obviously, Z(f_i) is a vertex-cut, as it separates a singleton {f_i} (i.e., an individual vertex) from the rest of the graph, thus creating a subgraph G′ = (F′, E′), whose vertex- and edge-sets are given formally by Equation (2).
F′ = F \ (Z(f_i) ∪ {f_i}),  E′ = {e_{h,l} ∈ E; f_h, f_l ∈ F′ and h ≠ l}.
An example of vertex-cut feature selection is presented in Figure 1. Figure 1a shows an undirected graph G = (F, E), constructed over a set of features FS = {f_1, f_2, …, f_9}, with thresholds T_Δ = 0.6 and T_p = 0.6 applied on the associated vertex- and edge-weighting functions Δ and P, accordingly. To ensure the preservation of the overall informativeness of selected features, the feature of the highest quality f̂_r = arg min_{f_m ∈ G} Δ(f_m) is selected first by a vertex-cut of its neighborhood Z(f̂_r). The selected feature f_6 is colored green. All highly correlated adjacent features Z(f_6) = {f_2, f_3, f_8} are marked red and removed from G. This results in G′, as defined by Equation (2), and a disconnected singleton {f_6} (see Figure 1b). The same process is then repeated on G′, separating the feature of the highest quality, namely f_1, from the remaining graph G″ by removal of Z(f_1) = {f_4, f_7}. The final cut is performed on the graph G″, separating f_5 (in green) from the remaining (empty) graph G‴ by removal of Z(f_5) = {f_9} (in red), as shown in Figure 1c. Thus, the output subset of high-quality dissimilar features, namely {f_1, f_5, f_6}, is obtained, as shown in Figure 1d.
Figure 1. Vertex-cut-based feature selection: (a) graph G, where the feature of the highest quality (coloured green) is selected and its neighbourhood (red) is removed; (b) repeating the same procedure on subgraph G′; and (c) subgraph G″. (d) The output result {f_1, f_5, f_6} (in green) is obtained.
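The selection loop described above can be sketched as follows. This is an illustration under the stated assumptions (vertices satisfy Δ(f_i) ≤ T_Δ, edges connect features with P(e_{i,j}) ≥ T_p, lower Δ means higher quality); the identifiers are ours, not those of the published implementation:

import numpy as np

def graph_fs(delta: np.ndarray, P: np.ndarray, t_delta: float, t_p: float) -> list[int]:
    """Greedy vertex-cut selection: repeatedly keep the best feature and
    remove its highly correlated neighbourhood, as in Figure 1."""
    n = len(delta)
    remaining = {i for i in range(n) if delta[i] <= t_delta}   # graph vertices
    selected = []
    while remaining:
        best = min(remaining, key=lambda i: delta[i])          # highest quality
        selected.append(best)
        # Vertex-cut Z(f_best): neighbours correlated above T_p are removed.
        neighbourhood = {j for j in remaining if j != best and P[best, j] >= t_p}
        remaining -= neighbourhood | {best}
    return selected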

3.2. New Suboptimal Dynamic Programming Algorithm

The new method combines the advantages of iterative and approximate dynamic programming. It does not seek a global optimum but instead adopts a suboptimal (approximate) solution, which it iteratively improves. It is based on a graph like the graph-cut-based filtering from Section 3.1. We thus use the same notation, but we will extend it throughout this subsection with additional algorithm parameters and graph vertex attributes. The graph is undirected, i.e., P(e_{i,j}) = P(e_{j,i}). The input is the feature set FS = {f_i}, 0 < i ≤ n, which is processed in index order, i.e., from f_1 to f_n, so we will sometimes also speak of a sequence of features. At both ends of this sequence, the guard vertices f_0 and f_{n+1} are added, which do not change during the execution of the algorithm, but they simplify the implementation. There is no edge between the two guards, while the guard vertices and the edges between a guard and any other vertex are given weights 0. We stress this in the form of Equation (3).
Δ(f_0) = 0,  Δ(f_{n+1}) = 0,  P(e_{0,n+1}) = ∞,  P(e_{0,i}) = 0, 0 < i ≤ n,  P(e_{i,n+1}) = 0, 0 < i ≤ n.
Each graph vertex f_i contains, in addition to the weight Δ(f_i), a set S_i that stores the “optimal” subset (feature selection result) of the vertices already processed, and the score s_i of this subset, which is obtained by the evaluation criterion. Their initialization is described by Equation (4) and is important for the convergence proof in Section 3.3. The evaluation criterion described in Equation (5) seeks a minimum for all vertices except the guards, i.e., 0 < i ≤ n.
S_i = ∅, 0 ≤ i ≤ n + 1,  s_i = 0, 0 ≤ i ≤ n + 1.
s_i = min_{0 ≤ j < i} ( s_j + Δ(f_i) + Σ_{k ∈ S_j} P(e_{k,i}) )
Let r be the value of j where the minimum was identified. The corresponding S_i is calculated by Equation (6).
S_i = S_r ∪ {i}
The final score (score) and the feature selection result (Solution) are given by Equation (7).
score = min_{0 < i ≤ n} s_i,  solution = i, where score was found,  Solution = S_solution.
Figure 2a shows the situation immediately before Equations (5) and (6) are applied to vertex f_i, and Figure 2b shows the situation immediately after the equations are applied. Green indicates the graph vertices that have already been processed, and white indicates those that are being or will be processed. The red text indicates vertex attributes modified during the observed f_i processing.
Figure 2. The concept of feature selection based on dynamic programming: (a) the partial solution to be stored in f_i considers the solutions stored in all its predecessors; (b) the situation after updating the status of f_i. S_i and s_i are calculated with Equations (6) and (5), respectively.
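To make the recurrence concrete, the following is a minimal one-pass sketch of Equations (5)-(7) in Python (an illustration, not the authors' implementation; identifiers are ours). Index 0 plays the role of the guard f_0 with an empty partial solution:

def sdp_one_pass(delta, P):
    """delta[1..n]: vertex weights, delta[0] = 0 is the guard f_0;
    P[k][i]: edge weight between features k and i."""
    n = len(delta) - 1
    s = [0.0] * (n + 1)                     # s_0 = 0
    S = [set() for _ in range(n + 1)]       # S_0 = {}
    for i in range(1, n + 1):
        cand = lambda j: s[j] + delta[i] + sum(P[k][i] for k in S[j])   # Equation (5)
        r = min(range(i), key=cand)
        s[i], S[i] = cand(r), S[r] | {i}                                # Equation (6)
    best = min(range(1, n + 1), key=lambda i: s[i])                     # Equation (7)
    return s[best], S[best]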
To date, everything seems straightforward, but there are, in fact, three serious problems in the process that need to be addressed. The first is that the importance of vertices and edges might differ. For this reason, we introduce a weight w, 0 ≤ w ≤ 1. This modifies the evaluation criterion Equation (5) into Equation (8).
s_i = min_{0 ≤ j < i} ( s_j + w · Δ(f_i) + (1 − w) · Σ_{k ∈ S_j} P(e_{k,i}) )
The second problem is that Equation (5) in its present form always leads to the trivial solution of Equation (9). Since the weights of the graph vertices and edges are all non-negative, the minimum consists of a single vertex (without incident edges) with the lowest weight.
score = min_{0 < j ≤ n} ( w · Δ(f_j) )
To prevent this, we first modified the model by replacing the decreasing vertex evaluation function Δ with the increasing Δ′(f) = 1 − Δ(f). The idea was to reward high vertex weights and penalize high edge weights. This resulted in the optimization function Equation (10):
s_i = max_{0 ≤ j < i} ( s_j + w · Δ′(f_i) − (1 − w) · Σ_{k ∈ S_j} P(e_{k,i}) ),
which does not tend towards the trivial solution. However, to retain complementarity with the graph-cut-based method, we preferred to choose an alternative approach, which decrements all vertex and edge weights (except those of guards and their incident edges) by user-defined non-negative values shft_Δ and shft_P, respectively (see Equation (11)). Furthermore, these two additional parameters provide new possibilities for tuning, as demonstrated in Section 4.
s_i = min_{0 ≤ j < i} ( s_j + w · (Δ(f_i) − shft_Δ) + (1 − w) · Σ_{k ∈ S_j} (P(e_{k,i}) − shft_P) )
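Under the same conventions as the one-pass sketch above, the candidate expression inside the min of Equation (11) would become the following (a hedged illustration; w, shft_d, and shft_p stand for w, shft_Δ, and shft_P):

def candidate(j, i, s, S, delta, P, w, shft_d, shft_p):
    # Equation (11): weighted vertex term plus shifted edge terms.
    return (s[j] + w * (delta[i] - shft_d)
            + (1 - w) * sum(P[k][i] - shft_p for k in S[j]))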
The third problem is the most demanding. Even if all partial solutions S_j, 0 < j < i, were optimal, there is no guarantee that this will be the case after adding f_i to any of these solutions. It is enough that f_i is over-correlated with a single feature from each S_j, and the optimum will likely be missed. In other words, optimization defined in this way does not guarantee an optimal substructure, one of the two fundamental assumptions of dynamic programming, along with overlapping subproblems []. Of course, when considering f_i, we can no longer refresh its predecessors’ attributes S_j and s_j. We tried to mitigate this problem by extending the evaluation criterion by predicting the contribution of vertices not yet visited and, most importantly, considering the correlation between the visited and predicted parts. The need to predict the contribution of unvisited nodes led us to a simple idea, which later turned out to be very successful, namely to reverse the graph traversing direction after arriving at f_n. As G is an undirected graph, the status from the previous traversal can simply be used to estimate the score s_i and partial solution S_i. The updated evaluation criterion is given by Equation (12).
s_fwd = min_{n+1 ≥ j > i} ( s_j + s_i + (1 − w) · Σ_{k ∈ S_j, h ∈ S_i} (P(e_{k,h}) − shft_P) )
When the reverse traversal reaches f_1, the direction of visiting the vertices is inverted again. The evaluation criterion Equation (12) is slightly modified into Equation (13), corresponding to the forward direction from f_1 towards f_n. The only difference between the two equations is, of course, the direction and boundaries of the vertices’ traversal, written under the min function label.
s_fwd = min_{0 ≤ j < i} ( s_j + s_i + (1 − w) · Σ_{k ∈ S_j, h ∈ S_i} (P(e_{k,h}) − shft_P) )
The modified evaluation criterion significantly impacts the choice of vertex f_r (r is the value of j providing the minimum) and thus indirectly affects the calculation of s_i and S_i. Let r be the value of j in Equation (12) or (13) where the minimum was identified. The score s_i is then calculated by using Equation (14), while Equation (6), representing the solution subset S_i, remains applicable.
s_i = s_r + w · (Δ(f_i) − shft_Δ) + (1 − w) · Σ_{k ∈ S_r} (P(e_{k,i}) − shft_P)
However, s_i and S_i should not be directly refreshed by s_fwd and S_r ∪ S_i, since in the treatment of subsequent vertices, we assume that s_i and S_i can only refer to vertices that were visited before f_i in the current iteration. Conversely, it would be a pity not to make better use of the great potential that Equations (12) and (13) certainly have. Fortunately, they can be used to predict the attributes of another vertex instead of f_i, namely f_{i_end}, which represents the last vertex in the set S_i (the one with the lowest index in the reverse direction traversal or with the highest index in the forward traversal). However, we should not update s_{i_end} and S_{i_end} when we process f_i because we will need the values from the previous iteration when we process f_{i_end} later. As a consequence, we extend each vertex f_k with additional attributes pr(s_k) and pr(S_k) (pr stands for prediction), which store the aforementioned estimates of the score and the solution set. At the beginning of each iteration, the initialization pr(s_k) = ∞, 0 < k ≤ n, is performed. Algorithm 1 shows the processing of vertex f_i, which is further explained in Figure 3. For simplicity, we assume that all the variables in Algorithm 1 are global, except i and forward. The score s_i is determined as the minimum between the previously stored pr(s_i) and the s_i computed by Equation (14). In the former case, the set pr(S_i) is assigned to S_i, while in the latter case, S_i is determined by Equation (6). Note that pr(s_i) and pr(S_i) can be refreshed multiple times in the same iteration since multiple partial solutions S_i at different i can terminate with the same vertex f_{i_end}.
Algorithm 1 Processing a Considered Graph Vertex
1: function ProcessVertex(i, forward)
2:     if forward then ▹ Forward direction graph traversal.
3:         i_end = max_{f_k ∈ S_i} k;
4:         s_fwd = min_{0 ≤ j < i} ( s_j + s_i + (1 − w) · Σ_{k ∈ S_j, h ∈ S_i} (P(e_{k,h}) − shft_P) ); ▹ (13)
5:     else ▹ Reverse direction graph traversal.
6:         i_end = min_{f_k ∈ S_i} k;
7:         s_fwd = min_{n+1 ≥ j > i} ( s_j + s_i + (1 − w) · Σ_{k ∈ S_j, h ∈ S_i} (P(e_{k,h}) − shft_P) ); ▹ (12)
8:     end if
9:     r = the value of j where the minimum in line 4 or 7 was achieved;
10:    s_i = s_r + w · (Δ(f_i) − shft_Δ) + (1 − w) · Σ_{k ∈ S_r} (P(e_{k,i}) − shft_P); ▹ (14)
11:    S_i = S_r ∪ {i}; ▹ (6)
12:    if pr(s_i) < s_i then ▹ Update the vertex with predictions from the same iteration.
13:        s_i = pr(s_i);
14:        S_i = pr(S_i);
15:    end if
16:    if s_fwd < pr(s_{i_end}) then ▹ Update the predictions of a not yet processed f_{i_end}.
17:        pr(s_{i_end}) = s_fwd;
18:        pr(S_{i_end}) = S_r ∪ S_i;
19:    end if
20:    return ▹ No value returned; all the variables are global, except i and forward.
21: end function
Figure 3. The concept of feature selection based on alternating suboptimal dynamic programming: the situation (a) before processing f_i during the reverse direction traversal; (b) after processing f_i during the reverse direction traversal; (c) before processing f_i during the forward direction traversal; and (d) after processing f_i during the forward direction traversal.
Figure 3a,b show the situation immediately before and after Equations (6), (12) and (14) are applied to vertex f_i, respectively. The graph traversal is performed in the reverse direction. The obvious difference from the straightforward non-iterative solution in Figure 2 is that here, S_i does not contain only the initial vertex f_i, but the partial solution from the previous iteration instead. As a consequence, there is a double loop in the sum calculation. The green color indicates the graph vertices that have already been processed in the observed iteration, and the yellow color indicates those that were processed in the previous iteration (and are or will be processed later in the current iteration). Note that these yellow vertices contain the predictions (colored cyan), which might have been updated earlier in the ongoing iteration. The red text indicates vertex attributes modified during the observed f_i processing. Analogously, Figure 3c,d show the processing of vertex f_i when the graph is traversed in the forward direction. Equation (13) replaces Equation (12) in this case.
The pseudocode in Algorithm 2 describes the overall structure of the alternating suboptimal dynamic programming method for feature selection. As mentioned, 200 features can still be processed relatively fast, but for larger input sets, it makes sense to preprocess the features with graph-cut-based feature selection filtering (line 2). The initialization in line 3 sets up the guard vertices using Equation (3). Candidate partial solution sets and their scores are initialized using Equation (4), which is needed in lines 4, 7 and 11 of Algorithm 1 within the first-iteration calls of ProcessVertex (line 11 of Algorithm 2). The value finalScore is set to some high value (∞) to provide the first comparison in line 16, and maxIterations is set to a user-defined value or the default 100. In line 8, all predicted scores are set to a high value (∞) at the beginning of each iteration, which is needed in line 16 of Algorithm 1. The main work is done in the ProcessVertex function, which is called sequentially in line 11 for each feature f_i except for the guard vertices. The direction of traversing the features is inverted in each iteration (line 23). The process terminates when the identical score is obtained three times in a row, or the number of iterations reaches maxIterations (line 24). If there are two (or more) solutions with the same score, the algorithm may find one during the forward direction traversal and a different one in the reverse direction traversal. In this case, it will return the last of the two solutions found.
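To make Algorithms 1 and 2 concrete, the following is a compact, self-contained Python sketch of ASDP written for this text. It is an illustration, not the authors' C++ implementation: it follows Equations (6) and (11)-(14) and the pseudocode, uses the warm-start initialization of Equation (15) from Section 3.3 (applied uniformly from i = 1 via the empty guard solution), and all identifiers are ours.

import math

def asdp(delta, P, w=0.5, shft_d=0.0, shft_p=0.0, max_iterations=100):
    """delta: Delta(f_i) for i = 0..n+1 (guards 0 and n+1 carry weight 0);
    P: (n+2) x (n+2) symmetric edge-weight matrix (guard rows/columns 0)."""
    n = len(delta) - 2
    s = [0.0] * (n + 2)
    S = [set() for _ in range(n + 2)]
    for i in range(1, n + 1):                        # warm start, Equation (15)
        s[i] = (s[i - 1] + w * (delta[i] - shft_d)
                + (1 - w) * sum(P[k][i] - shft_p for k in S[i - 1]))
        S[i] = S[i - 1] | {i}
    pr_s = [math.inf] * (n + 2)                      # predictions pr(s_k), pr(S_k)
    pr_S = [set() for _ in range(n + 2)]

    def process_vertex(i, forward):                  # Algorithm 1
        if forward:
            i_end, js = max(S[i]), range(0, i)       # Equation (13)
        else:
            i_end, js = min(S[i]), range(i + 1, n + 2)   # Equation (12)
        cross = lambda j: sum(P[k][h] - shft_p for k in S[j] for h in S[i])
        r = min(js, key=lambda j: s[j] + s[i] + (1 - w) * cross(j))
        s_fwd = s[r] + s[i] + (1 - w) * cross(r)
        s[i] = (s[r] + w * (delta[i] - shft_d)       # Equation (14)
                + (1 - w) * sum(P[k][i] - shft_p for k in S[r]))
        S[i] = S[r] | {i}                            # Equation (6)
        if pr_s[i] < s[i]:                           # prediction stored earlier this iteration
            s[i], S[i] = pr_s[i], pr_S[i]
        if s_fwd < pr_s[i_end]:                      # predict the not yet processed f_{i_end}
            pr_s[i_end], pr_S[i_end] = s_fwd, S[r] | S[i]

    final_score, solution = math.inf, set()
    iteration, repeated, forward = 0, 0, True
    while iteration < max_iterations and repeated < 3:    # Algorithm 2
        for i in range(1, n + 1):
            pr_s[i] = math.inf
        for i in (range(1, n + 1) if forward else range(n, 0, -1)):
            process_vertex(i, forward)
        score = min(s[1:n + 1])                      # Equation (7)
        solution = set(S[1 + s[1:n + 1].index(score)])
        if score < final_score:
            final_score, repeated = score, 0
        else:
            repeated += 1
        iteration, forward = iteration + 1, not forward
    return final_score, solution

A design note: the guards make the previous iteration's score s_i a candidate again (j = 0 or j = n + 1 contribute s_j = 0 and S_j = ∅), which is exactly the mechanism behind the convergence argument of Proposition 1 below.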

3.3. Convergence and Complexity Analysis

The solution found is generally suboptimal but often better than that found in the one-pass method, as will be confirmed by the results in the next section. In any case, the solution after several passes is not worse than the one-pass solution since the result can only improve from iteration to iteration or remain unchanged (after three consecutive such iterations, the algorithm terminates), which is confirmed by Proposition 1 below.
Proposition 1. 
The score in each iteration of the proposed alternating suboptimal dynamic programming algorithm can only be lower (better) or equal to the score in the previous iteration but never higher (worse).
Algorithm 2 Alternating Suboptimal Dynamic Programming
1: function ASDP(Δ, P, n)
2:     (Δ, P, n) = GraphCutBasedFeatureSelection(Δ, P, n); ▹ Optional filtering
3:     (Δ, P, s, S, finalScore, maxIterations) = Init(Δ, P, n);
4:     solutionRepeated = 0;  iteration = 0;
5:     start = 1;  end = n;
6:     repeat ▹ iterations of ASDP
7:         for i ← start to end do ▹ for all features
8:             pr(s_i) = ∞;
9:         end for
10:        for i ← start to end do ▹ for all features
11:            ProcessVertex(i, start < end);
12:        end for
13:        score = min_{0 < i ≤ n} s_i; ▹ (7): this and the next two lines
14:        solution = i, where score was found;
15:        Solution = S_solution;
16:        if score < finalScore then
17:            finalScore = score;
18:            solutionRepeated = 0;
19:        else
20:            solutionRepeated = solutionRepeated + 1;
21:        end if
22:        iteration = iteration + 1;
23:        (start, end) = Swap(start, end);
24:    until (iteration = maxIterations) ∨ (solutionRepeated = 3);
25:    return (finalScore, Solution)
26: end function
Proof 
The proof is conceptually straightforward since we will show that the score s_i from the previous iteration is also considered a candidate for the minimum in the observed iteration. Namely, this score is obtained in the evaluation criterion in line 4 of Algorithm 1 at j = 0 or in line 7 at j = n + 1. The algorithm does not modify the parameters of the two guards, so s_j = 0 and S_j = ∅ in both cases. Consequently, only s_i remains from the expression on the right of Equation (13) or (12). If s_i is also the minimum in the current iteration, then s_fwd = s_i will be written first to pr(s_{i_end}) in line 17 of Algorithm 1, then to s_i in line 13 of Algorithm 1, to score in line 13 of Algorithm 2, and finally to finalScore in line 17 of Algorithm 2. Conversely, if s_i is not the minimum in the current iteration, then it can only be replaced with a lower score in some of the aforementioned lines of Algorithm 1 or Algorithm 2. This completes the proof. □
Based on Proposition 1, it makes sense to modify the initialization (line 3 of Algorithm 2). The proven convergence allows us to use the input feature set instead of the empty set as an initial solution candidate. Equation (15) introduces a recursive definition of initial values, which replaces Equation (4). Note that the last two lines of Equation (15) were derived from Equations (6) and (14) by setting r = i 1 .
S_0 = S_{n+1} = ∅,  s_0 = s_{n+1} = 0,
S_1 = {1},  s_1 = Δ(f_1),
S_i = S_{i−1} ∪ {i},  2 ≤ i ≤ n,
s_i = s_{i−1} + w · (Δ(f_i) − shft_Δ) + (1 − w) · Σ_{k ∈ S_{i−1}} (P(e_{k,i}) − shft_P),  2 ≤ i ≤ n.
Propositions 2–4 consider the time and space complexity of the graph-cut-based and the alternating suboptimal dynamic programming feature selection approaches.
Proposition 2. 
The graph-cut-based feature selection method has the worst-case time complexity O(n²), where n is the number of features, i.e., graph vertices.
Proof 
The algorithm gradually selects features f̂_r with the highest quality, which requires at most O(n) steps. In each step, a neighborhood Z(f̂_r) is considered, which contains at most O(n) features. This results in O(n) · O(n) = O(n²) worst-case time complexity. Note that the method removes the considered features and their highly correlated neighborhoods from the graph G in each step and, consequently, the expected time complexity is much closer to O(n · log n), which corresponds to sorting the vertices according to their qualities. □
Proposition 3. 
The proposed alternating suboptimal dynamic programming feature selection approach runs in O(n⁴) time in the worst case, where n is the number of graph vertices (features).
Proof 
The double sum in lines 4 and 7 of Algorithm 1 contributes O(n²) time. In both cases, it is performed within the min function, which considers O(n) values. The ProcessVertex function thus requires O(n) · O(n²) = O(n³) time. It is called O(n) times in line 11 of Algorithm 2, resulting in O(n⁴) time per single iteration. Although the number of iterations (the loop of lines 6-24) is by default limited to 100, it rarely exceeds ten and practically never 15, so its time consumption may be considered constant, i.e., O(1), and the overall worst-case time complexity is proven to be O(n⁴). □
Proposition 4. 
Both considered approaches to feature selection, i.e., the graph-cut-based and the alternating suboptimal dynamic programming algorithm, require O(n²) space, where n is the number of graph vertices (features).
Proof 
In the graph-cut-based approach, the graph contains n vertices and at most n(n − 1)/2 edges. Similarly, there are n + 2 vertices and (n + 2)(n + 1)/2 − 1 edges in the ASDP approach. Furthermore, the n + 2 sets S_i and pr(S_i), each with O(n) elements, also do not exceed O(n²) space. The overall space complexity is thus O(n²). □

4. Results

4.1. Validation Setup

The proposed method based on alternating suboptimal dynamic programming (ASDP) and the exhaustive search algorithm (brute force, BF) were implemented using C++, while the graph-cut-based feature selection (Graph-FS) was implemented using Python 3.11.5 on the Microsoft® Windows 11 operating system. All experiments were conducted on a workstation with an Intel® Core™ i5 CPU and 16 GB of main memory. The algorithms are not yet integrated into a common application, but the results of the Graph-FS prefiltering are imported into the ASDP and BF methods via text files. The reproducibility of classification experiments is provided through the scikit-learn 1.4.1 implementation of machine learning methods. Classifiers were implemented with the following settings (a sketch of this setup is given after the list):
  • K-Nearest neighbors classifier (KNN) was assessed using the default settings, where K ∈ {2, 3, …, 8} were tested;
  • Naive Bayes classifier (NBC) was used with the default settings;
  • Random Forest (RF) was of maximal depth from the range {2, 4, 8, 16, 20}, while the maximal number of iterations was from {5, 10, 15, 20, 25, 30};
  • XGBOOST was of maximal depth from the range {2, 4, 8, 16, 20}, while the maximal number of iterations was from {5, 10, 15, 20, 25, 30}.
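The setup above could be reproduced with scikit-learn along the following lines. This is a sketch, not the authors' script; in particular, mapping the "maximal number of iterations" of RF and XGBOOST to n_estimators is our assumption:

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

tree_grid = {"max_depth": [2, 4, 8, 16, 20], "n_estimators": [5, 10, 15, 20, 25, 30]}
classifiers = {
    "KNN": GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(2, 9))}, cv=10),
    "NBC": GaussianNB(),                                    # default settings
    "RF": GridSearchCV(RandomForestClassifier(), tree_grid, cv=10),
    "XGBOOST": GridSearchCV(XGBClassifier(), tree_grid, cv=10),
}
# Usage on a feature subset `cols` chosen by Graph-FS/ASDP (X, y: dataset arrays):
# acc = cross_val_score(classifiers["RF"], X[:, cols], y, cv=10).mean()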
The ASDP and BF evaluation and the classification accuracy assessment were conducted on nine well-known benchmark datasets, available at the UCI machine learning repository []. Table 1 summarises the characteristics of each dataset, including its name and the number of features, classes, and samples contained.
Table 1. Description of test datasets.
These datasets were chosen to demonstrate the diversity of real-world applications of the proposed methods. For example, while Ds2 presents the utility of feature selection for financial institutions, Ds3 and Ds9 show that feature selection is also beneficial for medical research. Furthermore, to prove the proposed method’s efficiency across different datasets and scenarios, examples with various numbers of features and samples were considered. Below, we also show the consistency and robustness of the proposed methods, as the results do not deviate from the expected ones either in the case of Ds6, which contains 60 features but only 208 samples, or in the case of Ds5, which contains 16 features and 20,000 samples.
Each run of the ASDP and BF evaluation test consists of 125 experiments, employing 5 · 5 · 5 triplets of parameters (w, shft_Δ, shft_P), where w ∈ {0, 0.25, 0.5, 0.75, 1}, shft_Δ ∈ {0, med_{1/4}(Δ), med(Δ), med_{3/4}(Δ), 1}, and shft_P ∈ {0, med_{1/4}(P), med(P), med_{3/4}(P), 1}. Here, med(Δ) is the median of Δ(f_i), 0 < i ≤ n, while med_{1/4}(Δ) and med_{3/4}(Δ) are the medians of the lower and higher half-sequences of Δ(f_i), respectively. In a similar manner, the medians med_{1/4}(P), med(P), and med_{3/4}(P) are determined (a small sketch follows).
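The three medians of a weight sequence could be derived as follows (an illustrative sketch; the exact handling of odd-length sequences is our assumption):

import numpy as np

def half_medians(values):
    """Return (med_1/4, med, med_3/4): the median of the sorted values and the
    medians of their lower and higher half-sequences."""
    v = np.sort(np.asarray(values))
    lower, upper = v[: len(v) // 2], v[(len(v) + 1) // 2:]
    return np.median(lower), np.median(v), np.median(upper)

# Candidate shifts, e.g., for shft_Delta: {0, *half_medians(delta), 1}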

4.2. Assessment of Scores of the Alternating Suboptimal Dynamic Programming Algorithm

The main question with respect to the ASDP method development was how much it could improve the solution compared to a single iteration of suboptimal dynamic programming (SDP-1). At the same time, it is reasonable to compare the extent to which ASDP and SDP-1 achieve the global optimum provided by the BF approach. The results of the analysis are summarized in Table 2. The three main conclusions are listed below the table.
Table 2. Comparison of scores obtained by BF, SDP-1, and the ASDP method.
  • The third column shows that SDP-1 reaches the global optimum in 62.6% of the tests. The fourth column then shows that ASDP significantly raises this percentage to 95.8%.
  • The degree of match (61.4%) between the SDP-1 and ASDP scores in the fifth column should not be below that between SDP-1 and BF (62.6%) since ASDP never degrades the score from the first iteration, according to Proposition 1. Indeed, if we ignore rows Ds4, Ds6, Ds7, and Ds9, where we could not evaluate BF, we also obtain 62.6% for ASDP (in brackets). Interestingly, at least for the tests performed, a conclusion can be drawn that whenever ASDP fails to reach the global optimum in the first iteration, it improves the score at least a little in subsequent iterations.
  • The last two columns confirm the empirical finding from the proof of Proposition 3 that the number of iterations of ASDP is within O(1): in the tests performed, it does not exceed 11, and on average it is only 3.7, barely above the termination condition of 3 consecutive iterations with an unchanged score.
In order to further improve the results and, in particular, the feasibility in situations with a larger number of features, we preprocessed ASDP with fast and highly accurate, though still suboptimal, Graph-FS. The results are shown in Table 3, and the critical observations are listed immediately below.
Table 3. Comparison of scores of Graph-FS filtering used alone or postprocessed by BF or ASDP.
  • The second column confirms a significantly lower number of features than before the use of Graph-FS (see Table 1).
  • The fourth column shows that BF did not change the Graph-FS results in 38.7% of the tests. In other words, BF obtained a better score in the remaining 61.3% of cases.
  • The fifth column gives the first impression that ASDP performs significantly worse (34.8% vs. 38.7%) compared to BF. However, eliminating all tests on the Ds7 dataset, where BF was not viable, makes both scores equal. Since ASDP cannot, according to Proposition 1 and the initialization from Equation (15), spoil the initial score, we may also conclude here that the score was strictly improved in the remaining 61.3% of tests. However, a better ASDP score obtained with Equation (14) does not necessarily imply better results in practical applications. We will show this in Section 4.3 by matching the ASDP score with the classification accuracy.
  • The sixth column shows that preprocessing of ASDP with Graph-FS raises the proportion of solutions reaching the global optimum from 95.8% in Table 2 to 98%.
  • The last two columns show a maximum number of iterations of 12 and a lower average number of iterations (3.4, compared to 3.7 from Table 2).

4.3. Assessment of the Use of the Proposed Approach in Classification Tasks

In this section, we demonstrate the usability of Graph-FS and ASDP for feature selection in classification tasks on the real benchmark datasets listed in Table 1. For this purpose, we compared the classification performance of the features selected by both presented methods and their combination (Graph-FS + ASDP) with the performance of the same classifiers learned on the complete input feature set. The results are shown in Table 4, Table 5, Table 6 and Table 7 for each specific classifier used. All tests were conducted by ten-fold cross-validation [], using the average accuracy acc to indicate the method’s efficiency. The accuracy is defined by Equation (16):
acc = (number of correctly classified samples) / (number of all classified samples).
Table 4. Accuracies for RF classifier after the feature selection with Graph-FS, ASDP, their combination, or when using all input features.
Table 5. Accuracies for XGBOOST classifier after Graph-FS, ASDP, Graph-FS + ASDP, or when using all input features.
Table 6. Accuracies for NBC after the feature selection with Graph-FS, ASDP, their combination, or when using all input features.
Table 7. Accuracies for KNN classifier after Graph-FS, ASDP, Graph-FS + ASDP, or when using all input features.
Note that the acc values in the tables represent the highest achieved classification results. Namely, in all test cases, all combinations of the classifier’s parameter values (see Section 4.1) were tested, except for the NBC. The latter is a non-parametric method, and it was used with the default settings. We also report the number of selected features and the parameters T_Δ and T_p used in the Graph-FS and Graph-FS + ASDP methods while obtaining the listed highest results. Since identical results were typically obtained for different combinations, we do not list the ASDP parameters w, shft_Δ, and shft_P. Table 1 gives the number of input features. The highest accuracy for each dataset is emphasized in bold. Here, we considered that the same accuracy can be achieved across different methods, regardless of the selected features.
The analysis shows an improvement in accuracy over the original dataset for all test cases except Ds1 with the RF classifier. Furthermore, Graph-FS and ASDP achieved similar classification scores. Graph-FS showed slightly higher accuracy than ASDP for Ds2, Ds3, Ds5, and Ds8 with the RF classifier, while the two methods matched in the cases of Ds4, Ds5, Ds6, and Ds9. For the XGBOOST classifier, similar results were obtained, with Graph-FS slightly better in classification accuracy than ASDP in the cases of Ds1, Ds2, Ds3, Ds7, and Ds8. In the case of the NBC, Graph-FS achieved the best results for Ds2, Ds3, Ds5, and Ds8, while for Ds8, ASDP provided the most informative feature subset, achieving the highest accuracy among those in the comparison. We observed different results for the last classifier, KNN, where ASDP showed superior performance, achieving the highest accuracy for Ds1, Ds3, Ds7, and Ds8.
Conversely, when comparing ASDP and Graph-FS + ASDP, we noticed improved classification performance of the selected classifiers in some cases. For example, for Ds4 with the RF classifier, we achieved the highest classification accuracy with Graph-FS + ASDP for a selected feature subset that contains only two features, while Graph-FS and ASDP achieved the same results with subsets of 10 and 14 features, respectively. Similar results can be found for Ds2 and Ds6 across all classifiers, for Ds1 with the NBC, and for Ds2 and Ds3 with the KNN classifier, where the combination of Graph-FS and ASDP achieved the highest measured accuracy but with a smaller number of features than Graph-FS and ASDP individually. The most interesting result is the one for Ds7 with the NBC, where Graph-FS and ASDP combined achieved the highest accuracy among all the measured results.
Finally, the results demonstrate the robustness of both approaches, as no significant deviations regarding the improvements were displayed in experiments with various datasets with different numbers of features or samples. Both ASDP and Graph-FS + ASDP achieved comparable results regardless of the number of features, which can be low (e.g., Ds1 and Ds3) or high (e.g., Ds7 and Ds9). In addition, both approaches showed improvements in classification accuracy on datasets containing both small and large numbers of samples.

5. Discussion

This paper introduces an alternating suboptimal dynamic programming (ASDP) algorithm, primarily aimed at improving feature selection, at least in some cases, and being competitive in others. It iteratively considers individual features and inverts the processing order in each iteration. This allows the optimization function to be improved by using the score from the previous iteration to estimate the contribution of yet unprocessed features in the current one. We proved that convergence is achieved and that the time complexity is polynomial (O(n⁴)). Results on nine well-known benchmark datasets for machine learning tasks demonstrated that single-iteration suboptimal dynamic programming found the global optimum in 62.6% of cases, which was significantly improved to 95.8% by ASDP in only 3.7 iterations on average (and never above 12). Although ASDP is relatively slow and thus limited to 200-300 input features, we have extended its usability by preprocessing with our fast and highly accurate graph-cut-based feature selection (Graph-FS) method. This raised the proportion of solutions reaching the global optimum to 98% and reduced the average number of iterations to 3.4.
We have also shown the practicality of using ASDP and the Graph-FS + ASDP combination in classification. The latter was slightly behind or equal to the Graph-FS alone when using the RF or XGBOOST classifiers and sometimes slightly better when using the NBC. The former seems contradictory to the proven convergence of ASDP, but the optimization criterion of ASDP and the classification accuracy of the used classifiers do not guarantee the perfect consistency of results. Surprisingly, the ASDP method without Graph-FS prefiltering performed best when using the KNN classifier. Finally, in all but one case for RF, the presented methods achieved better classification accuracy than the classifiers learned from the complete input feature set. Note that the superior performance of Graph-FS in comparison to state-of-the-art approaches was already demonstrated in []. We may thus conclude that ASDP and Graph-FS + ASDP are also entirely competitive.
The four contributions of the proposed method, listed in Section 1, were justified as follows. The first was confirmed by the proof of Proposition 1 and by the results in Table 2. Table 2 also confirmed the second promised contribution, which was further exceeded by the results in Table 3. The third contribution was confirmed by the proof of Proposition 3, as well as by the fact that the BF score in some cases in Table 2 could not be determined due to excessive time complexity. The fourth contribution was confirmed by the experiments in Section 4.3, in particular by the results in Table 6 and Table 7.
A disadvantage of using ASDP without preprocessing is that a larger number of features makes the method too slow or, depending on the implementation, even infeasible. It processes 200 features in 5 s on a regular PC and becomes useless at 500 features. This represents a significant improvement compared to the exhaustive search approach, which reaches such limits at a very modest 25 and 30 features, respectively. However, for larger input sets, it makes sense to preprocess ASDP with some faster filtering. Conversely, Graph-FS + ASDP restricts the solution search space to subsets of the Graph-FS solution. We will try to achieve a compromise by cascading Graph-FS over 2-5 iterations, in which in each iteration, we will gradually lower the thresholds T_Δ and T_p and extend the selected set with features chosen from those not yet in the solution. We would also like to evaluate the use of ASDP in regression tasks in the future. In addition, we expect that the idea of alternating suboptimal optimization will soon be generalized to tasks beyond feature selection as well. In general, graph nodes can represent a wide variety of entities, and edges can represent any bilateral operation, such as distance, similarity, or correlation.

Author Contributions

Conceptualization, D.P. and D.V.; methodology, D.M. and D.V.; software, D.P. and D.V.; validation, D.P., D.V. and B.Ž.; formal analysis, D.V. and B.Ž.; investigation, D.P., D.V., D.M. and B.Ž.; resources, D.V.; data curation, D.V.; writing—original draft preparation, D.P., D.V. and B.Ž.; writing—review and editing, D.P. and D.M.; visualization, D.P. and D.V.; supervision, B.Ž. and D.M.; project administration, B.Ž.; funding acquisition, B.Ž. and D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Slovene Research and Innovation Agency under Research Project J2-4458 and Research Programme P2-0041.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADP: Approximate/Adaptive Dynamic Programming
ASDP: Alternating Suboptimal Dynamic Programming
BF: Brute Force
CPU: Central Processing Unit
DP: Dynamic Programming
Graph-FS: Graph-Cut-Based Feature Selection
IDP: Iterative Dynamic Programming
KNN: K-Nearest Neighbours Classifier
LASSO: Least Absolute Shrinkage and Selection Operator
MDP: Markov Decision Process
NBC: Naive Bayes Classifier
RF: Random Forest
RL: Reinforcement Learning
SDP-1: Single Iteration of Suboptimal Dynamic Programming
UCI: University of California Irvine Machine Learning Repository
XGBOOST: Extreme Gradient Boosting

References

  1. Liu, H.; Motoda, H. Feature Selection for Knowledge Discovery and Data Mining; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1998. [Google Scholar]
  2. Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
  3. Kumar, V.; Minz, S. Feature selection: A literature Review. SmartCR 2014, 4, 211–229. [Google Scholar] [CrossRef]
  4. Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273–324. [Google Scholar] [CrossRef]
  5. Bellman, R. Dynamic programming. Princet. Univ. Press 1957, 89, 92. [Google Scholar]
  6. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2022. [Google Scholar]
  7. Liu, D.R.; Li, H.L.; Wang, D. Feature selection and feature learning for high-dimensional batch reinforcement learning: A survey. Int. J. Autom. Comput. 2015, 12, 229–242. [Google Scholar] [CrossRef]
  8. Kossmann, D.; Stocker, K. Iterative dynamic programming: A new class of query optimization algorithms. ACM Trans. Database Syst. 2000, 25, 43–82. [Google Scholar] [CrossRef]
  9. Vlahek, D.; Mongus, D. An Efficient Iterative Approach to Explainable Feature Learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 2606–2618. [Google Scholar] [CrossRef] [PubMed]
  10. Forman, G. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. J. Mach. Learn. Res. 2003, 3, 1289–1305. [Google Scholar]
  11. Fakhraei, S.; Soltanian-Zadeh, H.; Fotouhi, F. Bias and Stability of Single Variable Classifiers for Feature Ranking and Selection. Expert Syst. Appl. 2014, 41, 6945–6958. [Google Scholar] [CrossRef] [PubMed]
  12. Liu, H.; Motoda, H. Computational Methods of Feature Selection; Chapman & Hall/CRC: Boca Raton, FL, USA, 2007; p. 440. [Google Scholar]
  13. Gu, Q.; Li, Z.; Han, J. Generalized Fisher Score for Feature Selection. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, UAI 2011, Barcelona, Spain, 14–17 July 2012; pp. 266–273. [Google Scholar]
  14. Li, H.; Jiang, T.; Zhang, K. Efficient and robust feature extraction by maximum margin criterion. In Proceedings of the Advances in Neural Information Processing Systems, Whistler, BC, Canada, 8–13 December 2003; Volume 16. [Google Scholar]
  15. He, X.; Cai, D.; Niyogi, P. Laplacian Score for Feature Selection. In Proceedings of the 18th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 5–8 December 2005; pp. 507–514. [Google Scholar]
  16. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2011; p. 744. [Google Scholar]
  17. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 2006; p. 792. [Google Scholar]
  18. Verleysen, M.; Rossi, F.; François, D. Advances in Feature Selection with Mutual Information. In Similarity-Based Clustering: Recent Developments and Biomedical Applications; Biehl, M., Hammer, B., Verleysen, M., Villmann, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 52–69. [Google Scholar]
  19. Breiman, L.; Friedman, J.; Stone, C.; Olshen, R. Classification and Regression Trees; Wadsworth International Group: Belmont, CA, USA, 1984. [Google Scholar]
  20. Strobl, C.; Boulesteix, A.L.; Augustin, T. Unbiased split selection for classification trees based on the Gini Index. Comput. Stat. Data Anal. 2007, 52, 483–501. [Google Scholar] [CrossRef]
  21. Raileanu, L.; Stoffel, K. Theoretical Comparison between the Gini Index and Information Gain Criteria. Ann. Math. Artif. Intell. 2004, 41, 77–93. [Google Scholar] [CrossRef]
  22. Krakovska, O.; Christie, G.; Sixsmith, A.; Ester, M.; Moreno, S. Performance comparison of linear and non-linear feature selection methods for the analysis of large survey datasets. PLoS ONE 2019, 14, e0213584. [Google Scholar] [CrossRef] [PubMed]
  23. Frénay, B.; Doquire, G.; Verleysen, M. Is mutual information adequate for feature selection in regression? Neural Netw. 2013, 48, 1–7. [Google Scholar] [CrossRef] [PubMed]
  24. Bishop, C.M. Pattern Recognition and Machine Learning (Information Science and Statistics); Springer: Berlin/Heidelberg, Germany, 2006; p. 728. [Google Scholar]
  25. Bell, D.; Wang, H. A Formalism for Relevance and Its Application in Feature Subset Selection. Mach. Learn. 2000, 41, 175–195. [Google Scholar] [CrossRef]
  26. Kira, K.; Rendell, L.A. A Practical Approach to Feature Selection. In Proceedings of the Ninth International Workshop on Machine Learning, San Francisco, CA, USA, 1–3 July 1992; pp. 249–256. [Google Scholar]
  27. Kononenko, I.; Šimec, E.; Robnik-Šikonja, M. Overcoming the myopia of inductive learning algorithms with RELIEFF. Appl. Intell. 1997, 7, 39–55. [Google Scholar] [CrossRef]
  28. Hall, M.A. Correlation-Based Feature Selection for Machine Learning. Ph.D. Thesis, The University of Waikato, Hamilton, New Zealand, 1999. [Google Scholar]
  29. Yu, L.; Liu, H. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; pp. 856–863. [Google Scholar]
  30. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef]
  31. Garcia-Ramirez, I.A.; Calderon-Mora, A.; Mendez-Vazquez, A.; Ortega-Cisneros, S.; Reyes-Amezcua, I. A novel framework for fast feature selection based on multi-stage correlation measures. Mach. Learn. Knowl. Extr. 2022, 4, 131–149. [Google Scholar] [CrossRef]
  32. Wang, L.; Zhou, N.; Chu, F. A General Wrapper Approach to Selection of Class-Dependent Features. IEEE Trans. Neural Netw. 2008, 19, 1267–1278. [Google Scholar] [CrossRef]
  33. Oliveira, L.S.; Sabourin, R.; Bortolozzi, F.; Suen, C.Y. A methodology for feature selection using multiobjective genetic algorithms for handwritten digit string recognition. Int. J. Pattern Recognit. Artif. Intell. 2003, 17, 903–929. [Google Scholar] [CrossRef]
  34. Jesenko, D.; Mernik, M.; Žalik, B.; Mongus, D. Two-Level Evolutionary Algorithm for Discovering Relations between Nodes Features in a Complex Network. Appl. Soft Comput. 2017, 56, 82–93. [Google Scholar] [CrossRef]
  35. Chuang, L.Y.; Chang, H.W.; Tu, C.J.; Yang, C.H. Improved binary PSO for feature selection using gene expression data. Comput. Biol. Chem. 2008, 32, 29–38. [Google Scholar] [CrossRef] [PubMed]
  36. Schiezaro, M.; Pedrini, H. Data feature selection based on Artificial Bee Colony algorithm. EURASIP J. Image Video Process. 2013, 47, 1–8. [Google Scholar] [CrossRef]
  37. Narendra, P.; Fukunaga, K. A Branch and Bound Algorithm for Feature Subset Selection. IEEE Trans. Comput. 1977, C-26, 917–922. [Google Scholar] [CrossRef]
  38. Gheyas, I.A.; Smith, L.S. Feature subset selection in large dimensionality domains. Pattern Recognit. 2010, 43, 5–13. [Google Scholar] [CrossRef]
  39. Somol, P.; Pudil, P.; Novovicová, J.; Paclík, P. Adaptive floating search methods in feature selection. Pattern Recognit. Lett. 1999, 20, 1157–1163. [Google Scholar] [CrossRef]
  40. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
  41. Zhao, P.; Yu, B. On model selection consistency of Lasso. J. Mach. Learn. Res. 2006, 7, 2541–2563. [Google Scholar]
  42. Buteneers, P.; Caluwaerts, K.; Dambre, J.; Verstraeten, D.; Schrauwen, B. Optimized parameter search for large datasets of the regularization parameter and feature selection for ridge regression. Neural Process. Lett. 2013, 38, 403–416. [Google Scholar] [CrossRef]
  43. Nelson, G.D.; Levy, D.M. A Dynamic Programming Approach to the Selection of Pattern Features. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 145–151. [Google Scholar] [CrossRef]
  44. Acır, N. Classification of ECG beats by using a fast least square support vector machines with a dynamic programming feature selection algorithm. Neural Comput. Appl. 2005, 14, 299–309. [Google Scholar] [CrossRef]
  45. Cheung, R.; Eisenstein, B. Feature selection via dynamic programming for text-independent speaker identification. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 397–403. [Google Scholar] [CrossRef]
  46. Moudani, W.; Shahin, A.; Shakik, F.; Mora-Camino, F. Dynamic programming applied to rough sets attribute reduction. J. Inf. Optim. Sci. 2013, 32, 1371–1397. [Google Scholar] [CrossRef]
  47. Bertsekas, D.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Nashua, NH, USA, 1996. [Google Scholar]
  48. Approximate Dynamic Programming. Available online: https://deepgram.com/ai-glossary/approximate-dynamic-programming (accessed on 23 April 2024).
  49. Mes, M.; Perez Rivera, A. Approximate Dynamic Programming by Practical Examples. In Markov Decision Processes in Practice; Boucherie, R., van Dijk, N.M., Eds.; Number 248; Springer: Berlin/Heidelberg, Germany, 2017; pp. 63–101. [Google Scholar]
  50. Loxley, P.N.; Cheung, K.W. A dynamic programming algorithm for finding an optimal sequence of informative measurements. Entropy 2023, 25, 251. [Google Scholar] [CrossRef] [PubMed]
  51. Petrik, M.; Taylor, G.; Parr, R.; Zilberstein, S. Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes. In 27th International Conference on Machine Learning (ICML 2010); Fürnkranz, J., Joachims, T., Eds.; Omnipress: Madison, WI, USA, 2010; pp. 871–878. [Google Scholar]
  52. Preux, P.; Girgin, S.; Loth, M. Feature discovery in approximate dynamic programming. In Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, Nashville, TN, USA, 30 March–2 April 2009; pp. 109–116. [Google Scholar]
  53. Papadaki, K.P.; Powell, W.B. Exploiting structure in adaptive dynamic programming algorithms for a stochastic batch service problem. Eur. J. Oper. Res. 2002, 142, 108–127. [Google Scholar] [CrossRef]
  54. Luus, R. Optimal control by dynamic programming using systematic reduction in grid size. Int. J. Control 1990, 51, 995–1013. [Google Scholar] [CrossRef]
  55. Lock, J.; McKelvey, T. A computationally fast iterative dynamic programming method for optimal control of loosely coupled dynamical systems with different time scales. IFAC-PapersOnLine 2017, 50, 5953–5960. [Google Scholar] [CrossRef]
  56. Lincoln, B.; Rantzer, A. Suboptimal dynamic programming with error bounds. In Proceedings of the 41st IEEE Conference on Decision and Control, Las Vegas, NV, USA, 10–13 December 2002; Volume 2, pp. 2354–2359. [Google Scholar]
  57. Lincoln, B.; Rantzer, A. Relaxing dynamic programming. IEEE Trans. Control Syst. Technol. 2006, 51, 1249–1260. [Google Scholar] [CrossRef]
  58. Rantzer, A. Relaxed dynamic programming in switching systems. IEE Proc.-Control Theory Appl. 2006, 153, 567–574. [Google Scholar] [CrossRef]
  59. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu (accessed on 23 April 2024).
  60. Alpaydin, E. Introduction to Machine Learning; MIT Press: Cambridge, MA, USA, 2010; p. 537. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
