E2H Distance-Weighted Minimum Reference Set for Numerical and Categorical Mixture Data and a Bayesian Swap Feature Selection Algorithm

Omae, Yuto; Mori, Masaya

doi:10.3390/make5010007


by Yuto Omae * and Masaya Mori

College of Industrial Technology, Nihon University, Chiba 275-8575, Japan

* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2023, 5(1), 109-127; https://doi.org/10.3390/make5010007

Submission received: 29 November 2022 / Revised: 26 December 2022 / Accepted: 30 December 2022 / Published: 11 January 2023

(This article belongs to the Special Issue Recent Advances in Feature Selection)


Abstract:

Generally, when developing classification models using supervised learning methods (e.g., support vector machine, neural network, and decision tree), feature selection, as a pre-processing step, is essential to reduce calculation costs and improve the generalization scores. In this regard, the minimum reference set (MRS), which is a feature selection algorithm, can be used. The original MRS considers a feature subset as effective if it leads to the correct classification of all samples by using the 1-nearest neighbor algorithm based on small samples. However, the original MRS is only applicable to numerical features, and the distances between different classes cannot be considered. Therefore, herein, we propose a novel feature subset evaluation algorithm, referred to as the “E2H distance-weighted MRS,” which can be used for a mixture of numerical and categorical features and considers the distances between different classes in the evaluation. Moreover, a Bayesian swap feature selection algorithm, which is used to identify an effective feature subset, is also proposed. The effectiveness of the proposed methods is verified based on experiments conducted using artificially generated data comprising a mixture of numerical and categorical features.

Keywords:

feature subset selection; minimum reference set; classification; machine learning; Bayesian optimization

1. Introduction

Generally, classification tasks are executed using supervised learning models, such as support vector machines, neural networks, and decision trees. However, when developing classification models, it is essential to use only the features that are effective for classification. This process is called feature selection, and it is known to reduce the calculation time and improve the estimation accuracy [1,2]. Therefore, several feature selection algorithms have been proposed in the field of machine learning; these include out-of-bag error [3], inter–intra class distance ratio [4,5], genetic algorithms [6,7], bagging [8,9], CART [10,11], Lasso [12,13], BoLasso [14], ReliefF [15,16], and atom search [17].

In this study, we consider the minimum reference set (MRS) [18], another feature selection algorithm, for developing a high-quality classification model. In the MRS, we select a feature subset $F^{'}$ from the feature set $F$ and calculate the Euclidean distances for all pairs of samples belonging to different classes. Next, we test whether all samples are classified correctly by the 1-nearest neighbor (1NN) algorithm when only the paired samples with small distances are used as references. If all samples are classified correctly using a small number of reference samples, we regard the corresponding feature subset as desirable. In other words, the number of reference samples that leads to no classification error is the evaluation value of the feature subset $F^{'}$. Additional details on the MRS can be found in [18]. We consider the MRS to be reliable because it has been adopted in previous studies for feature selection [19,20,21,22]. However, the MRS presents the following two limitations:

One of these limitations is represented as “Issue 1” in Figure 1. Here, (A1) and (A2) denote feature spaces with values $x_{1}$ and $x_{2}$, and the blue circles and red crosses represent the observation samples of two different classes. In both cases, all samples can be classified correctly using 1NN based only on the samples enclosed in the square boxes. Notably, large distances between different classes in a feature space are often desirable to achieve a high generalization score. Therefore, feature space (A2) is better than space (A1). However, the original MRS [18] assigns the same evaluation (i.e., six) to (A1) and (A2), even though feature space (A2) is preferable. We consider this an issue because the MRS is intended to identify a desirable feature subset for the classification problem.

The second issue is indicated as “Issue 2” in Figure 1. The figure presents a three-dimensional feature space consisting of two numerical features $x_{1}, x_{2} \in R$ and a categorical feature $x_{3} \in \{⧫, ♣, ♠\}$. Here, when $x_{3} = ⧫$, the samples are correctly classified by the numerical features $x_{1}$ and $x_{2}$. In certain cases, categorical features, such as $x_{3}$, are also effective for classification. However, the original MRS [18] can only evaluate numerical features because it is based on the Euclidean distance. Although we can transform the categorical values into numerical values via one-hot encoding, a large number of categories often leads to an increase in the number of dimensions [23]. When the sample size and the number of dimensions are a and r, respectively, the time complexity of the 1NN algorithm based on the brute-force search is $O (a r)$ [24]. Moreover, although the k-d tree [25] is a fast algorithm, it is affected by dimensionality [26]. An increase in the number of dimensions r increases the computational time, even for an algorithm with a time complexity of $O (r \log r + \log a)$ [26]. Therefore, we consider the application of one-hot encoding to categorical features undesirable because it increases the dimensionality.

Therefore, herein, we propose a novel feature subset evaluation algorithm, called the E2H distance-weighted MRS (E2H MRS), to address the two aforementioned issues. Here, “E2” and “H” denote the squared Euclidean distance and the Hamming distance, respectively. Using the mixture distance “E2H,” we can measure the distances between samples in a feature space comprising a mixture of numerical and categorical values. Moreover, we propose a Bayesian swap feature selection algorithm (BS-FS) to identify a subset of desirable features for classification. In this paper, we present the details of the E2H MRS and BS-FS algorithms.

2. Proposed Method

Herein, we explain the mathematical representation of feature subset selection and the proposed methods. The variables used are summarized in Appendix A.1 and Appendix A.2.

2.1. Mathematical Representation of Feature Subset Selection

Let us denote a set

F^{r}

consisting of

n^{r}

numerical features and a set

F^{c}

consisting of

n^{c}

categorical features as follows:

F^{r} = \{f_{1}^{r}, \dots, f_{n^{r}}^{r}\}, \quad F^{c} = \{f_{1}^{c}, \dots, f_{n^{c}}^{c}\} .

(1)

For instance,

f_{1}^{r} = “ age ”

,

f_{2}^{r} = “ height ”

,

f_{1}^{c} = “ male or female ”

,

f_{2}^{c} = “ blood type ”

, and so on. Because numerical and categorical features are mixed, the set of all features can be represented as

F = F^{r} \cup F^{c}, \quad | F | = n, \quad n = n^{r} + n^{c} .

(2)

The proposed algorithm determines a feature subset

F_{opt .}^{'}

consisting of m effective features from among all features

F

to estimate the class

z \in {z_{0}, z_{1}}

, that is,

F_{opt .}^{'} \subset F, \quad | F_{opt .}^{'} | = m, \quad m \leq n,

(3)

where

F_{opt .}^{'} = \underset{F^{'} \subset F}{\operatorname{argmin}} \; L (F^{'}) \quad \text{s.t.} \quad | F^{'} | = m .

(4)

Notably, the E2H MRS adopts a function L for evaluating the feature subset

F^{'}

. Because the feature subset

F^{'}

consists of a mixture of numerical and categorical features, let us denote the feature vector of class

z \in {z_{0}, z_{1}}

as

x^{z} = [x^{z, r} \;\; x^{z, c}]^{\top},

(5)

where

\begin{aligned} x^{z, r} & = [x_{1}^{z, r} \; \dots \; x_{p^{r}}^{z, r}]^{\top}, \\ x^{z, c} & = [x_{1}^{z, c} \; \dots \; x_{p^{c}}^{z, c}]^{\top}, \\ p^{r} + p^{c} & = m . \end{aligned}

(6)

Here,

x^{z, r}

is a vector consisting of

p^{r}

numerical features, and

x^{z, c}

is a vector consisting of

p^{c}

categorical features. As an example, consider four features for Titanic survival prediction [27,28]: “Sex”, “Embarked” (port of embarkation), “Age”, and “Fare”. In this case, the feature subset is $F^{'} = \{“Age”, “Fare”, “Sex”, “Embarked”\} \subset F$, and $p^{r} = 2$, $p^{c} = 2$, and $m = 4$ because “Age” and “Fare” are numerical features, whereas “Sex” and “Embarked” are categorical features. The feature vector $x^{z}$ consists of the values of these four features.

When the feature vector

x^{z}

consists of only categorical or numerical features,

(p^{c}, p^{r}) = (m, 0)

or

(p^{c}, p^{r}) = (0, m)

is satisfied. The feature vector of the i-th observation sample is defined as

x_{i}^{z}

.

2.2. E2H Distance-Weighted MRS Algorithm

The E2H MRS algorithm developed for evaluating the feature subset

F^{'}

is summarized in Algorithm 1. This algorithm outputs an evaluation

L (F^{'})

by inputting the feature subset

F^{'}

. After initialization, all pairwise distances

D (x^{z_{0}}, x^{z_{1}}; γ)

between the classes

z_{0}

and

z_{1}

are computed (line 5). Next, the distance set

D = {d_{1}, d_{2}, \dots}

is sorted by

D (x^{z_{0}}, x^{z_{1}}; γ)

(line 6). Although these processes are also included in the original MRS [18], only numerical features can be evaluated because the original MRS is based on the Euclidean distance. Therefore, we use another distance function to apply the MRS to the feature subset

F^{'}

comprising a mixture of numerical and categorical features. The definitions of

D (x^{z_{0}}, x^{z_{1}}; γ)

will be explained in Section 2.3.

Algorithm 1 E2H MRS feature evaluation algorithm

Input: Feature subset $F^{'}$, Hamming weight $γ$, distance weight $δ$
Output: Evaluation of the feature subset $L (F^{'})$
1: Standardize the numerical features in $F^{'}$ to the range from zero to one
2: Set initial data $I \leftarrow \emptyset$, i.e., the empty set
3: Set initial average distance $C (I) \leftarrow 0$
4: Set initial classification error of all data based on 1NN using $I$, $E (I) \leftarrow \infty$
5: Calculate all pairwise distances $D (x^{z_{0}}, x^{z_{1}}; γ)$ between different classes
6: Set $D \leftarrow \{d_{1}, d_{2}, \dots\}$, sorted from the smallest to the largest distance $D (x^{z_{0}}, x^{z_{1}}; γ)$
7: $k \leftarrow 1$
8: while $E (I) \neq 0$ do
9:     Identify different class samples i and j related to $D (x_{i}^{z_{0}}, x_{j}^{z_{1}}; γ) = d_{k}$
10:     if $\{i, j\} \not\subset I$ then
11:         Update sample set $I \leftarrow I \cup \{i, j\}$
12:         Update distance $C (I) \leftarrow C (I) + d_{k}$
13:     end if
14:     $k \leftarrow k + 1$
15: end while
16: Average distance $C (I) \leftarrow C (I) / | I |$
17: Score $S (I; δ) \leftarrow (1 - C (I))^{δ} | I |$
18: $L (F^{'}) \leftarrow S (I; δ)$
19: return $L (F^{'})$

Next, two samples

{i, j}

of different classes that are nearest to each other (i.e.,

d_{1}

) are added to the set

I

. We then test

E (I)

, which denotes the classification error resulting from the 1NN algorithm based on the set

I

. The proposed distance function is used in the 1NN algorithm because the feature subset consists of a mixture of numerical and categorical features. If the error rate is not zero, that is,

E (I) \neq 0

, paired samples

{i, j}

related to

d_{2}

are added to

I

, and the error rate

E (I)

is rechecked. Note that the evaluation value of the feature subset

F^{'}

is calculated when the error rate is zero, that is,

E (I) = 0

. The computation of the evaluation value in the original MRS [18] uses

| I |

, which is the size of the set

I

. This implies that the smaller the sample size, the better the evaluation of the feature subset by the original MRS. Although this method is valid, it does not consider the distances between different classes. To consider these distances, we propose a novel feature subset evaluation function

S (I; δ)

using

C (I)

, the average distance of

d_{k}

, which is obtained in the growth process of the set

I

. The evaluation value of the feature subset

L (F^{'})

is obtained using these processes. Details pertaining to

S (I; δ)

are explained in Section 2.4.
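The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names (`e2h_distance`, `one_nn_errors`, `e2h_mrs`) are hypothetical, each sample is assumed to store its $p^{r}$ numerical values first (already scaled to the range from zero to one) followed by its categorical values, and the class labels are assumed to be 0 and 1. The distance and the score follow the definitions given in Sections 2.3 and 2.4.

```python
from itertools import product


def e2h_distance(a, b, p_r, gamma):
    """Squared Euclidean distance on the first p_r entries plus gamma times the
    Hamming distance on the remaining entries, divided by p_r + gamma * p_c."""
    p_c = len(a) - p_r
    d_e2 = sum((x - y) ** 2 for x, y in zip(a[:p_r], b[:p_r]))
    d_h = sum(x != y for x, y in zip(a[p_r:], b[p_r:]))
    return (d_e2 + gamma * d_h) / (p_r + gamma * p_c)


def one_nn_errors(samples, labels, ref, p_r, gamma):
    """E(I): number of samples misclassified by 1NN using only the references."""
    errors = 0
    for x, z in zip(samples, labels):
        nearest = min(ref, key=lambda r: e2h_distance(x, samples[r], p_r, gamma))
        errors += labels[nearest] != z
    return errors


def e2h_mrs(samples, labels, p_r, gamma=1.0, delta=1.0):
    """Return (S(I; delta), |I|) for one feature subset."""
    idx0 = [i for i, z in enumerate(labels) if z == 0]
    idx1 = [j for j, z in enumerate(labels) if z == 1]
    if not idx0 or not idx1:
        raise ValueError("both classes need at least one sample")
    # Lines 5-6: all between-class distances, sorted from smallest to largest.
    pairs = sorted(
        (e2h_distance(samples[i], samples[j], p_r, gamma), i, j)
        for i, j in product(idx0, idx1)
    )
    ref, total = set(), 0.0
    for d, i, j in pairs:                      # lines 8-15
        if not {i, j} <= ref:                  # line 10: at least one new sample
            ref |= {i, j}
            total += d
            # E(I) can only change when I changes, so it is re-checked here.
            if one_nn_errors(samples, labels, sorted(ref), p_r, gamma) == 0:
                break
    # If the pairs run out before E(I) reaches zero (e.g., identical samples with
    # different labels), this sketch simply scores the final I.
    c = total / len(ref)                       # line 16
    return (1.0 - c) ** delta * len(ref), len(ref)   # line 17


# Toy data: one numerical and one categorical feature per sample.
toy_samples = [(0.0, "a"), (0.1, "a"), (0.9, "a"), (1.0, "a")]
toy_labels = [0, 0, 1, 1]
print(e2h_mrs(toy_samples, toy_labels, p_r=1, gamma=1.0, delta=1.0))
```

In this toy case the closest between-class pair already suffices for error-free 1NN classification, so the reference set contains two samples. Setting `delta=0.0` reduces the score to the size of the reference set.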

2.3. Distance Function

Let us now denote the distance between

x^{z_{0}}

and

x^{z_{1}}

, consisting of a mixture of numerical and categorical values, as

D (x^{z_{0}}, x^{z_{1}}; γ) = \frac{1}{p^{r} + γ p^{c}} \left( D^{E 2} (x^{z_{0}, r}, x^{z_{1}, r}) + γ D^{H} (x^{z_{0}, c}, x^{z_{1}, c}) \right), \quad γ \geq 0 .

(7)

The first and second terms represent the squared Euclidean distance and the Hamming distance, respectively, that is,

\begin{aligned} D^{E 2} (x^{z_{0}, r}, x^{z_{1}, r}) & = (x^{z_{0}, r} - x^{z_{1}, r})^{\top} (x^{z_{0}, r} - x^{z_{1}, r}) \\ & = \sum_{i = 1}^{p^{r}} (x_{i}^{z_{0}, r} - x_{i}^{z_{1}, r})^{2}, \end{aligned}

(8)

D^{H} (x^{z_{0}, c}, x^{z_{1}, c}) = \sum_{i = 1}^{p^{c}} σ (x_{i}^{z_{0}, c}, x_{i}^{z_{1}, c}),

(9)

where

σ (x_{i}^{z_{0}, c}, x_{i}^{z_{1}, c}) = \begin{cases} 0, & x_{i}^{z_{0}, c} = x_{i}^{z_{1}, c} \\ 1, & x_{i}^{z_{0}, c} \neq x_{i}^{z_{1}, c} \end{cases} .

(10)

In general, the Hamming distance is defined as the minimum number of substitutions required to change one string into another. In other words, it is the number of mismatches between two strings. Therefore, when regarding the

p^{c}

-dimensional categorical features vector as a string of length

p^{c}

, the number of mismatches between the categorical features vectors of two different classes can be represented by the Hamming distance. This number is calculated with Equations (9) and (10).

Moreover, we refer to

γ

as the “Hamming weight” because it is a weight parameter used for categorical features. The parameter

γ

is manually set by the user; a large value is appropriate when the user hypothesizes that categorical features are important for classification. When we set

γ = 0

, the effect of categorical features on distance disappears. The distance function defined by Equation (7) is similar to that used in the k-prototype algorithm [29].

The following theorem is satisfied for the proposed distance function

D (x^{z_{0}}, x^{z_{1}}; γ)

:

Theorem 1.

x^{z, r} \in [0, 1]^{p^{r}} \Rightarrow D (x^{z_{0}}, x^{z_{1}}; γ) \in [0, 1] .

Proof.

\begin{aligned} x^{z, r} \in [0, 1]^{p^{r}} & \Rightarrow \max_{x^{z_{0}, r}, x^{z_{1}, r}} D^{E 2} (x^{z_{0}, r}, x^{z_{1}, r}) = p^{r} \\ & \Rightarrow \max_{x^{z_{0}}, x^{z_{1}}} [D^{E 2} (x^{z_{0}, r}, x^{z_{1}, r}) + γ D^{H} (x^{z_{0}, c}, x^{z_{1}, c})] = p^{r} + γ p^{c} \quad \because \max_{x^{z_{0}, c}, x^{z_{1}, c}} D^{H} (x^{z_{0}, c}, x^{z_{1}, c}) = p^{c} \\ & \Rightarrow \max_{x^{z_{0}}, x^{z_{1}}} D (x^{z_{0}}, x^{z_{1}}; γ) = 1 \\ & \Rightarrow D (x^{z_{0}}, x^{z_{1}}; γ) \in [0, 1] \quad \because \min_{x^{z_{0}}, x^{z_{1}}} [D^{E 2} (x^{z_{0}, r}, x^{z_{1}, r}) + γ D^{H} (x^{z_{0}, c}, x^{z_{1}, c})] = 0 \end{aligned}

□

In other words, the range of

D (x^{z_{0}}, x^{z_{1}}; γ)

extends from zero to one when the condition

x^{z, r} \in {[0, 1]}^{p^{r}}

is satisfied. The process involved in the standardization of numerical features from zero to one is presented in line 1 of Algorithm 1.
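As a small worked example of Equations (7) to (10) and Theorem 1, the following Python sketch scales two numerical columns to the range from zero to one and computes the E2H distance for a few sample pairs. The function names and the passenger values are hypothetical and chosen only for illustration; mapping a constant column to zero is likewise an assumption of this sketch.

```python
def min_max_scale(column):
    """Rescale one numerical column to [0, 1] (line 1 of Algorithm 1)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: map every value to 0 (assumption)
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def e2h_distance(x_r, x_c, y_r, y_c, gamma=1.0):
    """Equation (7): weighted sum of the squared Euclidean distance on the
    numerical parts and the Hamming distance on the categorical parts."""
    p_r, p_c = len(x_r), len(x_c)
    if p_r + gamma * p_c == 0:
        raise ValueError("distance is undefined when p_r + gamma * p_c is zero")
    d_e2 = sum((a - b) ** 2 for a, b in zip(x_r, y_r))  # Equation (8)
    d_h = sum(a != b for a, b in zip(x_c, y_c))         # Equations (9) and (10)
    return (d_e2 + gamma * d_h) / (p_r + gamma * p_c)


# Hypothetical passengers: (age, fare) numerical, (sex, embarked) categorical.
ages = min_max_scale([20.0, 50.0, 80.0])
fares = min_max_scale([10.0, 10.0, 30.0])
x = ([ages[0], fares[0]], ["male", "S"])
y = ([ages[1], fares[1]], ["male", "C"])
far = ([ages[2], fares[2]], ["female", "C"])

print(e2h_distance(*x, *y, gamma=1.0))    # numerical and categorical parts
print(e2h_distance(*x, *y, gamma=0.0))    # categorical part ignored
print(e2h_distance(*x, *far, gamma=1.0))  # every feature differs maximally
```

The last pair differs maximally in every feature, so its distance reaches the upper bound of one stated in Theorem 1.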

2.4. Evaluation Function of a Feature Subset

Here, we explain the evaluation function

L (F^{'})

of the feature subset

F^{'}

. In the original MRS [18], the sample size

| I |

of set

I

leading to the correct classification (no error) of all samples is adopted as an evaluation function of a feature subset. By including the distances between different classes in the original MRS, we propose

S (I; δ) = (1 - C (I))^{δ} | I |, \quad δ \geq 0

(11)

as a novel feature subset evaluation function (line 17 in Algorithm 1). For the proposed evaluation function

S (I; δ)

, the following theorem is satisfied:

Theorem 2.

x^{z, r} \in [0, 1]^{p^{r}} \land δ \geq 0 \Rightarrow S (I; δ) \in [0, | I |] .

Proof.

\begin{aligned} x^{z, r} \in [0, 1]^{p^{r}} & \Rightarrow D (x^{z_{0}}, x^{z_{1}}; γ) \in [0, 1] \quad \because \text{Theorem 1} \\ & \Rightarrow C (I) \in [0, 1] \quad \because C (I) \text{ is an average of } D (x^{z_{0}}, x^{z_{1}}; γ) \\ & \Rightarrow (1 - C (I))^{δ} \in [0, 1] \quad \because δ \geq 0 \\ & \Rightarrow S (I; δ) \in [0, | I |] \end{aligned}

□

Note that the range of evaluation

S (I; δ)

extends from zero to

| I |

if the range of the numerical features

x^{z, r}

extends from zero to one.

C (I)

is the average distance of set

I

and is obtained using the different class distances

d_{k}

represented in lines 12 and 16 of Algorithm 1. The range of

C (I)

extends from zero to one because it denotes the average of

D (x^{z_{0}}, x^{z_{1}}; γ)

based on Theorem 1. Therefore, Theorem 2 is satisfied. Moreover, we understand that

{(1 - C (I))}^{δ}

is a damping coefficient for

| I |

, based on Equation (11).

δ

is referred to as the “distance weight” because it is a parameter used for adjusting the damping coefficient based on the distance

C (I)

. In a typical classification problem, the distance between different classes in a feature space should be large to decrease classification errors. In some works, feature spaces leading to a large distance between different classes were used for classification [30,31]. Therefore, we included the parameter

δ

to represent the weight of the distance between different classes in the proposed method. This parameter is manually set by the users. The distance between the different classes is emphasized when setting

δ

to a large value. In contrast, the sample size of set

I

is emphasized when setting

δ

to a small value. The value of the proposed evaluation function

S (I; δ)

approaches zero when the distance between different classes is large, and the sample size of

I

is small. Notably, the smaller the value of

S (I; δ)

, the more effective the subset

F^{'}

; hence, we define

\begin{matrix} \begin{matrix} L (F^{'}) = S (I; δ) \end{matrix} \end{matrix}

(12)

to evaluate

F^{'}

. This is the function $L$ that is minimized in Equation (4).

Note that when setting

δ = 0

, the proposed evaluation function and the original MRS [18] have the same form, owing to

\begin{matrix} \begin{matrix} S (I; δ = 0) = | I | . \end{matrix} \end{matrix}

(13)

In contrast, when setting

δ \to \infty

, the evaluation value is

\lim_{δ \to \infty} S (I; δ) = \begin{cases} 0, & 0 < C (I) \leq 1 \\ | I |, & C (I) = 0 \end{cases} .

(14)

In most cases,

S (I; δ \to \infty)

approaches zero because

C (I) = 0

(i.e., the average distance between classes

z_{0}

and

z_{1}

is zero) is not satisfied. This implies that when the distance weight

δ

is too large, the proposed evaluation function

S (I; δ)

does not perform well.
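The effect of the distance weight in Equations (11), (13), and (14) can be checked numerically with a short Python sketch. The function name `e2h_score` and the two average distances (0.05 and 0.30) are hypothetical values chosen for illustration; both subsets are assumed to share the same reference set size of six.

```python
def e2h_score(size, avg_distance, delta):
    """Equation (11): S(I; delta) = (1 - C(I))**delta * |I|."""
    if not 0.0 <= avg_distance <= 1.0:
        raise ValueError("C(I) must lie in [0, 1] (Theorem 1)")
    if delta < 0:
        raise ValueError("delta must be non-negative")
    return (1.0 - avg_distance) ** delta * size


# Two hypothetical feature subsets with the same |I| but different C(I).
near = {"size": 6, "avg_distance": 0.05}   # classes close to each other
apart = {"size": 6, "avg_distance": 0.30}  # classes far from each other

for delta in (0, 1, 5, 200):
    s_near = e2h_score(near["size"], near["avg_distance"], delta)
    s_apart = e2h_score(apart["size"], apart["avg_distance"], delta)
    print(f"delta={delta}: near={s_near:.4g}, apart={s_apart:.4g}")
```

With a distance weight of zero both subsets receive the same score, as in the original MRS. A moderate weight favors the subset with the larger class separation, and a very large weight drives both scores toward zero, which removes the ability to distinguish them.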

2.5. Bayesian Swap Feature Selection Algorithm

In the brute-force search method, the number of calculations required to solve the optimization problem presented in Equation (4), that is, the number of calculations required for identifying a feature subset that minimizes

L (F^{'})

, is

T_{all} (n, m) = {}_{n} C_{m},

(15)

where n denotes the size of

F

, and m is the size of

F^{'}

. Therefore, when n is large, obtaining an optimal solution is difficult from the perspective of the calculation cost. In the original MRS [18], an approach for finding an approximate solution is adopted. In particular, the first process randomly chooses m features from among all features

F

, and the second process gradually improves the evaluation value

L (F^{'})

by swapping the features. Additional details on this method are explained in [18]. The number of calculations required when using this method is

T_{fsa} (n, m) = m (n - m) .

(16)

Thus, this algorithm significantly reduces the calculation cost compared to the brute-force search. However, the final adopted feature subset depends on the initially selected feature subset. Therefore, we adopt an approach in which the initial feature subset is obtained via Bayesian optimization. The Bayesian optimization algorithm is the tree-structured Parzen estimator (TPE) algorithm [32], as provided in the optimization framework “optuna” (v2.0.0) [33].

A feature selection algorithm based on the described approach is outlined in Algorithm 2. The input values comprise the set of all features

F

, the dimension number m, and the number of iterations in the Bayesian optimization b. The output is an approximate solution

F_{opt .}^{*} .

We represent

L (F_{opt .}^{*}) ≃ L (F_{opt .}^{'})

because

L (F_{opt .}^{*})

is expected to be close to

L (F_{opt .}^{'})

, which is an evaluation of the optimal solution

F_{opt .}^{'} .

Algorithm 2 Bayesian swap feature subset selection algorithm (BS-FS)

Input: Feature set $F$, feature dimension m, iterations of the Bayesian optimization b
Output: Approximate solution of the feature subset $F_{opt .}^{*}$, i.e., $L (F_{opt .}^{*}) ≃ L (F_{opt .}^{'})$
1: for $t = 1$ to b do
2:     Bayesian selection (TPE) of m features $F_{t}^{'} \leftarrow \{f_{1}, f_{2}, \dots, f_{m}\} \subset F$
3:     Calculate $L (F_{t}^{'})$
4: end for
5: Solve $F_{opt .}^{*} \leftarrow \underset{F_{t}^{'}}{\operatorname{argmin}} \{L (F_{t}^{'}) ∣ t = 1, \dots, b\}$, where $F_{opt .}^{*} = \{f_{1}^{*}, f_{2}^{*}, \dots, f_{m}^{*}\}$
6: Obtain the difference set ${\bar{F}}_{opt .}^{*} \leftarrow F ∖ F_{opt .}^{*}$, where ${\bar{F}}_{opt .}^{*} = \{{\bar{f}}_{1}^{*}, {\bar{f}}_{2}^{*}, \dots, {\bar{f}}_{n - m}^{*}\}$
7: for $i = 1$ to m do
8:     for $j = 1$ to $n - m$ do
9:         Swap $f_{i}^{*}$ and ${\bar{f}}_{j}^{*}$, i.e., $F_{opt .}^{*, swap} \leftarrow (F_{opt .}^{*} ∖ \{f_{i}^{*}\}) \cup \{{\bar{f}}_{j}^{*}\}$, ${\bar{F}}_{opt .}^{*, swap} \leftarrow ({\bar{F}}_{opt .}^{*} ∖ \{{\bar{f}}_{j}^{*}\}) \cup \{f_{i}^{*}\}$
10:         if $L (F_{opt .}^{*, swap}) < L (F_{opt .}^{*})$ then
11:             Accept the swap, i.e., $F_{opt .}^{*} \leftarrow F_{opt .}^{*, swap}$, ${\bar{F}}_{opt .}^{*} \leftarrow {\bar{F}}_{opt .}^{*, swap}$
12:         end if
13:     end for
14: end for
15: return $F_{opt .}^{*}$

Note that lines 1–5 in Algorithm 2 detail the Bayesian optimization processes used for searching for the initial feature subset. We choose

F_{t}^{'} \subset F

and determine its evaluation value,

L (F_{t}^{'})

. Note that t denotes the iteration ID of the Bayesian optimization. The relevant Bayesian optimization processes are repeated b times, and the feature subset of the minimum evaluation value is selected as the initial subset

F_{opt .}^{*}

(line 5). Subsequently, the evaluation value is improved by swapping each feature in the initial subset

F_{opt .}^{*}

and the remaining subset

{\bar{F}}_{opt .}^{*} = F ∖ F_{opt .}^{*}

. These processes are indicated in lines 6–14 of Algorithm 2. We refer to this algorithm as the “Bayesian swap feature selection algorithm (BS-FS)” because this method is a combination of Bayesian optimization and feature swapping.
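The two-stage structure of Algorithm 2 can be sketched in Python as follows. To keep the sketch free of external dependencies, uniform random sampling stands in for the TPE sampler of lines 1–5; this is a deliberate simplification and not the method used in the paper. The names `bs_fs`, `toy_loss`, `TARGET`, and `ALL_FEATURES` are hypothetical, and the toy loss merely replaces $L (F^{'})$ for demonstration.

```python
import random


def bs_fs(features, m, b, evaluate, seed=0):
    """Two-stage search in the spirit of Algorithm 2.

    `evaluate` maps a frozenset of features to L(F'); smaller is better.
    ASSUMPTION: uniform random sampling replaces the TPE sampler (lines 1-5).
    """
    if b < 1:
        raise ValueError("b must be at least 1")
    rng = random.Random(seed)
    features = sorted(features)

    # Stage 1 (lines 1-5): evaluate b candidate subsets and keep the best one.
    best_val, selected = None, None
    for _ in range(b):
        candidate = rng.sample(features, m)
        val = evaluate(frozenset(candidate))
        if best_val is None or val < best_val:
            best_val, selected = val, candidate

    # Stage 2 (lines 6-14): try every swap between selected and remaining features.
    rest = [f for f in features if f not in selected]
    for i in range(m):
        for j in range(len(rest)):
            trial = selected.copy()
            trial[i] = rest[j]
            val = evaluate(frozenset(trial))
            if val < best_val:  # line 10: accept only strict improvements
                selected[i], rest[j] = rest[j], selected[i]
                best_val = val
    return set(selected), best_val


# Toy stand-in for L(F'): number of selected features outside a known target set.
TARGET = {"f1", "f2", "f3", "f4"}
ALL_FEATURES = [f"f{k}" for k in range(1, 11)]


def toy_loss(subset):
    return len(subset - TARGET)


subset, loss = bs_fs(ALL_FEATURES, m=4, b=5, evaluate=toy_loss)
print(sorted(subset), loss)
```

The sketch calls `evaluate` exactly b times in the first stage and m(n − m) times in the second stage, because the value of the best initial subset is reused rather than recomputed.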

Using Algorithm 2, the number of evaluation-function calculations in the BS-FS is

\begin{aligned} T_{bs} (n, m, b) & = b + m (n - m) \\ & = b + m n - m^{2} . \end{aligned}

(17)

In general,

n, b ≫ m

is satisfied as a parameter relationship. Therefore, the time complexity of Algorithm 2 is

O (b + n)

. This algorithm is fast compared to the brute-force search method. However, if the maximum number of iterations b is too large, the number of calculations for BS-FS is larger than that for the brute-force search method (i.e.,

T_{bs} (n, m, b) > T_{all} (n, m)

). The boundary point

b^{'}

is

b^{'} = {}_{n} C_{m} + m^{2} - m n \Leftrightarrow T_{bs} (n, m, b^{'}) = T_{all} (n, m) .

(18)

In other words, the maximum number of iterations for the Bayesian optimization, b, must be less than

b^{'}

.
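The evaluation counts in Equations (15) to (18) can be compared directly. The following Python sketch uses hypothetical values n = 30 and m = 4; the function names are illustrative only.

```python
from math import comb


def t_all(n, m):
    """Equation (15): evaluations needed by the brute-force search."""
    return comb(n, m)


def t_fsa(n, m):
    """Equation (16): evaluations needed by the swap search of the original MRS."""
    return m * (n - m)


def t_bs(n, m, b):
    """Equation (17): evaluations needed by BS-FS with b optimization iterations."""
    return b + m * (n - m)


def b_boundary(n, m):
    """Equation (18): value of b at which BS-FS matches the brute-force cost."""
    return comb(n, m) + m * m - m * n


n, m = 30, 4  # hypothetical problem size
print("brute force:", t_all(n, m))
print("swap only:  ", t_fsa(n, m))
print("BS-FS b=100:", t_bs(n, m, 100))
print("boundary b':", b_boundary(n, m))
```

For these values the boundary is far larger than any practical iteration budget, so BS-FS remains much cheaper than the brute-force search.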

3. Artificial Dataset for the Verification of the Proposed Methods

Further, we verified the effectiveness of the proposed methods using an artificial dataset. Note that the effective feature subset for classification is defined as

F^{Sol .} = \{f_{1}^{Sol ., r}, f_{2}^{Sol ., r}, f_{1}^{Sol ., c}, f_{2}^{Sol ., c}\},

(19)

where

{f_{1}^{Sol ., r}, f_{2}^{Sol ., r}} \subset F^{r}

, and

\{f_{1}^{Sol ., c}, f_{2}^{Sol ., c}\} \subset F^{c}

. In other words, the combination of two numeric features and two categorical features forms an effective feature subset for classification tasks. Let us denote the values of these features as

x_{1}

,

x_{2}

,

x_{3}

, and

x_{4}

. In particular, because

x_{3}

and

x_{4}

are categorical features,

x_{3} \in \{♣, ♠\}, \quad x_{4} \in \{♢, ♡\} .

(20)

Although each categorical feature is binary here for simplicity, a categorical feature can take any number of states because we adopted the Hamming distance $D^{H}$. In this study, we generated the feature vectors

x = {[x_{1} x_{2} x_{3} x_{4}]}^{⊤}

related to the feature subset

F^{Sol .}

based on the probability distribution. An overview of this is presented in Figure 2. The samples

x_{1}

and

x_{2}

of class

z_{0}

(blue circles) are generated based on a Gaussian distribution

N

, defined as

f_{z_{0}} (x_{1}, x_{2}; e^{c}) = N (u (e^{c}), v) .

(21)

To determine the effects of categorical features, the mean vector

u

is defined as follows:

u (e^{c}) = \begin{cases} [u_{1} \;\; u_{2}]^{\top}, & (x_{3}, x_{4}) = (♣, ♢) \\ [u_{1} + e^{c} \;\; u_{2}]^{\top}, & (x_{3}, x_{4}) = (♣, ♡) \\ [u_{1} \;\; u_{2} + e^{c}]^{\top}, & (x_{3}, x_{4}) = (♠, ♢) \\ [u_{1} - e^{c} \;\; u_{2} - e^{c}]^{\top}, & (x_{3}, x_{4}) = (♠, ♡) \end{cases} .

(22)

This implies that the average vector is shifted by

e^{c}

depending on the categorical features. Only in the case of

(x_{3}, x_{4}) = (♠, ♡)

, the average vector is shifted by

- e^{c}

.

The samples

x_{1}

,

x_{2}

of class

z_{1}

(red circles) are generated based on a Gaussian mixture distribution, defined as

f_{z_{1}} (x_{1}, x_{2}; e^{c}, e^{r}) = \frac{1}{3} \sum_{i = 1}^{3} N (u_{i} (e^{c}, e^{r}), v) .

(23)

To determine the effect of the categorical features, the mean vectors

u_{1}, u_{2}

, and

u_{3}

are defined as

u_{1} (e^{c}, e^{r}) = \begin{cases} [u_{1} + e^{r} \;\; u_{2}]^{\top}, & (x_{3}, x_{4}) = (♣, ♢) \\ [u_{1} + e^{c} + e^{r} \;\; u_{2}]^{\top}, & (x_{3}, x_{4}) = (♣, ♡) \\ [u_{1} + e^{r} \;\; u_{2} + e^{c}]^{\top}, & (x_{3}, x_{4}) = (♠, ♢) \\ [u_{1} - e^{c} + e^{r} \;\; u_{2} - e^{c}]^{\top}, & (x_{3}, x_{4}) = (♠, ♡) \end{cases},

(24)

u_{2} (e^{c}, e^{r}) = \begin{cases} [u_{1} \;\; u_{2} + e^{r}]^{\top}, & (x_{3}, x_{4}) = (♣, ♢) \\ [u_{1} + e^{c} \;\; u_{2} + e^{r}]^{\top}, & (x_{3}, x_{4}) = (♣, ♡) \\ [u_{1} \;\; u_{2} + e^{c} + e^{r}]^{\top}, & (x_{3}, x_{4}) = (♠, ♢) \\ [u_{1} - e^{c} \;\; u_{2} - e^{c} + e^{r}]^{\top}, & (x_{3}, x_{4}) = (♠, ♡) \end{cases},

(25)

u_{3} (e^{c}, e^{r}) = \begin{cases} [u_{1} + e^{r} \;\; u_{2} + e^{r}]^{\top}, & (x_{3}, x_{4}) = (♣, ♢) \\ [u_{1} + e^{c} + e^{r} \;\; u_{2} + e^{r}]^{\top}, & (x_{3}, x_{4}) = (♣, ♡) \\ [u_{1} + e^{r} \;\; u_{2} + e^{c} + e^{r}]^{\top}, & (x_{3}, x_{4}) = (♠, ♢) \\ [u_{1} - e^{c} + e^{r} \;\; u_{2} - e^{c} + e^{r}]^{\top}, & (x_{3}, x_{4}) = (♠, ♡) \end{cases} .

(26)

In other words, the samples of class

z_{1}

are shifted by

e^{r}

compared with the samples of class

z_{0}

. The variance–covariance matrix is defined as follows:

v = \begin{bmatrix} v & 0 \\ 0 & v \end{bmatrix} .

(27)

The distribution has the following parameters:

e^{r}

and

e^{c}

. For the artificial samples generated by the distribution based on large values of

e^{r}

, the classification of two classes using the numerical features

x_{1}

and

x_{2}

is simple because the distances between different classes are large. For artificial samples generated by the distribution based on large values of

e^{c}

, it is necessary to use the categorical features

x_{3}

and

x_{4}

for classification. Therefore,

e^{r}

represents a “numerical effect,” and

e^{c}

represents a “categorical effect.”

The generated feature spaces in four dimensions (

x_{1}, x_{2} \in R

,

x_{3} \in \{♣, ♠\}

,

x_{4} \in {♢, ♡}

) are presented in Figure 3: (A)

(e^{c}, e^{r}) = (10, 30)

, (B)

(e^{c}, e^{r}) = (10, 50)

, (C)

(e^{c}, e^{r}) = (30, 30)

, and (D)

(e^{c}, e^{r}) = (30, 50)

. The values of the categorical features

x_{3}

and

x_{4}

change from left to right. The rightmost figure depicts a scatter plot of only the numerical features

x_{1}

and

x_{2}

, that is, it does not consider the categorical features

x_{3}

and

x_{4}

. Comparing (A) and (B), we can establish that the distance between different classes increases for a large value of the numerical effect

e^{r}

. Moreover, even if we do not consider the categorical features

x_{3}

and

x_{4}

, we can classify the samples owing to the small value of the categorical effect

e^{c}

. In contrast, (C) and (D) represent spaces with large categorical effects

e^{c}

. In this case, we can observe that the categorical features

x_{3}

and

x_{4}

are required for correct classification. We can control the difficulty level of classification using the parameters

e^{r}

and

e^{c}

. Therefore, the method for the generation of artificial data described in this section is appropriate for verifying the proposed algorithms. Note that the generated values of the numerical features

x_{1}

and

x_{2}

are standardized from zero to one to satisfy Theorems 1 and 2.
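The generation procedure of Equations (21) to (27) can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements from the text: the base mean `u`, the variance `v`, uniform sampling of the categorical values, the string names used in place of the card-suit symbols, and the function name `generate`.

```python
import random

# Shift of the mean caused by the categorical values, in units of e_c (Equation (22)).
CAT_SHIFT = {
    ("club", "diamond"): (0, 0),
    ("club", "heart"): (1, 0),
    ("spade", "diamond"): (0, 1),
    ("spade", "heart"): (-1, -1),
}
# Shifts of the three mixture components of class z1, in units of e_r
# (Equations (24)-(26)); each component is drawn with probability 1/3.
NUM_SHIFT = [(1, 0), (0, 1), (1, 1)]


def generate(n0, n1, e_c, e_r, u=(0.0, 0.0), v=25.0, seed=0):
    """Return rows [x1, x2, x3, x4, label] with x1 and x2 scaled to [0, 1].

    ASSUMPTIONS: u, v, and the uniform choice of (x3, x4) are illustrative.
    """
    rng = random.Random(seed)
    sd = v ** 0.5  # v is a variance (Equation (27)), so the std. dev. is sqrt(v)
    rows = []
    for label, count in ((0, n0), (1, n1)):
        for _ in range(count):
            cats = rng.choice(sorted(CAT_SHIFT))
            sc = CAT_SHIFT[cats]
            sr = rng.choice(NUM_SHIFT) if label == 1 else (0, 0)
            x1 = rng.gauss(u[0] + sc[0] * e_c + sr[0] * e_r, sd)
            x2 = rng.gauss(u[1] + sc[1] * e_c + sr[1] * e_r, sd)
            rows.append([x1, x2, cats[0], cats[1], label])
    for col in (0, 1):  # min-max scaling so that Theorems 1 and 2 apply
        lo = min(r[col] for r in rows)
        hi = max(r[col] for r in rows)
        for r in rows:
            r[col] = (r[col] - lo) / (hi - lo)
    return rows


data = generate(n0=12, n1=12, e_c=20, e_r=30)
print(len(data), data[0])
```

Increasing `e_r` moves the class z1 components away from class z0 in the numerical plane, whereas increasing `e_c` makes the class locations depend more strongly on the categorical values.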

The numerical examples of feature spaces (A)–(D) calculated by Algorithm 1 and Equation (11) are provided in Table 1.

S (I; δ)

,

| I |

, and

{(1 - C (I))}^{δ}

represent the evaluation value, the size of the MRS, and the damping coefficient, respectively. Notably, the lower the value of

S (I; δ)

, the better the feature space is for classification. For Algorithm 1, we adopted

(γ, δ) \in {(0, 0), (1, 1), (1, 5)}

as the Hamming weight

γ

and distance weight

δ

. The parameters

γ

and

δ

are manually set by the users.

(γ, δ) = (0, 0)

indicates the original MRS, and

(γ, δ) = (1, 1)

and

(1, 5)

represent the proposed method E2H MRS. In the case of

(γ, δ) = (0, 0)

, although (B) and (D) are perceptually desirable feature spaces for classification, the best space based on evaluation value

S (I; δ)

is (A). The method did not determine (D) as the best feature space. This can be attributed to the Hamming weight

γ = 0

, i.e., the method did not consider the effect of the categorical feature values

x_{3}

and

x_{4}

. Similarly, in the case of

(γ, δ) = (0, 0)

, the score of (B) was worse than that of (A), which can be attributed to the distance weight

δ = 0

, i.e., it did not consider distance between different classes. In contrast, when adopting

(γ, δ) = (1, 1)

, the proposed method determined (B) and (D) as desirable feature spaces for classification because the effects of categorical features and the distance between different classes are considered. Moreover, when adopting

(γ, δ) = (1, 5)

, the effect of the damping coefficient on the evaluation value increased. Therefore, we consider the proposed method of E2H MRS to be better than original MRS method for evaluating features subset.
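The ranking discussed above can be checked directly against the values reported in Table 1. The sketch below assumes that the score is the product of the MRS size and the damping coefficient, which is how the three rightmost columns of Table 1 relate to each other; Equation (11) itself is not reproduced here, and the helper names `score` and `best_space` are hypothetical.

```python
# Rows copied from Table 1:
# (feature space, (gamma, delta), MRS size |I|, damping coefficient, reported score)
TABLE_1 = [
    ("A", (0, 0), 48, 1.000, 48.00),
    ("B", (0, 0), 56, 1.000, 56.00),
    ("C", (0, 0), 63, 1.000, 63.00),
    ("D", (0, 0), 67, 1.000, 67.00),
    ("A", (1, 1), 35, 0.983, 34.41),
    ("B", (1, 1), 26, 0.960, 24.95),
    ("C", (1, 1), 35, 0.993, 34.77),
    ("D", (1, 1), 27, 0.981, 26.48),
    ("A", (1, 5), 35, 0.844, 29.54),
    ("B", (1, 5), 26, 0.661, 17.20),
    ("C", (1, 5), 35, 0.935, 32.73),
    ("D", (1, 5), 27, 0.822, 22.20),
]


def score(mrs_size, damping):
    """Assumed form of the evaluation value: MRS size times damping coefficient."""
    return mrs_size * damping


def best_space(params):
    """Return the feature space with the lowest reported score for (gamma, delta)."""
    rows = [row for row in TABLE_1 if row[1] == params]
    return min(rows, key=lambda row: row[4])[0]


# The damping coefficients in the table are rounded to three decimals,
# so the product only matches the reported score up to a small gap.
max_gap = max(abs(score(size, damp) - rep) for _, _, size, damp, rep in TABLE_1)
print(max_gap < 0.02)

for params in [(0, 0), (1, 1), (1, 5)]:
    print(params, best_space(params))
```

With the reported values, the lowest score belongs to (A) for the original MRS and to (B) for both E2H MRS settings, in line with the discussion above.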

4. Experiment 1: Relationship between the Distance between Different Classes and the E2H MRS Evaluation

4.1. Objective and Outline

In the original MRS [18], the distance between different classes is not considered because the evaluation value of the feature subset is the sample size of the set $I$. Therefore, we propose a novel evaluation function $S (I; δ)$ that includes both the distance and the sample size. To verify its effectiveness, we generate a feature subset $F^{'}$ of size $m = 4$ comprising two numerical and two categorical features, and we calculate the evaluation value $S (I; δ) = L (F^{'})$.

Notably, we adopt $e^{r} \in \{20, 30, 40, 50\}$ and $e^{c} = 20$ as the parameters for generating artificial feature subsets. Moreover, $δ \in \{0, 1, 2, 3, 4\}$ is adopted for the sensitivity analysis of the distance weight. When $δ = 0$, the evaluation functions of the original MRS [18] and the E2H MRS have the same form. In other words, the results for $δ \geq 1$ represent the E2H MRS but not the original MRS. The numbers of generated samples are

$$(n_{z_{0}}, n_{z_{1}}) \in \{(12, 12), (24, 24), (48, 48), (96, 96), (192, 192), (384, 384)\}, \tag{28}$$

where $n_{z}$ represents the number of samples of class $z \in \{z_{0}, z_{1}\}$. As stated, all numerical features are standardized to the range from zero to one to satisfy Theorems 1 and 2. Moreover, we perform experiments using 100 random seeds to obtain stable results because the generated data depend on randomness.
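The experimental conditions listed above form a small grid, which can be enumerated as follows. This is only an illustrative sketch: the seed values 0 to 99 and the dictionary keys are assumptions, and the paper does not state how the runs were organized in code.

```python
from itertools import product

E_R = (20, 30, 40, 50)      # numerical effect e^r
E_C = 20                    # categorical effect e^c (fixed in Experiment 1)
DELTAS = (0, 1, 2, 3, 4)    # distance weight; 0 corresponds to the original MRS
SAMPLE_SIZES = tuple((n, n) for n in (12, 24, 48, 96, 192, 384))  # Equation (28)
SEEDS = range(100)          # assumed seed values; the paper only states "100 random seeds"


def experiment_1_grid():
    """Yield one configuration per (e^r, delta, sample size, seed) combination."""
    for e_r, delta, (n_z0, n_z1), seed in product(E_R, DELTAS, SAMPLE_SIZES, SEEDS):
        yield {"e_r": e_r, "e_c": E_C, "delta": delta,
               "n_z0": n_z0, "n_z1": n_z1, "seed": seed}


configs = list(experiment_1_grid())
print(len(configs))  # 4 * 5 * 6 * 100 = 12000
```

Averaging the evaluation value over the 100 seeds for each remaining combination would give one point per $(e^{r}, δ)$ pair and sample size.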

4.2. Result and Discussion

The results obtained are summarized in Figure 4. The vertical axis represents the average evaluation value $S (I; δ) = L (F^{'})$ over 100 seeds. The horizontal axis represents the numerical effect $e^{r}$. In other words, the greater the value of $e^{r}$, the greater the distance between different classes in the feature subset. The dashed line indicates the result of the original MRS ($δ = 0$), and the solid lines indicate the results of the E2H MRS ($δ \geq 1$).

The greater the distance between different classes, the more effective the feature subset for classification. Therefore, when $e^{r}$ is large, the evaluation value $L (F^{'})$ should ideally be small. From this viewpoint, the results of the original MRS ($δ = 0$) are deemed to be inappropriate when the sample size is greater than 48. This is because the original MRS cannot consider the distance between different classes. In contrast, in the case of the E2H MRS ($δ \geq 1$), the evaluation values are small when the distance between different classes is large. Therefore, the E2H MRS is effective in identifying feature subsets with large distances between different classes.

5. Experiment 2: Effectiveness of BS-FS in Finding Desirable Feature Subsets

5.1. Objective and Outline

In this section, we examine whether the combination of the E2H MRS (Algorithm 1) and BS-FS (Algorithm 2) can determine an effective feature subset for classification. To this end, the following features are generated:

$$\begin{aligned}
&F^{Sol.},\ | F^{Sol.} | = 4: \text{correct feature subset generated with } (e^{c}, e^{r}) = (40, 50), \\
&F^{Q.Sol.},\ | F^{Q.Sol.} | = 4: \text{quasi-correct feature subset generated with } (e^{c}, e^{r}) = (40 / 2, 50 / 2), \\
&F^{Bad},\ | F^{Bad} | = 7: \text{bad feature subset}, \\
&F = F^{Sol.} \cup F^{Q.Sol.} \cup F^{Bad},\ | F | = 15: \text{set of all features}.
\end{aligned} \tag{29}$$

Among these, $F^{Sol.}$ is the most effective feature subset, comprising two numerical and two categorical features (a total of four features) generated based on the probability distribution with $(e^{c}, e^{r}) = (40, 50)$. Further, $F^{Q.Sol.}$ is a quasi-correct feature subset consisting of two numerical and two categorical features (a total of four features) generated based on the distribution with $(e^{c}, e^{r}) = (40 / 2, 50 / 2)$. Next, $F^{Bad}$ consists of seven randomly generated features (the breakdown of categorical and numerical features is also random). Therefore, $F^{Bad}$ is not an effective feature subset for classification. Notably, the feature set $F$ is the union of these feature subsets, and the total number of features is $4 + 4 + 7 = 15$. We adopt the proposed algorithms E2H MRS and BS-FS to identify four effective features among all 15 features. Note that the total number of candidate solutions is ${}_{15} C_{4} = 1365$, as shown in Figure 5; that is, the chance of obtaining the optimal solution in one trial is $1 / {}_{15} C_{4} = 1 / 1365 \simeq 0.0733\%$.
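The size of the search space quoted above can be reproduced with a short calculation. The snippet below only restates the arithmetic of choosing 4 features out of 15; the variable names are illustrative.

```python
from math import comb

n_features = 15   # |F| = 4 + 4 + 7
subset_size = 4   # number of features to select

n_candidates = comb(n_features, subset_size)
chance_percent = 100 / n_candidates  # chance of one uniformly random guess, in percent

print(n_candidates)              # 1365
print(round(chance_percent, 4))  # 0.0733
```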

Although both $F^{Sol.}$ and $F^{Q.Sol.}$ are effective feature subsets for correct classification, $F^{Sol.}$ is better than $F^{Q.Sol.}$ owing to the adopted parameters $e^{c}$ and $e^{r}$. That is, $F^{Sol.}$ is the best solution, and $F^{Q.Sol.}$ is the second-best solution. In any case, identifying these subsets is a difficult problem because each of them is only one of the 1365 candidate feature subsets, as shown in Figure 5. When the number of generated samples is extremely small, the evaluation value of the best subset $F^{Sol.}$ may not be the minimum value owing to randomness. In this case, the proposed algorithms may identify the best subset $F^{Sol.}$ or the second-best subset $F^{Q.Sol.}$ depending on the number of samples generated. Therefore, we tested the various sample sizes $(n_{z_{0}}, n_{z_{1}})$ defined by Equation (28). Moreover, we adopted $b \in \{0, 100\}$ and $γ \in \{0.1, 1, 10\}$ to understand the effects of the Bayesian optimization and the Hamming weight on the evaluation results. The experiment was conducted using 100 random seeds to measure the correct detection rates of $F^{Sol.}$ and $F^{Q.Sol.}$.

5.2. Result and Discussion

The rates of correct detection of $F^{Sol.}$ and $F^{Q.Sol.}$ among a total of 1365 candidate solutions using the proposed algorithms are shown in Figure 6. The left-, center-, and right-side figures present the results for different Hamming weights. The top and bottom figures show the results without ($b = 0$) and with ($b = 100$) Bayesian optimization, respectively. The horizontal axis represents the number of generated samples, and the vertical axis represents the correct detection rate over 100 seeds.

First, when the Hamming weight is too small ($γ = 0.1$), the correct detection rate is also small compared with that for $γ = 1$ and $γ = 10$. The Hamming weight refers to the weight of categorical features (see Equation (7)). Therefore, for $γ = 0.1$, we consider that the detection rates decrease because the proposed method fails to detect the correct categorical features. From the results for $γ = 1$ and $γ = 10$, we can conclude that the correct detection rates improve for a large Hamming weight. However, because the results for $γ = 1$ and $γ = 10$ are almost the same, there may be an upper limit to its effectiveness.
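To illustrate how the Hamming weight changes the contribution of categorical features, the sketch below combines the squared Euclidean distance over numerical values with a $γ$-weighted Hamming distance over categorical values. Equation (7) is not reproduced in this section, so the plain additive form used here is an assumption for illustration only, and the function names and sample vectors are hypothetical.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two numerical vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def hamming(a, b):
    """Number of positions at which two categorical vectors differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def mixture_distance(num_a, cat_a, num_b, cat_b, gamma):
    """Assumed mixture distance: squared Euclidean plus gamma times Hamming."""
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return squared_euclidean(num_a, num_b) + gamma * hamming(cat_a, cat_b)


# Two hypothetical samples: close in the numerical features, different in both categories.
num_a, cat_a = [0.2, 0.4], ["a", "x"]
num_b, cat_b = [0.3, 0.4], ["b", "y"]

for gamma in (0, 0.1, 1, 10):
    print(gamma, round(mixture_distance(num_a, cat_a, num_b, cat_b, gamma), 4))
```

Under this assumed form, the two samples are almost indistinguishable for $γ = 0$ and are separated mainly by their categorical mismatch once $γ$ is 1 or larger, which is consistent with the qualitative behavior described above.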

Next, we discuss the effects of Bayesian optimization. When Bayesian optimization was not adopted (upper side in Figure 6), the detection rates of $F^{Sol.}$ and $F^{Q.Sol.}$ were almost the same. In contrast, when searching for the initial feature subset using Bayesian optimization (bottom side in Figure 6), the detection rate of $F^{Sol.}$ was approximately three times higher than that of $F^{Q.Sol.}$. For example, when the number of generated samples was 384, the detection rates of $F^{Sol.}$ and $F^{Q.Sol.}$ were approximately 60% and 20%, respectively. Therefore, searching for the initial feature subset using Bayesian optimization may be effective in identifying the best subset $F^{Sol.}$. Moreover, BS-FS is effective in detecting either $F^{Sol.}$ or $F^{Q.Sol.}$ because the total detection rate of $F^{Sol.}$ and $F^{Q.Sol.}$ increases.

However, for small sample sizes, the detection rates of $F^{Sol.}$ and $F^{Q.Sol.}$ are also small. When the sample size is too small, owing to random bias in the samples, the E2H MRS may classify some features belonging to the bad feature subset $F^{Bad}$ as effective. Therefore, when the sample size is too small, selecting only the two patterns of the correct solution $F^{Sol.}$ and the quasi-correct solution $F^{Q.Sol.}$ from a total of 1365 candidate patterns is difficult when using the E2H MRS and BS-FS. With real data, the number of collected samples is sometimes not large. In such cases, the selection of all the correct features becomes unrealistic. Therefore, it is important to check the number of correct features among the four features selected by the proposed method.

Further, we checked the detection rate of two or three correct features among the four features selected by the E2H MRS and BS-FS. Notably, the correct features are defined as features belonging to $F^{Sol.}$ or $F^{Q.Sol.}$. The corresponding results are presented in Figure 7. Here, (A) and (B) denote the detection rates obtained when the number of correct features is two and three or more, respectively. As can be observed, even when the sample size is small, some correct features are selected. Moreover, the detection rates increase when searching for the initial feature subset using Bayesian optimization. Therefore, we consider that the proposed methods, E2H MRS and BS-FS, are effective in identifying desirable feature subsets for classification tasks, even if the sample size is small.
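The detection rates discussed in this section can be tallied as sketched below. The feature names, the toy selections, and the function names are hypothetical; the sketch only follows the definitions given above, namely that a correct feature is any feature belonging to $F^{Sol.}$ or $F^{Q.Sol.}$ and that each seed yields one selected subset of four features.

```python
# Hypothetical feature names for the three generated subsets.
F_SOL = {"s1", "s2", "s3", "s4"}
F_QSOL = {"q1", "q2", "q3", "q4"}
F_BAD = {"b1", "b2", "b3", "b4", "b5", "b6", "b7"}
CORRECT = F_SOL | F_QSOL


def count_correct(selected):
    """Number of selected features that belong to F^Sol. or F^Q.Sol."""
    return len(set(selected) & CORRECT)


def detection_rates(selections):
    """Fractions of runs (one selection per seed) that fall into each category."""
    n_runs = len(selections)
    counts = [count_correct(s) for s in selections]
    return {
        "f_sol": sum(set(s) == F_SOL for s in selections) / n_runs,
        "f_qsol": sum(set(s) == F_QSOL for s in selections) / n_runs,
        "exactly_two": sum(c == 2 for c in counts) / n_runs,
        "three_or_more": sum(c >= 3 for c in counts) / n_runs,
    }


# Toy selections standing in for the outputs of four seeds.
toy_selections = [
    ["s1", "s2", "s3", "s4"],   # exactly F^Sol.
    ["s1", "q1", "b1", "b2"],   # two correct features
    ["s1", "s2", "q1", "b1"],   # three correct features
    ["b1", "b2", "b3", "s1"],   # one correct feature
]
rates = detection_rates(toy_selections)
print(rates)
```

In the actual experiment, `selections` would hold the 100 subsets returned for the 100 seeds of one setting.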

6. Conclusions

In this paper, we propose an improved form of the original MRS [18], which is a feature subset evaluation and selection algorithm. The improved algorithm is referred to as the E2H MRS. In particular, the E2H MRS (Algorithm 1) can evaluate feature subsets that mix numerical and categorical features and can consider the distance between different classes. Moreover, a subset selection algorithm with time complexity $O (b + n)$, referred to as BS-FS (Algorithm 2), is proposed. The proposed methods are validated using Experiments 1 and 2 based on artificial data.

In this study, we verified the effectiveness of the proposed methods, E2H MRS and BS-FS, by using sample sizes of several tens to hundreds. However, large datasets with several million samples have recently emerged, and we did not verify the effectiveness on such datasets. Moreover, we adopted 2:2 as the proportion of numerical to categorical features in the experiment described in Section 5. Cases with other proportions should also be verified. Therefore, we plan to perform these experiments in the future.

Author Contributions

Conceptualization, Y.O. and M.M.; methodology, Y.O. and M.M.; software, Y.O. and M.M.; validation, Y.O. and M.M.; formal analysis, Y.O. and M.M.; investigation, Y.O. and M.M.; resources, Y.O. and M.M.; data curation, Y.O. and M.M.; writing—original draft preparation, Y.O. and M.M.; writing—review and editing, Y.O. and M.M.; visualization, Y.O. and M.M.; supervision, Y.O.; project administration, Y.O.; funding acquisition, Y.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by JSPS Grant-in-Aid for Scientific Research (C) (Grant No. 21K04535), and JSPS Grant-in-Aid for Young Scientists (Grant No. 19K20062).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Variables and Their Meanings Table

Appendix A.1. Variables for Representing Problem Description

Variables	Meanings
$F$	The set of all features collected by users who want to find a desirable feature subset.
$F^{r}$	The set of all numerical features in $F$.
$F^{c}$	The set of all categorical features in $F$.
$n^{r}$	The size of $F^{r}$, i.e., $n^{r} = | F^{r} |$.
$n^{c}$	The size of $F^{c}$, i.e., $n^{c} = | F^{c} |$.
$n$	The size of $F$, i.e., $n = n^{r} + n^{c}$.
$f_{i}^{r}$	The $i$-th element of $F^{r}$, i.e., one of the numerical features.
$f_{i}^{c}$	The $i$-th element of $F^{c}$, i.e., one of the categorical features.
$F^{'}$	A feature subset of $F$.
$m$	The size of $F^{'}$.
$L (F^{'})$	The evaluation function for the feature subset $F^{'}$.
$F_{opt.}^{'}$	The optimal feature subset, i.e., the one leading to the minimum value of $L (F^{'})$.
$z$	Either class $z_{0}$ or $z_{1}$.
$x^{z}$	The feature vector of class $z \in \{z_{0}, z_{1}\}$.
$x^{z, r}$	The part of the feature vector $x^{z}$ that consists of numerical values.
$x^{z, c}$	The part of the feature vector $x^{z}$ that consists of categorical values.
$p^{r}$	The number of dimensions of $x^{z, r}$.
$p^{c}$	The number of dimensions of $x^{z, c}$.

Appendix A.2. Variables for Representing the Proposed Methods

Variables	Type ¹	Meanings
$D (x^{z_{0}}, x^{z_{1}}; γ)$	Calculation	The mixture distance between two feature vectors $x^{z_{0}}$ and $x^{z_{1}}$.
$D^{E 2} (x^{z_{0}, r}, x^{z_{1}, r})$	Calculation	The squared Euclidean distance between two numerical feature vectors $x^{z_{0}, r}$ and $x^{z_{1}, r}$.
$D^{H} (x^{z_{0}, c}, x^{z_{1}, c})$	Calculation	The Hamming distance between two categorical feature vectors $x^{z_{0}, c}$ and $x^{z_{1}, c}$.
$σ (x_{i}^{z_{0}, c}, x_{i}^{z_{1}, c})$	Calculation	The function for checking whether $x_{i}^{z_{0}, c}$ and $x_{i}^{z_{1}, c}$ are the same. If they are the same, it outputs 0; if not, it outputs 1. The function is used for the Hamming distance $D^{H} (x^{z_{0}, c}, x^{z_{1}, c})$. Note that $x_{i}^{z_{0}, c}$ and $x_{i}^{z_{1}, c}$ are the $i$-th elements of the categorical feature vectors $x^{z_{0}, c}$ and $x^{z_{1}, c}$, respectively.
$γ$	Manually	The weight of the Hamming distance $D^{H} (x^{z_{0}, c}, x^{z_{1}, c})$. When users hypothesize that categorical features are important for classification, they set a large value. When users set $γ = 0$, the effect of categorical features on the distance disappears. The range is $γ \geq 0$.
$I$	Calculation	The minimum reference set (MRS) leading to the correct classification (no error) of all samples by using the feature subset $F^{'}$. The MRS was proposed in the original study [18].
$C (I)$	Calculation	The average distance between different classes of the set $I$. Appears in Algorithm 1.
$S (I; δ)$	Calculation	The evaluation function of the feature subset $F^{'}$, which considers both the MRS size $| I |$ and the distance $C (I)$. The lower the value, the better the feature space is for classification. This is equivalent to $L (F^{'})$.
$δ$	Manually	The effect of the distance between different classes on the evaluation function. This parameter is manually set by the users. When they want to emphasize the distance between different classes over the MRS size, they set a large value. The range is $δ \geq 0$.
$b$	Manually	The number of iterations of the Bayesian optimization. Appears in Algorithm 2. This parameter is manually set by the users. When they want to improve the accuracy of the obtained solution, they set a large value. The computational cost is highly dependent on this value.
$F_{opt.}^{*}$	Calculation	The feature subset for classification obtained as the solution of Algorithm 2. The solution's evaluation $L (F_{opt.}^{*})$ is expected to be close to the optimal solution's evaluation $L (F_{opt.}^{'})$.

¹ "Manually" means that the users of the proposed methods need to set a value. "Calculation" means that the values are automatically calculated.

References

1. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28.
2. Gopika, N.; Kowshalaya, M. Correlation Based Feature Selection Algorithm for Machine Learning. In Proceedings of the 3rd International Conference on Communication and Electronics Systems, Coimbatore, Tamil Nadu, India, 15–16 October 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 692–695.
3. Yao, R.; Li, J.; Hui, M.; Bai, L.; Wu, Q. Feature Selection Based on Random Forest for Partial Discharges Characteristic Set. IEEE Access 2020, 8, 159151–159161.
4. Yun, C.; Yang, J. Experimental comparison of feature subset selection methods. In Proceedings of the Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), Omaha, NE, USA, 28–31 October 2007; pp. 367–372.
5. Lin, W.C. Experimental Study of Information Measure and Inter-Intra Class Distance Ratios on Feature Selection and Orderings. IEEE Trans. Syst. Man Cybern. 1973, 3, 172–181.
6. Huang, C.L.; Wang, C.J. A GA-based feature selection and parameters optimization for support vector machines. Expert Syst. Appl. 2006, 31, 231–240.
7. Stefano, C.D.; Fontanella, F.; Marrocco, C.; Freca, A.S.D. A GA-based feature selection approach with an application to handwritten character recognition. Pattern Recognit. Lett. 2014, 35, 130–141.
8. Dahiya, S.; Handa, S.S.; Singh, N.P. A feature selection enabled hybrid-bagging algorithm for credit risk evaluation. Expert Syst. 2017, 34, e12217.
9. Li, G.Z.; Meng, H.H.; Lu, W.C.; Yang, J.Y.; Yang, M.Q. Asymmetric bagging and feature selection for activities prediction of drug molecules. BMC Bioinform. 2008, 9, S7.
10. Loh, W.Y. Fifty Years of Classification and Regression Trees. Int. Stat. Rev. 2014, 82, 329–348.
11. Loh, W.Y. Classification and regression trees. Data Min. Knowl. Discov. 2011, 1, 14–23.
12. Roth, V. The generalized LASSO. IEEE Trans. Neural Networks 2004, 15, 16–28.
13. Osborne, M.R.; Presnell, B.; Turlach, B.A. On the LASSO and its Dual. J. Comput. Graph. Stat. 2000, 9, 319–337.
14. Bach, F.R. Bolasso: Model Consistent Lasso Estimation through the Bootstrap. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008.
15. Palma-Mendoza, R.J.; Rodriguez, D.; de Marcos, L. Distributed ReliefF-based feature selection in Spark. Knowl. Inf. Syst. 2018, 57, 1–20.
16. Huang, Y.; McCullagh, P.J.; Black, N.D. An optimization of ReliefF for classification in large datasets. Data Knowl. Eng. 2009, 68, 1348–1356.
17. Too, J.; Abdullah, A.R. Binary atom search optimisation approaches for feature selection. Connect. Sci. 2020, 32, 406–430.
18. Chen, X.W.; Jeong, J.C. Minimum reference set based feature selection for small sample classifications. ACM Int. Conf. Proc. Ser. 2007, 227, 153–160.
19. Mori, M.; Omae, Y.; Akiduki, T.; Takahashi, H. Consideration of Human Motion’s Individual Differences-Based Feature Space Evaluation Function for Anomaly Detection. Int. J. Innov. Comput. Inf. Control. 2019, 15, 783–791.
20. Zhao, Y.; He, L.; Xie, Q.; Li, G.; Liu, B.; Wang, J.; Zhang, X.; Zhang, X.; Luo, L.; Li, K.; et al. A Novel Classification Method for Syndrome Differentiation of Patients with AIDS. Evid.-Based Complement. Altern. Med. 2015, 2015, 936290.
21. Mori, M.; Flores, R.G.; Suzuki, Y.; Nukazawa, K.; Hiraoka, T.; Nonaka, H. Prediction of Microcystis Occurrences and Analysis Using Machine Learning in High-Dimension, Low-Sample-Size and Imbalanced Water Quality Data. Harmful Algae 2022, 117, 102273.
22. Zhao, Y.; Zhao, Y.; Zhu, Z.; Pan, J.S. MRS-MIL: Minimum reference set based multiple instance learning for automatic image annotation. In Proceedings of the International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; pp. 2160–2163.
23. Cerda, P.; Varoquaux, G. Encoding High-Cardinality String Categorical Variables. IEEE Trans. Knowl. Data Eng. 2022, 34, 1164–1176.
24. Beliakov, G.; Li, G. Improving the speed and stability of the k-nearest neighbors method. Pattern Recognit. Lett. 2012, 33, 1296–1301.
25. Bentley, J.L. Multidimensional binary search trees used for associative searching. Commun. ACM 1975, 18, 509–517.
26. Ram, P.; Sinha, K. Revisiting kd-tree for nearest neighbor search. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1378–1388.
27. Ekinci, E.; Omurca, S.I.; Acun, N. A comparative study on machine learning techniques using Titanic dataset. In Proceedings of the 7th International Conference on Advanced Technologies, Hammamet, Tunisia, 26–28 December 2018; pp. 411–416.
28. Kakde, Y.; Agrawal, S. Predicting survival on Titanic by applying exploratory data analytics and machine learning techniques. Int. J. Comput. Appl. 2018, 179, 32–38.
29. Huang, Z. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Min. Knowl. Discov. 1998, 2, 283–304.
30. Wen, T.; Zhang, Z. Effective and extensible feature extraction method using genetic algorithm-based frequency-domain feature search for epileptic EEG multiclassification. Medicine 2017, 96.
31. Song, J.; Zhu, A.; Tu, Y.; Wang, Y.; Arif, M.A.; Shen, H.; Shen, Z.; Zhang, X.; Cao, G. Human Body Mixed Motion Pattern Recognition Method Based on Multi-Source Feature Parameter Fusion. Sensors 2020, 20, 537.
32. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. Adv. Neural Inf. Process. Syst. 2011, 42.
33. Optuna: A Hyperparameter Optimization Framework. Available online: https://optuna.readthedocs.io/en/stable/ (accessed on 1 November 2022).

Figure 1. Issues related to the original minimum reference set (MRS) [18] feature selection algorithm.

Figure 2. Overview of the probability distributions defined by Equations (21) and (23) for generating artificial data. The blue and red circles represent the distributions used for generating samples belonging to $z_{0}$ and $z_{1}$, respectively. The number of blue circles in each figure is one because samples belonging to class $z_{0}$ are generated by a Gaussian distribution. The number of red circles is three because the samples of class $z_{1}$ are generated based on a Gaussian mixture distribution. The radius of each circle represents the standard deviation. These figures indicate that the average values of these distributions are changed by $x_{3}$ and $x_{4}$, the values of the categorical features. Therefore, to correctly classify classes $z_{0}$ and $z_{1}$, categorical features should be used. We used artificial samples generated by these distributions to verify the effectiveness of the proposed methods E2H MRS and BS-FS. The results are described in Sections 4 and 5.

Figure 3. Artificial data generated by using Equations (21) and (23), and Figure 2. The cases (A)–(D) vary in the parameters $e^{c}$ and $e^{r}$. The values of the categorical features $x_{3}$ and $x_{4}$ change from left to right. The rightmost figure presents an explicit scatter plot of the numerical features $x_{1}, x_{2}$, i.e., the categorical features $x_{3}, x_{4}$ are not considered.

Figure 4. Effect of the distance weight $δ$ on the evaluation $L (F^{'})$.

Figure 5. The generated feature subsets for Experiment 2 and the candidate solutions.

Figure 6. Detection rates of $F^{Sol.}$ and $F^{Q.Sol.}$ for the E2H MRS and BS-FS (Bayesian optimization iterations $b \in \{0, 100\}$ and Hamming weight $γ \in \{0.1, 1, 10\}$).

Figure 7. Detection rate of two or three correct features (Hamming weight $γ = 1$ and Bayesian optimization iterations $b \in \{0, 100\}$).

Table 1. Evaluation scores of the feature spaces (A)–(D) shown in Figure 3. $(γ, δ) = (0, 0)$ represents the original MRS, and $(γ, δ) = (1, 1)$ and $(1, 5)$ represent the E2H MRS. Note that the total sample size in each feature space is 120 (class $z_{0}$: 60, class $z_{1}$: 60).

Feature Space $(e^{c}, e^{r})$	Setting Parameters $(γ, δ)$ ¹	MRS Size $| I |$	Damping Coefficient ${(1 - C (I))}^{δ}$	Score $S (I; δ)$ ²
(A) $(10, 30)$	(0, 0)	48	1.000	48.00
(B) $(10, 50)$	(0, 0)	56	1.000	56.00
(C) $(30, 30)$	(0, 0)	63	1.000	63.00
(D) $(30, 50)$	(0, 0)	67	1.000	67.00
(A) $(10, 30)$	(1, 1)	35	0.983	34.41
(B) $(10, 50)$	(1, 1)	26	0.960	24.95
(C) $(30, 30)$	(1, 1)	35	0.993	34.77
(D) $(30, 50)$	(1, 1)	27	0.981	26.48
(A) $(10, 30)$	(1, 5)	35	0.844	29.54
(B) $(10, 50)$	(1, 5)	26	0.661	17.20
(C) $(30, 30)$	(1, 5)	35	0.935	32.73
(D) $(30, 50)$	(1, 5)	27	0.822	22.20

¹ $γ$: Hamming weight, $δ$: distance weight. ² The lower the value of $S (I; δ)$, the better the feature space is for classification.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
