B-MFO: A Binary Moth-Flame Optimization for Feature Selection from Medical Datasets

Abstract: Advancements in medical technology have created numerous large datasets including many features. Usually, not all captured features are necessary; there are redundant and irrelevant features, which reduce the performance of algorithms. To tackle this challenge, many metaheuristic algorithms are used to select effective features. However, most of them are not effective and scalable enough to select effective features from large medical datasets as well as small ones. Therefore, in this paper, a binary moth-flame optimization (B-MFO) is proposed to select effective features from small and large medical datasets. Three categories of B-MFO were developed using S-shaped, V-shaped, and U-shaped transfer functions to convert the canonical MFO from continuous to binary. These categories of B-MFO were evaluated on seven medical datasets, and the results were compared with four well-known binary metaheuristic optimization algorithms: BPSO, bGWO, BDA, and BSSA. In addition, the convergence behavior of B-MFO and the comparative algorithms was assessed, and the results were statistically analyzed using the Friedman test. The experimental results demonstrate the superior performance of B-MFO in solving the feature selection problem for different medical datasets compared to the other comparative algorithms.


Introduction
Nowadays, with advances in science and medical technology, numerous large medical datasets including many features have been created, which also contain redundant and irrelevant features. Data-driven decision making for high-risk diseases such as heart disease [1] is a significant trend in which many data mining and machine learning methods are introduced [2]. Since medical data are obtained from multiple sources, not all captured features are necessary, and some of them are irrelevant and redundant, which may reduce the performance of algorithms in data-driven decision-making software. Irrelevant and redundant features can be removed because they do not improve classification accuracy: irrelevant features have a weak correlation with the class, and redundant features have a strong correlation with one or more other features. For instance, the FSBRR algorithm [3] removes the radius feature in the Breast Cancer Wisconsin Dataset as a redundant feature because its correlation with the smoothness feature is very high. Feature selection can address this problem by finding a subset of relevant and effective features. Feature selection is used in a variety of real-world applications such as disease diagnosis [4,5], email spam detection [6], text clustering [7,8], and human activity recognition [9].
Based on the strategy used for selecting features, feature selection algorithms can be categorized into three methods [10]: filter-based, wrapper-based, and hybrid methods. The filter-based methods analyze features based on intrinsic properties of the data, without using a classification algorithm [11], whereas wrapper-based methods use classifiers to assess candidate solutions [12]. Hybrid methods combine the benefits of both filter-based and wrapper-based approaches. Although the wrapper-based methods are computationally expensive and their performance depends on the learning algorithm used, they are usually more accurate than the other two categories [13]. The wrapper-based methods use different search approaches such as exhaustive, random, greedy, heuristic, and metaheuristic search [14]; except for the last one, these approaches are impractical for selecting effective features from medium and large datasets [15]. Thus, a wide range of metaheuristic optimization algorithms has been proposed to solve feature selection problems for applications with large datasets such as medicine [16].
Metaheuristic optimization algorithms are mostly inspired by nature and can be classified into three categories: evolutionary, physics-based, and swarm intelligence. The simplicity of development and sufficient results of swarm intelligence (SI) algorithms for a variety of problems have made some of them very attractive and popular, such as particle swarm optimization (PSO) [17], grey wolf optimizer (GWO) [18], whale optimization algorithm (WOA) [19], moth-flame optimization (MFO) [20], and aquila optimizer (AO) [21]. SI algorithms mimic the collective behavior of insects, aquatic animals, terrestrial animals, and birds; the information shared among swarm members enhances their robustness [22]. Despite their benefits, these algorithms may suffer from local optima trapping, premature convergence, and an unbalanced search strategy [23,24]. Hence, many improvements of these algorithms have been introduced thus far, including optimal control strategies [25], chaotic whale optimization algorithm (CWOA) [26], grasshopper optimization algorithm (GOA) based on opposition-based learning (OBLGOA) [27], improved grey wolf optimizer (I-GWO) [28], disruption barebones particle swarm optimization (DBPSO) [29], improved krill herd (IKH) [30], particle swarm optimization with backtracking search optimization algorithm (PSOBSA) [31], and representative-based grey wolf optimizer (R-GWO) [32].
Since feature selection is an NP-complete problem [33], SI algorithms are widely used to solve it. Many researchers have adapted different SI algorithms from continuous to binary form, such as the wrapper-based binary sine cosine algorithm (WBSCA) [34], binary grasshopper optimization algorithm (BGOA) [35], binary butterfly algorithm (BBA) [36], efficient binary symbiotic organisms search (EBSOS) [37], binary grey wolf optimizer with support vector machine (GWOSVM) [38], and island binary moth-flame optimization (IsBMFO) [39]. However, most binary SI algorithms are not effective or scalable enough to select effective features from large datasets as well as small ones.
Therefore, in this study, a binary moth-flame optimization (B-MFO) is proposed to solve the feature selection problem. The canonical MFO was introduced by Mirjalili [20] and was inspired by the transverse orientation mechanism of moths navigating at night around artificial lights. Given its simplicity, known weaknesses, and wide range of applications, many variants of MFO have been developed, such as EMFO [40], IMFO [41], CLSGMFO [42], and improved MFO [43]. Several researchers have applied S-shaped and V-shaped transfer functions to convert the continuous MFO to binary. In this study, in addition to the S-shaped and V-shaped transfer functions, a U-shaped transfer function was adapted, which is a novel transfer function with multiple alterable parameters for solving feature selection problems. Each category contains four versions of transfer functions; therefore, twelve versions of B-MFO were introduced in three categories of transfer functions. Then, they were evaluated on seven medical datasets: Pima, Lymphography, Breast-WBDC, PenglungEw, Parkinson, Colon, and Leukemia. In addition, the winning versions of B-MFO were compared with the best results gained by four well-known binary metaheuristic optimization algorithms: BPSO [44], bGWO [45], BDA [46], and BSSA [47]. The convergence behavior of the winning versions of B-MFO and the comparative algorithms was evaluated and visualized. Finally, the results were statistically analyzed by the Friedman test.
In the rest of this study, Section 2 discusses the related works. Section 3 describes the canonical MFO algorithm. Then, the proposed B-MFO is presented and evaluated in Sections 4 and 5. Finally, the conclusion and future works are explained in Section 6.

Related Work
There are many different discrete problems such as feature selection [48,49], tour planning [50], complex systems [51], and traveling salesman problems [52] that must be solved with discrete optimization algorithms [53]. To solve feature selection problems, wrapper-based methods widely apply discrete metaheuristic optimization algorithms as search strategies to find effective feature subsets [47,54-57]. Since the majority of metaheuristic optimization algorithms such as DA [58], SSA [59], HGSO [60], FFA [61], MTDE [62], QANA [63], and AO [21] were proposed to solve continuous problems such as engineering [64-68], cloud computing [69], and rail-car fleet sizing [70], they must be converted into binary algorithms for use in wrapper-based methods and for solving discrete problems. A continuous algorithm can be converted to a binary form in a variety of ways [71]. JayaX [72] and BitABC [73] use logical operators for this conversion. Another way is to use a transfer function (TF), which converts the continuous search space to a binary one in which the search agents can shift to nearer or farther corners of a hypercube by flipping various numbers of bits [44]. Thus, transfer functions apply a mapping to obtain the probability of changing a solution element from 0 to 1 or vice versa.
Many transfer functions, such as S-shaped [44,74], V-shaped [74,75], and U-shaped [76], have been introduced to convert continuous metaheuristic optimization algorithms to binary ones. The binary particle swarm optimization (BPSO) [44] was introduced by Kennedy and Eberhart; it applies a sigmoid function and has been used to solve various discrete optimization problems [77-79]. Yuan et al. [80] proposed an improved binary PSO (IBPSO) in which BPSO is combined with the lambda-iteration method to solve the unit commitment problem. BPSO has been applied to various problems such as text clustering [81,82], text feature selection [83], and disease diagnosis [84-86].
Binary grey wolf optimizer (bGWO) is another wrapper method for feature selection, proposed by Emary et al. [45]. The binary version of GWO uses the sigmoid transfer function and has been utilized to solve feature selection, large-scale unit commitment [87,88], and text classification [89] problems. To enhance solution quality, Hu et al. [90] introduced new transfer functions and improved them based on an analysis of GWO's parameters. Al-Tashi et al. [87] proposed a new hybrid optimization algorithm named BPSOGWO to find the best feature subset.
Zamani et al. [91] proposed a new metaheuristic algorithm named feature selection based on whale optimization algorithm (FSWOA) to reduce the dimensionality of medical datasets. Hussien et al. proposed two binary variants of WOA (bWOA) [92,93] based on V-shaped and S-shaped transfer functions for dimensionality reduction and classification problems. The binary WOA (BWOA) [94] was suggested by Reddy et al. for solving the PBUC problem; it maps the continuous WOA to binary through various transfer functions.
The binary dragonfly algorithm (BDA) [95] was proposed by Mafarja et al. to solve discrete problems. Sawhney et al. proposed the BDFA [96], which incorporates a penalty function for optimal feature selection. Although BDA has good exploitation ability, it suffers from being trapped in local optima. Thus, a wrapper-based approach named hyper learning binary dragonfly algorithm (HLBDA) [97] was developed by Too et al. to solve the feature selection problem. HLBDA uses a hyper learning strategy to learn from the personal and global best solutions during the search process.
Faris et al. employed the binary salp swarm algorithm (BSSA) [47] in a wrapper feature selection method. Ibrahim et al. proposed a hybrid optimization method for the feature selection problem that combines the salp swarm algorithm with particle swarm optimization (SSAPSO) [98]. The chaotic binary salp swarm algorithm (CBSSA) [99] was introduced by Meraihi et al. to solve the graph coloring problem. CBSSA applies a logistic map to replace the random variables used in SSA, which helps it avoid stagnation in local optima and improves exploration and exploitation. A time-varying hierarchical BSSA (TVBSSA) was proposed in [15] by Faris et al. to design an improved wrapper feature selection method combined with the RWN classifier.

The Canonical Moth-Flame Optimization
Moth-flame optimization (MFO) [20] is a nature-inspired algorithm that imitates the transverse orientation mechanism of moths at night around artificial lights. Moths use this mechanism for navigation: they fly in a straight line by maintaining a constant angle with the light source. MFO's mathematical model assumes that the moths' positions in the search space correspond to candidate solutions, which are represented in a matrix, and the corresponding fitness of the moths is stored in an array. In addition, a flame matrix stores the best positions obtained by the moths so far, and an array indicates the corresponding fitness of these best positions. To find the best result, moths search around their corresponding flames and update their positions; therefore, moths never lose their best positions. Equation (1) shows the position updating of each moth relative to its corresponding flame:

$$M_i = S(M_i, F_j) \quad (1)$$
where S is the spiral function, and $M_i$ and $F_j$ represent the i-th moth and the j-th flame, respectively. The main update mechanism is a logarithmic spiral, which is defined by Equation (2):

$$S(M_i, F_j) = D_i \cdot e^{bt} \cdot \cos(2\pi t) + F_j \quad (2)$$

where $D_i$ is the distance between the i-th moth and the j-th flame, which is computed by Equation (3), and b is a constant that defines the shape of the logarithmic spiral:

$$D_i = |F_j - M_i| \quad (3)$$

The parameter t is a random number in the range [r, 1], in which r is a convergence factor that linearly decreases from −1 to −2 over the course of iterations.
To avoid trapping in local optima, each moth updates its position using only one flame. In each iteration, the list of flames is updated and sorted based on their fitness values. The first moth updates its position according to the best flame, and the last moth updates its position according to the worst flame. In addition, to increase the exploitation of the most promising solutions, the number of flames is reduced over the course of iterations by an adaptive mechanism, which is shown in Equation (4):

$$\text{flame number} = \text{round}\left(N - iter \times \frac{N - 1}{MaxIter}\right) \quad (4)$$

where N indicates the maximum number of moths, and iter and MaxIter are the current and maximum number of iterations, respectively.
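To make the update concrete, the following Python sketch implements Equations (1)-(4) under the definitions above; the function and variable names are illustrative assumptions and are not taken from the authors' MATLAB implementation.

```python
import numpy as np

def mfo_update(moths, flames, iteration, max_iter, b=1.0):
    """One canonical MFO position update (a minimal sketch of Equations (1)-(4))."""
    n, dim = moths.shape
    # Equation (4): adaptively reduce the number of flames over the iterations.
    flame_no = int(round(n - iteration * (n - 1) / max_iter))
    # Convergence factor r decreases linearly from -1 to -2.
    r = -1.0 - iteration / max_iter
    new_moths = np.empty_like(moths)
    for i in range(n):
        # Each moth spirals around its assigned flame; surplus moths share the last kept flame.
        j = i if i < flame_no else flame_no - 1
        d = np.abs(flames[j] - moths[i])            # Equation (3): distance to the flame
        t = (r - 1.0) * np.random.rand(dim) + 1.0   # random t in [r, 1]
        # Equation (2): logarithmic spiral around the flame.
        new_moths[i] = d * np.exp(b * t) * np.cos(2.0 * np.pi * t) + flames[j]
    return new_moths
```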

Binary Moth-Flame Optimization (B-MFO)
In this study, three categories of S-shaped, V-shaped, and U-shaped transfer functions are applied to convert the MFO algorithm from continuous to binary for solving the feature selection problem. First, in Section 4.1, these categories of transfer functions and how they are applied to develop the different variants of B-MFO are described in detail, accompanied by the flowchart and pseudo-code of B-MFO. Then, in Section 4.2, solving the feature selection problem using B-MFO is explained.

B-MFO Using S-Shaped Transfer Function
The sigmoid (S-shaped) function shown in Equation (5) is a common transfer function named $S_2$ [100], which was originally introduced for developing the binary PSO (BPSO) [44]:

$$TF_s(v_i^d(t)) = \frac{1}{1 + e^{-v_i^d(t)}} \quad (5)$$
where $v_i^d(t)$ is the i-th search agent's velocity in dimension d at iteration t. The $TF_s$ converts the velocity to a probability value, and the next position $x_i^d(t+1)$ is obtained from this probability as given in Equation (6), where r is a random value between 0 and 1:

$$x_i^d(t+1) = \begin{cases} 1 & \text{if } r < TF_s(v_i^d(t+1)) \\ 0 & \text{otherwise} \end{cases} \quad (6)$$
According to Equation (6), the position update of each search agent is computed from the current velocity and the previous position. In some binary metaheuristic algorithms such as BPSO [44] and BGSA [101], the velocity is used in the transfer function to calculate the probability of changing the position. In some other algorithms such as bGWO [45] and BMFO [102], the transfer function applies the updated position of each search agent to calculate this probability. In addition to the $S_2$ function introduced in Equation (5), three variants of the S-shaped function named $S_1$, $S_3$, and $S_4$ [74] are obtained by manipulating the coefficient of the velocity in Equation (5). All variants of the S-shaped transfer function are shown in Table 1 and visualized in Figure 1; as the slope of the S-shaped transfer function increases, the probability of changing the position value increases. Thus, among the S-shaped functions, $S_1$ yields the highest probability and $S_4$ the lowest for the same velocity, which affects the position updating of the search agents and the search for the optimum solution. Despite its advantages, this category of transfer functions has a shortcoming in metaheuristic algorithms whose search agents are updated based on their velocity: a zero velocity is converted to one or zero with a probability of 0.5, whereas a search agent with zero velocity should not move [103]. Several researchers have tried to resolve this shortcoming, but trapping in local optima could not be avoided.

Table 1. The variants of S-shaped, V-shaped, and U-shaped transfer functions.

No.   S-Shaped Transfer Function        V-Shaped Transfer Function           U-Shaped Transfer Function
1     S1(v) = 1/(1 + e^(-2v))           V1(v) = |erf((sqrt(π)/2)·v)|         U1(v) = α·|v|^β
2     S2(v) = 1/(1 + e^(-v))            V2(v) = |tanh(v)|                    U2(v) = α·|v|^β
3     S3(v) = 1/(1 + e^(-v/2))          V3(v) = |v/sqrt(1 + v^2)|            U3(v) = α·|v|^β
4     S4(v) = 1/(1 + e^(-v/3))          V4(v) = |(2/π)·arctan((π/2)·v)|      U4(v) = α·|v|^β
Each U-shaped variant uses a different value of the control parameter β [76].
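As an illustration of the S-shaped binarization described above, the following Python sketch implements the four S-shaped variants of Table 1 and the stochastic update of Equation (6); it is a hypothetical example rather than the authors' code.

```python
import numpy as np

# The four S-shaped transfer functions (Table 1); S2 is Equation (5).
S_SHAPED = {
    "S1": lambda v: 1.0 / (1.0 + np.exp(-2.0 * v)),
    "S2": lambda v: 1.0 / (1.0 + np.exp(-v)),
    "S3": lambda v: 1.0 / (1.0 + np.exp(-v / 2.0)),
    "S4": lambda v: 1.0 / (1.0 + np.exp(-v / 3.0)),
}

def s_shaped_update(v, tf="S2"):
    """Map a continuous value v to a binary position (Equation (6))."""
    prob = S_SHAPED[tf](np.asarray(v, dtype=float))
    r = np.random.rand(*prob.shape)
    return (r < prob).astype(int)   # 1 with probability TF_s(v), 0 otherwise
```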

B-MFO Using V-Shaped Transfer Function
The hyperbolic (V-shaped) function [104] named $V_2$, shown in Equation (7), was presented for developing BGSA [101], which uses a new position updating rule:

$$TF_v(v_i^d(t)) = |\tanh(v_i^d(t))| \quad (7)$$

where $v_i^d(t)$ is the i-th search agent's velocity in dimension d at iteration t. Since the V-shaped function differs from the S-shaped function, the position is updated with the new rule shown in Equation (8):

$$x_i^d(t+1) = \begin{cases} \neg x_i^d(t) & \text{if } r < TF_v(v_i^d(t+1)) \\ x_i^d(t) & \text{otherwise} \end{cases} \quad (8)$$

where $x_i^d(t)$ indicates the i-th search agent's position in dimension d at iteration t, and $\neg x_i^d(t)$ represents its complement. In addition, r is a random value between 0 and 1. If the velocity is low, $TF_v$ encourages the search agents to stay in their current positions; otherwise, if the velocity is high, the search agents switch to the complement. In addition to the function introduced in Equation (7), three variants of the V-shaped function named $V_1$, $V_3$, and $V_4$ are introduced [74], which are shown in Table 1 and Figure 1. It can be seen that $V_1$ provides a higher probability than $V_2$, $V_3$, and $V_4$ for the same velocity, which affects the position updating of the search agents and the search for the optimum solution. The V-shaped transfer function was proposed to tackle some shortcomings of the S-shaped one. Although it solves the zero-velocity problem of such metaheuristic algorithms, they still suffer from falling into local optima: if the velocity of a search agent is low in one iteration, the search agent will probably remain unchanged in the next iteration [103]. In this study, in addition to the transfer functions mentioned so far, we utilized a novel U-shaped transfer function to convert the continuous MFO to binary form.
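The following Python sketch illustrates the V-shaped rule of Equations (7) and (8) using the $V_2$ (hyperbolic tangent) function; the helper name and array handling are illustrative assumptions.

```python
import numpy as np

def v_shaped_update(x_prev, v):
    """Keep or flip each bit of the previous binary position x_prev (Equation (8))."""
    prob = np.abs(np.tanh(np.asarray(v, dtype=float)))   # Equation (7): V2 transfer function
    r = np.random.rand(*prob.shape)
    flip = r < prob                                       # flip with probability TF_v(v)
    x_prev = np.asarray(x_prev, dtype=int)
    return np.where(flip, 1 - x_prev, x_prev)             # complement or stay
```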

B-MFO Using U-Shaped Transfer Function
The U-shaped transfer function [76] was proposed with two control parameters α and β, which define the slope and the width of the U-shaped function's basin, respectively. Equations (9) and (10) indicate the U-shaped function:

$$TF_u(v_i^d(t)) = \alpha\,|v_i^d(t)|^{\beta} \quad (9)$$

where $v_i^d(t)$ is the i-th search agent's velocity in dimension d at iteration t, and r is a uniform random number between 0 and 1. We used the U-shaped transfer function accompanied by two main conditions, shown in Equations (11) and (12), in which the lower and upper bounds are limited to 1.

The variants of the U-shaped transfer function named $U_1$, $U_2$, $U_3$, and $U_4$ were introduced using different values of the control parameter β [76], as shown in Table 1 and Figure 1. In the initial iterations, exploration is important for searching the whole space; after switching from exploration to exploitation, in the final iterations, exploitation is essential for finding better solutions. U-shaped transfer functions with different shapes therefore have different exploratory and exploitative behaviors. As illustrated in Figure 1, the U-shaped functions are comparable to the V-shaped ones; however, the U-shaped variants show higher exploration than the V-shaped variants. The $U_1$ function and the V-shaped variants intersect around +0.7 and −0.7; before these points, the exploration of the V-shaped functions is higher, and after them, the U-shaped variants display higher exploration. Therefore, U-shaped transfer functions have sufficient potential to outperform the other transfer functions.
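A minimal Python sketch of the U-shaped transfer function of Equation (9) is given below; α = 1 and β = 2 are assumed example parameter values, the clipping to 1 reflects the bound mentioned for Equations (11) and (12), and a V-shaped-style flip rule is assumed for the position update since the exact conditions are not reproduced here.

```python
import numpy as np

def u_shaped_tf(v, alpha=1.0, beta=2.0):
    """U-shaped transfer function of Equation (9), clipped to the upper bound of 1."""
    return np.minimum(alpha * np.abs(np.asarray(v, dtype=float)) ** beta, 1.0)

def u_shaped_update(x_prev, v, alpha=1.0, beta=2.0):
    """Flip bits of x_prev with the probability given by the U-shaped TF (assumed flip rule)."""
    prob = u_shaped_tf(v, alpha, beta)
    r = np.random.rand(*prob.shape)
    x_prev = np.asarray(x_prev, dtype=int)
    return np.where(r < prob, 1 - x_prev, x_prev)
```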
To map the continuous MFO to binary, each dimension of the search agents' positions obtained by Equation (2) is converted to a probability value in the range [0, 1] using all variants of $TF_s$, $TF_v$, and $TF_u$. Therefore, we adapted twelve different transfer functions from the three categories: S-shaped, V-shaped, and U-shaped. Using these transfer functions, each search agent's position is mapped to a probability value using Equations (5), (7), and (9). Finally, this probability value is used to update the position via Equations (6), (8), and (10), and a new binary position is created. Thus, we compare the twelve versions of the proposed B-MFO to find the most suitable one. Applying a proper transfer function is important, since the way the continuous search space is converted to a binary one strongly affects the classification results in feature selection problems. Algorithm 1 and Figure 2 show the pseudo-code and the flowchart of B-MFO, respectively. The time complexity of B-MFO is O(NDT), where N, D, and T signify the population size, dimension, and maximum number of iterations, respectively.

B-MFO for Solving Feature Selection Problem
The feature selection problem is to select an optimal subset of relevant and effective features to construct a more accurate data model. To formulate the feature selection problem, a binary vector representing a subset of features is defined, and the transfer function provides the probability of changing each element of the vector, which can be 0 (not selected) or 1 (selected). The length of the vector is equal to the number of features in the dataset. In addition, a fitness function is defined to evaluate the subset of features. Feature selection is referred to as a multi-objective optimization problem [105,106] since it usually aims to minimize the number of selected features and maximize the data model's accuracy. As shown in Equation (13), both objectives are represented in a single fitness function:

$$Fitness = \eta \cdot CE + \lambda \cdot \frac{N_{sf}}{N_{tf}} \quad (13)$$

where CE is the classification error, $N_{sf}$ and $N_{tf}$ are the number of selected features and the total number of features of the dataset, respectively, and η and λ = 1 − η denote the weights of classification quality and feature reduction, respectively [46].
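A minimal Python sketch of the fitness of Equation (13) is shown below; `classification_error` is a placeholder for the k-NN error described in Section 5, and the weighting η = 0.99 is an assumed example value rather than one stated here.

```python
import numpy as np

def feature_selection_fitness(binary_vector, X, y, classification_error, eta=0.99):
    """Wrapper fitness of Equation (13): weighted error plus feature-ratio penalty."""
    binary_vector = np.asarray(binary_vector, dtype=int)
    selected = np.flatnonzero(binary_vector)
    if selected.size == 0:
        return 1.0                                       # penalize the empty subset
    ce = classification_error(X[:, selected], y)         # CE term (e.g., k-NN error)
    ratio = selected.size / binary_vector.size           # N_sf / N_tf
    return eta * ce + (1.0 - eta) * ratio                # Equation (13), with lambda = 1 - eta
```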

Algorithm 1. Pseudo-code of B-MFO: initialize the moth population; in each iteration, calculate the probability value of M(i, j) using $TF_s$ in Equation (5), $TF_v$ in Equation (7), or $TF_u$ in Equation (9), and update the new position using Equations (6), (8), and (10).

Experimental Assessment
To execute the proposed B-MFO and the comparative algorithms under fair conditions, all algorithms were implemented on the MATLAB 2018a platform. They were run on Windows 10 with an Intel Core i7-6500U CPU (2.50 GHz) and 8 GB of main memory. In addition, the population size (N) and the maximum number of iterations (MaxIter) were set to 20 and 300, respectively, and each algorithm was run 30 times. The proposed B-MFO was experimentally evaluated with all transfer functions of the three categories, S-shaped [44], V-shaped [101], and U-shaped [76], to solve the feature selection problem. The experimental results were compared with the best results gained by four well-known binary metaheuristic algorithms: BPSO [44], bGWO [45], BDA [46], and BSSA [47]. The parameters of BPSO and bGWO were set as in the original studies: for BPSO, w decreases from 0.9 to 0.4 and C1 = C2 = 2; for bGWO, a decreases from 2 to 0. The other algorithms did not require any parameter setting.

Data Description
In this study, seven medical datasets [107,108] were used to evaluate B-MFO and the comparative algorithms on the feature selection problem. Table 2 shows the details of the datasets in terms of the number of features, the number of samples, and the size, where a dataset is considered large if it has more than 100 features. In our evaluation, a k-nearest neighbor (k-NN) classifier with the Euclidean distance metric and k = 5 neighbors [56] was used within the fitness function to assess the quality of the selected feature subsets. To reduce overfitting, k-fold cross-validation with k = 10 was utilized, which divides a dataset into k folds; the classifier uses k − 1 folds for training and the remaining fold for testing. This process is repeated k times so that each fold is used once as test data.
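The following Python sketch mirrors this evaluation protocol (k-NN with five neighbors, Euclidean distance, and 10-fold cross-validation); it uses scikit-learn as an assumption, since the paper's own experiments were implemented in MATLAB.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_cv_error(X_subset, y, n_neighbors=5, n_folds=10):
    """Return the classification error (CE) of a k-NN classifier under k-fold cross-validation."""
    clf = KNeighborsClassifier(n_neighbors=n_neighbors, metric="euclidean")
    accuracy = cross_val_score(clf, X_subset, y, cv=n_folds, scoring="accuracy")
    return 1.0 - accuracy.mean()   # error used as the CE term in the fitness function
```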

Evaluation Criteria
The proposed B-MFO was compared with the comparative algorithms using various metrics: average accuracy, standard deviation of accuracy, average fitness, standard deviation of fitness, and average number of selected features. Moreover, the performance of the k-NN classifier was measured using sensitivity and specificity derived from the confusion matrix, which contains the information about the actual and predicted classifications given by the classifier. Sensitivity evaluates the ability of the model to identify true positives, and specificity measures its ability to identify true negatives. The average accuracy gained by B-MFO and the comparative algorithms was statistically analyzed by the nonparametric Friedman test [109]. In addition, the convergence behavior of B-MFO and the comparative algorithms was visualized.
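For reference, the sketch below computes sensitivity and specificity from the entries of a binary confusion matrix; the function name and label encoding are illustrative.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate) for binary labels."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)   # ability to detect positives
    specificity = tn / (tn + fp)   # ability to detect negatives
    return sensitivity, specificity
```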

Discussion of the Results
In this section, the best results achieved by B-MFO using the three categories of S-shaped, V-shaped, and U-shaped transfer functions are compared with the comparative algorithms on each dataset in terms of various metrics. Table 3 reports the average accuracy, the standard deviation of accuracy, and the average number of selected features. The average fitness and the standard deviation of fitness are shown in Table 4, where bold values indicate the best results. In addition, Table 5 reports the specificity and sensitivity achieved by the k-NN classifier on the large datasets, which shows that B-MFO produced better results than the comparative algorithms. Our hypothesis is that the sensitivity and specificity of B-MFO are more reliable than those of the comparative algorithms as the dataset size increases. According to Figure 3 and the average accuracy obtained, B-MFO outperforms the comparative algorithms, especially on large datasets. In addition, on most datasets, the minimum number of features selected by B-MFO shows that B-MFO could avoid local optima trapping and obtain the optimum solution. Figure 4 presents the average number of selected features on the large datasets: PenglungEW, Parkinson, Colon, and Leukemia. These results indicate the significant effect of transfer functions on the algorithms' behavior in the position updating of search agents and finding the optimum solution in the feature selection problem. Among the three categories of transfer functions used by B-MFO, the U-shaped transfer functions outperform the V-shaped and S-shaped ones in terms of maximizing the classification accuracy and minimizing the number of selected features, especially for large datasets.

Figure 5 shows the convergence curves of the average fitness achieved by the winning version of B-MFO and the comparative algorithms. The curves show that B-MFO can find better solutions and provides a balance between exploration and exploitation. To statistically analyze the results, the Friedman test was applied to rank B-MFO and the comparative algorithms. Table 6 presents the results of the Friedman test on the average accuracy achieved by B-MFO and the comparative algorithms, which shows that B-MFO ranks first in comparison with the other algorithms.

Conclusions and Future Work
Numerous large datasets that include redundant and irrelevant features have been created in the field of medical technology. To select effective features from different medical datasets, this study proposed three categories of binary moth-flame optimization (B-MFO). To this end, the canonical MFO was converted from continuous to binary using variants of S-shaped, V-shaped, and U-shaped transfer functions. Each category contains four versions of transfer functions; accordingly, twelve versions of B-MFO were experimentally evaluated on seven medical datasets. Finally, the winning versions of B-MFO were compared with the best results achieved by four well-known binary metaheuristic optimization algorithms: BPSO, bGWO, BDA, and BSSA. The results show that B-MFO outperforms the other comparative algorithms in terms of classification accuracy and the number of selected features, especially for large medical datasets. In addition, among the transfer functions used by B-MFO, the U-shaped functions outperform the V-shaped and S-shaped ones in terms of classification accuracy and minimizing the number of selected features. For future work, B-MFO, particularly with U-shaped transfer functions, can be applied to select effective features in large-scale optimization problems, since it showed sufficient results for large datasets. In addition, B-MFO can be applied to other domains such as various engineering applications.