Distributed Fuzzy Cognitive Maps for Feature Selection in Big Data Classification

Abstract: The features of a dataset play an important role in the construction of a machine learning model. Because big datasets often have a large number of features, they may contain features that are less relevant to the machine learning task, which makes the learning process more time-consuming and complex. To facilitate learning, it is always recommended to remove the less significant features. Eliminating the irrelevant features and finding an optimal feature set involves comprehensively searching the dataset and considering every feature subset in the data. In this research, we present a distributed, fuzzy-cognitive-map-based wrapper method for feature selection that extracts the features of a dataset that play the most significant role in decision making. Fuzzy cognitive maps (FCMs) represent a hybrid computing technique combining elements of both fuzzy logic and cognitive maps. Using Spark's resilient distributed datasets (RDDs), the proposed model works effectively in a distributed manner, providing quick in-memory processing along with efficient iterative computation. According to the experimental results, when the proposed model is applied to a classification task, the features selected by the model help to expedite the classification process. The selection of relevant features using the proposed algorithm is on par with existing feature selection algorithms. In conjunction with a random forest classifier, the proposed model produced an average accuracy above 90%, as opposed to 85.6% accuracy when no feature selection strategy was adopted.


Introduction
In an age when big data is becoming increasingly popular, there is a huge amount of information that is redundant as well as irrelevant, which poses a challenge for academia and industry alike. The data that are gathered may be of high dimensionality, and, in most cases, not all of the features that are gathered are equally meaningful. Some of the features may be noisy, nonsensical, correlated, or unrelated to the task at hand. In most cases, high-dimensional data pose a problem for modeling tasks, since models are not always designed to cope with excessive numbers of inconsequential features, and this can reduce the performance of a predictive model. As a way of mitigating these problems, feature selection can identify and select relevant features, eliminating nonessential or redundant features while maintaining or improving classification accuracy [1]. Feature selection is a topic that has been studied intensively for a long time. Given the need to process massive amounts of data in a wide range of fields, the importance of this task has only grown. A second issue that practitioners must contend with is the lack of available computing resources. When dealing with huge amounts of data, the bulk of the approaches currently available for feature selection do not scale properly, and their efficiency may decline substantially, to the point that they are no longer usable. In their analysis of the most popular feature selection methods, Bolón-Canedo et al. [2] concluded that, because present approaches are expected to prove inadequate for the rising number of features found in big data, there is a growing demand for scalable and efficient feature selection methods. The primary research gap addressed here is therefore that existing feature selection approaches are unlikely to scale well to extensive data: their efficiency may decline considerably, or the feature selection strategy may become inapplicable altogether.
Fuzzy Cognitive Maps (FCMs) [3] are systems inspired by human cognitive ability. FCMs employ a recursive learning procedure in order to learn about a given system and discover its various aspects. An FCM is represented using a directed graph. The nodes in the graph represent the most important components of the system under consideration, and the connections between the nodes represent the causal relationships between these concepts. Since FCMs represent the cause-effect relationships between the attributes of a system, they can be adopted to identify the features of a system that have maximum influence in the decision-making process. One of the major barriers in using FCMs is their inability to handle large datasets. This can be overcome by employing a distributed process to perform the FCM learning. The purpose of the current research was to examine the feasibility of using distributed FCMs to choose features from different data sources. The major contributions of this work are the following:
• A novel fuzzy-cognitive-map-based technique to extract the most significant features in a dataset that contribute to decision making was introduced.
• The proposed model was implemented in a distributed manner, thus enabling the scalability of the feature selection algorithm.
• A comparison of the performance of the proposed distributed fuzzy cognitive map feature selection algorithm with other best-performing algorithms was carried out.
The paper is organized as follows: Section 2 provides insight into the current literature. Section 3 elaborates on the materials and methods used in the article. Section 4 demonstrates the results obtained and Section 5 is the performance analysis and discussion section. The paper's conclusion is presented in Section 6.

Literature Review
A feature selection approach does not alter the original representation of the variables being analyzed; rather, it only selects a subset of them, as opposed to other dimensionality reduction approaches, such as those based on projection or compression. In this way, the variables preserve their original semantics and offer the advantage of being interpretable by domain experts. Feature selection can serve a variety of purposes, but the most important goals are (a) to reduce overfitting and improve model performance, that is, to achieve better prediction performance in supervised classification and better cluster detection in clustering; (b) to develop faster and more cost-effective models; and (c) to gain an understanding of the processes that produced the data. However, the advantages of feature selection techniques are not without their downsides, as searching for a subset of relevant features introduces another layer of complexity to the modeling task. When the class labels of a feature set are known, feature selection strategies are categorized as supervised; they are categorized as unsupervised when the class labels are unknown. There are three types of supervised feature selection strategy used in classification: filter, wrapper, and embedded methods [4]. The filter approach works as a preprocessing step and utilizes the general characteristics of the training data, independent of the predictive model being used [5]. The wrapper selection method creates many models with different subsets of input features and selects the ones that yield the best performance [4]. A feature selection procedure that is embedded in a model's training process is termed an embedded approach. Saeys et al. [6] investigated different ensemble feature selection techniques, which pool the strengths of different feature selection approaches to provide more reliable outcomes. Bolón-Canedo et al.
[7] proposed an ensemble-learning-based feature selection to enhance the performance of microarray data classification. Other techniques have also been combined with feature selection, such as tree ensembles [8] and feature extraction [9]. Zang et al. suggested a two-stage feature selection algorithm by combining ReliefF and mRMR [10], while Akadi et al. [11] proposed a two-stage feature selection algorithm for genomic data by combining Minimum Redundancy-Maximum Relevance and Genetic Algorithm methods to obtain the optimal feature subset. Yuchen Jiang et al. [12] discussed three different approaches for feature selection in large-scale industrial processes for soft sensor construction. Karthik et al. [13] presented a method to improve the performance of open source software data prediction using Bayesian classification. Bhadoria et al. [14] discussed the use of an auto-encoder for dimensionality reduction based on bunch graphs. A new feature selection technique with ensemble learning, introduced by Hashemi et al. [15], converts the feature selection procedure into a multicriterion decision-making problem that is subsequently analyzed using the VIKOR method. Kusy et al. [16] addressed the problem of feature selection as an aggregate of three cutting-edge filtration techniques: Pearson's linear correlation coefficient, the ReliefF algorithm, and decision trees. Chellappan et al. [17] discussed the feature selection mechanisms available in the Apache Spark platform. Feature selection algorithms have been used in many real-world applications in the available literature, including intrusion detection [18], text categorization [19], email classification [20], microarray analysis [6], [21], and information retrieval [22].
Kosko introduced the concept of fuzzy cognitive maps in 1986 [3] as an extension of Axelrod's cognitive map proposal from 1976 [23]. FCMs provide a framework for representing complex systems by modeling their components and causal relationships. FCMs add fuzzy logic to conventional cognitive maps. Due to their capacity to describe any complicated system, FCMs have attracted numerous researchers and have been effectively used in a wide range of scientific domains. Using FCMs in the medical field, Giles et al. [24] studied the many causes of diabetes, Giabbanelli et al. [25] detected obesity through psychological behavior, and Papageorgiou et al. [26] used FCMs to investigate whether a person's propensity to acquire breast cancer is affected by their family history of the condition. For crisis management decision making, Andreou et al. [27] suggested using FCMs as a tool for modeling political and strategic challenges. Zhai et al. [28] performed a credit risk assessment using FCMs. The existing literature suggests many FCM expansions. Carvalho and Tome [29] presented a rule-based fuzzy cognitive map as an expansion of FCMs to include methods for dealing with feedback. Cognitive maps that deal with diverse meaning contexts were postulated by Salmeron [30]. Intuitionistic Fuzzy Cognitive Maps (iFCMs) were developed by Iakovidis and Papageorgiou [31] to address experts' hesitancy in making decisions. Liu et al. [26] suggested an FCM variant that allows for the identification of dynamic causal links between concepts. To represent dynamic systems, Aguilar [32] came up with the idea of dynamic random fuzzy cognitive maps (DRFCMs). The model of fuzzy cognitive networks (FCNs) was first proposed by Kottas et al. [33] based on the idea that equilibrium points always exist, while Chunying [34] presented Rough Cognitive Maps, a fuzzy cognitive map based on Rough Set Theory.
By building upon the relevant literature, this paper proposes an approach for the selection of features using distributed FCMs. The material and methods connected to distributed-FCM-based feature selection are presented in the next section. Next, the application of distributed-FCM-based feature selection is described and the obtained results are discussed. In the final section, conclusions and future challenges related to this issue are highlighted.

Materials and Methods
A framework for the selection of features using distributed FCMs is proposed in the current paper. Figure 1 depicts the overall workflow of the proposed method. In this method, the input dataset is used to construct a distributed FCM model, which is elaborated in Section 3.1. Section 3.2 explains how the constructed FCM is then used for feature selection, and the selected features are then passed on to the classification model for evaluation, as described in Section 3.3.

Distributed Fuzzy Cognitive Maps
A wide range of applications are available for FCMs. They provide a modeling technique that is useful for representing highly complex systems, and they can also be used to model uncertainties and improve accuracy in various application problems. A fuzzy cognitive map is a signed weighted digraph that includes fuzzy causal relationships between its nodes. An FCM consists of three components: concepts, a state vector, and a weight matrix. As part of the construction of an FCM, the state vector and weight matrix must be initialized, followed by training of the FCM model.
The state vector is composed of the values of all the concepts in the system. Each positive value in the state vector indicates the inclusion of a particular feature. The weight matrix should be initialized to define the semantic properties of the dataset. A correlation matrix is used to initialize the weight matrix of the FCM: each element of the weight matrix is the Pearson correlation coefficient between a pair of features of the dataset. The correlation coefficient is computed using the following formula:

r = Σ(x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )   (1)

where r is the correlation coefficient, x_i denotes a sample's x-variable values, x̄ is their mean, y_i denotes the sample's y-variable values, and ȳ is their mean. A value of r = 1 indicates a perfect positive correlation, while r = −1 indicates a perfect negative correlation. The weight matrix and the state vector are stored in resilient distributed datasets.
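As a minimal sketch (not the paper's implementation), the correlation-based weight-matrix initialization can be written in plain Python; the function names `pearson` and `init_weight_matrix` are illustrative only:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length samples (Equation (1))
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def init_weight_matrix(columns):
    # columns: one list of values per feature; W[i][j] = r(feature_i, feature_j)
    k = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]
```

In a Spark implementation, the resulting matrix would then be placed in an RDD; here it is simply a list of lists.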
FCM learning is applied on the initial state vector using the following formula:

A_i^(k+1) = f( A_i^(k) + Σ_{j=1, j≠i}^{n} W_ji · A_j^(k) )   (2)

where W_ji is the weight of the link between concepts C_j and C_i, and A_i^(k+1) indicates the value of concept C_i at step k + 1. The sigmoid function is utilized as the threshold function f(x):

f(x) = 1 / (1 + e^(−λx))   (3)

Iteratively, the state vector values are computed and the weights are modified until epsilon, a residual value that yields the smallest error difference between succeeding concepts, is attained. The equation used in each iteration step to update the weight matrix is a Hebbian-type rule, given in Equation (4):

W_ji^(k) = W_ji^(k−1) + η_k · A_j^(k−1) ( A_i^(k−1) − A_j^(k−1) · W_ji^(k−1) )   (4)

where W_ji^(k) represents the revised weight after the kth iteration and η_k represents the value of the learning parameter in the kth iteration. Concept values fall within the interval [0, 1], whereas weight values fall within the interval [−1, 1].
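A sequential sketch of this learning loop is given below, assuming the sigmoid with steepness λ = 1 and synchronous updates; the weight-update step of Equation (4) is omitted for brevity, and the function names are illustrative rather than taken from the paper's code:

```python
import math

def sigmoid(x, lam=1.0):
    # Threshold function f(x) = 1 / (1 + exp(-lam * x)), cf. Equation (3)
    return 1.0 / (1.0 + math.exp(-lam * x))

def fcm_step(state, W):
    # One synchronous application of Equation (2):
    # A_i(k+1) = f(A_i(k) + sum over j != i of W[j][i] * A_j(k))
    n = len(state)
    return [
        sigmoid(state[i] + sum(W[j][i] * state[j] for j in range(n) if j != i))
        for i in range(n)
    ]

def run_fcm(state, W, eps=1e-4, max_iter=100):
    # Iterate until successive state vectors differ by less than eps (the
    # residual "epsilon" criterion described in the text)
    for _ in range(max_iter):
        nxt = fcm_step(state, W)
        if max(abs(a - b) for a, b in zip(nxt, state)) < eps:
            return nxt
        state = nxt
    return state
```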

A parallel learning process is proposed for FCM, as depicted in Figure 2. The parallelize function is given the weight matrix of the Spark's Resilient Distributed Datasets (RDD) as input. Through the parallelize function, the weight matrix RDD is divided into sets of causal relations, and multiple new RDDs are created that contain subsets of weight matrix values for each node in the distributed system. As well as the weight matrix, the FCM learning requires the state vector. This requires the state vector to be available in all nodes where the weight matrix has been distributed. Therefore, we use the broadcast function on the state vector RDD, and the state vector is duplicated across all nodes. All distributed nodes cache the state vector. As the weight matrix RDD contains multiple rows and columns, it occupies very large amounts of space. However, the state vector RDD is a one-dimensional vector, so replicating it across the nodes will not impact memory capacity. FCM learning is applied at each node using Equation (2), and a partial result is generated. A final global solution, which is the final state vector, is obtained by combining these partial results.
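The partition-and-combine scheme can be illustrated without Spark. In the sketch below, plain Python lists stand in for the RDD partitions created by `parallelize`, and handing the full state vector to every worker plays the role of `broadcast`; all names are illustrative, and a real implementation would run `partial_update` on separate cluster nodes:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def partial_update(indices, W, state):
    # Worker task: compute updated values for this node's share of concepts,
    # using the replicated ("broadcast") state vector and its slice of weights
    n = len(state)
    return {
        i: sigmoid(state[i] + sum(W[j][i] * state[j] for j in range(n) if j != i))
        for i in indices
    }

def distributed_step(W, state, num_partitions=2):
    n = len(state)
    # "parallelize": split the concept indices across partitions
    parts = [list(range(p, n, num_partitions)) for p in range(num_partitions)]
    # each partition yields a partial result; merging them gives the new state
    merged = {}
    for part in parts:
        merged.update(partial_update(part, W, state))
    return [merged[i] for i in range(n)]
```

Because each concept's update depends only on the shared state vector and its own incoming weights, the merged result is identical regardless of how the work is partitioned, which is what makes the Spark mapping straightforward.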
The physical execution plan of the distributed FCM is represented using a directed acyclic graph, as depicted in Figure 3. The dataset is read by the compiler and stored in the Hadoop distributed file system (HDFS). The weight matrix and the state vector are computed from the dataset. The weight matrix is passed to the parallelize function, where it is split into chunks using a hash function and sent to each node in the distributed system. The state vector is broadcast across the nodes using the broadcast() method. The FCM model performs its computations on each individual node, and the separate partial results are combined to form the final state vector.

FCM-Based Feature Selection
In order to extract the most significant features using an FCM, a causal relationship graph is constructed with the features as the nodes of the graph. The presence of a specific feature is indicated by a positive value in the state vector, and its absence is indicated by 0. A correlation matrix is used to initialize the weight matrix of the fuzzy cognitive map. During the FCM learning process, a different combination of features in the state vector is evaluated in each iteration. The feature set that provides the best level of classification accuracy on the test data is chosen at each iteration and permanently added to the subsequent state vector. Classification accuracy (CA) is computed as the proportion of accurately classified test instances:

Accuracy = (Successfully classified test cases / Total number of test cases) × 100   (5)

After a feature has been selected and added to the state vector, the same test is applied to the combinations of the remaining features with the selected feature affixed at its position. These iterations continue until the target performance level is met or all features have been selected. Let F = {f1, f2, f3, f4, ..., fn} be the set of all features in the dataset, S = {Ø} be the initial set of selected features, A = {A1, A2, A3, ..., An} be the state vector, and M be the weight matrix of the FCM. In each iteration, a different set of feature values in the state vector A is set to 1 and the others are set to 0, for example, A = {01001010001}. The objective function of the FCM model is the maximization of classification accuracy. The FCM model is trained using different combinations of the state vector A, and the classification accuracies are computed using Equation (5). The feature set A that gives the highest classification accuracy is selected and added to the selected feature set S. The iterations continue until either the desired classification accuracy has been achieved or all the features in F have been selected at least once.
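The wrapper loop described above amounts to greedy forward selection. In the hedged sketch below, `evaluate` is a stand-in for training the classifier on a candidate feature subset and returning its test accuracy per Equation (5); it and the other names are illustrative, not the paper's actual interfaces:

```python
def select_features(n_features, evaluate, target_accuracy):
    # selected: indices of permanently chosen features (state-vector positions fixed at 1)
    selected = []
    best_acc = 0.0
    remaining = set(range(n_features))
    while remaining:
        # try each remaining feature together with the already-selected ones
        cand, cand_acc = None, best_acc
        for f in sorted(remaining):
            acc = evaluate(selected + [f])
            if acc > cand_acc:
                cand, cand_acc = f, acc
        if cand is None:
            break  # no candidate improves accuracy; stop early
        selected.append(cand)
        best_acc = cand_acc
        remaining.discard(cand)
        if best_acc >= target_accuracy:
            break  # desired classification accuracy achieved
    return selected, best_acc
```

In the proposed model, each call to `evaluate` would itself be a distributed FCM training and classification run; the outer loop structure is the same.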



Classification Model
For the performance analysis of the proposed model, the classification algorithms Naïve Bayes, Decision Tree, Random Forest, Multilayer Perceptron, and Logistic Regression were used, in their distributed versions. The datasets were converted into pairs of feature vectors and class labels using the VectorAssembler method in Spark. Since all the datasets were numerical, no other preprocessing steps were necessary. The datasets were partitioned into different subsets and spread across different nodes. In the distributed Naïve Bayes classifier, the class labels and feature vectors are mapped together and distributed across nodes, and a hash function is used to determine the conditional probabilities of the features, which are used to produce a probability table. The probability table is then utilized to classify the dataset. In the distributed Decision Tree classifier, binary partitioning is performed recursively to classify the features using a greedy algorithm. In the Random Forest model, each tree is trained using a different subset of the data, drawn from the same training set. A TreePoint structure is used to save memory by storing the replica count of each instance in each subset. The number of mappers created is the same as the number of trees in the Random Forest. Parallel training of a variable set of trees is optimized depending on memory constraints, and Random Forest models reduce the risk of overfitting. The multilayer perceptron classifier (MLPC) is a feed-forward neural network classifier with fully connected layers of nodes. The input layer nodes represent the input data; all other nodes translate inputs to outputs by linearly combining the inputs with weights and a bias and applying an activation function. The model takes the composition of layers as input.
In the multinomial logistic regression model, a matrix over the outcome classes and the features is created, and a Softmax function is used to model the conditional probabilities of the outcome classes.
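As a sketch, the Softmax mapping from per-class scores to conditional probabilities can be written as follows (a generic implementation, not the library's internal one):

```python
import math

def softmax(scores):
    # Subtracting the maximum score first keeps the exponentials
    # numerically stable without changing the resulting probabilities
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```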
The algorithm corresponding to the proposed methodology is given in Algorithm 1. In the proposed algorithm, the accuracy threshold is the accuracy value obtained when the classifier is applied to the dataset without any feature selection; the intention is to obtain better accuracy values than this baseline. Another assumption is the epsilon threshold value. The core learning loop of Algorithm 1 proceeds as follows:

9   Parallelize the weightMatrix
10  Broadcast the stateVectorA
11  updatedVectorA = weightMatrix * stateVectorA
12  stateVectorA = updatedVectorA
13  Compute the classification accuracy of the updated stateVectorA
14  if (accuracy > accuracyThreshold)
15      Add the features in stateVectorA to featureVector
16  weightMatrix = updateWeights(weightMatrix)
17  epsilon = computeEpsilon()
18  if (epsilon < threshold)
19      break
20  }

Results
The experiment was performed on a high-performance Hadoop cluster with one name-node server and two data-node servers, with a combined capacity of 768 GB of RAM and 144 processor cores. The cluster supports Apache Spark version 3.0.0. To determine the effectiveness and efficiency of the proposed model, 15 benchmark datasets from the UCI machine learning repository [35], the Kaggle data repository, the OpenML dataset repository, and the PROMISE software dataset repository, which have been commonly used to evaluate feature selection models in the literature, were used.
The datasets used are summarized in Table 1. Various datasets with dimensionalities ranging from extremely low to very high and different sizes were taken into consideration to ascertain the performance of the proposed model on different types of dataset.


Proposed Feature Selection vs. Existing Feature Selection
A comparison of the performance of the proposed distributed FCM-based feature selection model against a number of state-of-the-art methods available in the Apache Spark [17] platform was also carried out. The model was compared with existing distributed feature selection methods, namely VectorSlicer, RFormula, ChiSqSelector, UnivariateFeatureSelector, and VarianceThresholdSelector, in terms of how well it selected the optimal number of features, as depicted in Table 2. The proposed model was capable of selecting the optimal number of features in most of the datasets considered.

Performance Evaluation of the Proposed Feature Selection
A set of five classification algorithms (Naïve Bayes, Decision Tree, Random Forest, Multilayer Perceptron, and Logistic Regression) was used to evaluate the efficiency of the optimal feature set obtained using FCM-based feature selection. Table 3 presents the accuracy values obtained for the different classification algorithms when the datasets taken into consideration were evaluated after FCM-based feature selection. The results show that the Random Forest algorithm tended to produce the highest accuracy values compared to the other classification algorithms. Random Forest uses bootstrap aggregation and randomization in the selection of data during the construction of decision trees to obtain a high degree of predictive accuracy. The Random Forest algorithm also tends to handle large datasets more efficiently than the other classification algorithms; therefore, Random Forest was chosen as the classification model to evaluate the performance and efficiency of the proposed feature selection method. The Random Forest algorithm comprises an ensemble of individual decision trees. Each tree in the random forest produces a class prediction, and the class with the most votes becomes the prediction of the model.
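The majority-voting step across trees can be sketched in a few lines; this is a generic illustration of the voting rule, not Spark's internal aggregation:

```python
from collections import Counter

def majority_vote(tree_predictions):
    # tree_predictions: one predicted class label per tree in the forest;
    # the most frequent label becomes the ensemble's prediction
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label
```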

Table 4 shows the results in terms of the average accuracy of the Random Forest classifier and the optimal subset of features selected. The proposed model's performance on classification tasks was evaluated using the Random Forest method, since it achieved the best classification results for the datasets under consideration. The results show that the proposed model identified the optimal subset of features and improved classification accuracy for all datasets taken into consideration. In terms of classification accuracy, the proposed model obtained strong results while selecting only a relatively small percentage of the features in each dataset, thereby considerably reducing the amount of data to be processed.

Performance Comparison with Existing Feature Selection Methods
The classification accuracy of the proposed model was compared with the classification accuracies obtained after feature selection using state-of-the-art feature selection methods; the results are depicted in Figure 6. They show that the proposed feature selection model produced better results for the datasets under consideration than the existing feature selection methods, namely VectorSlicer, RFormula, ChiSqSelector, UnivariateFeatureSelector, and VarianceThresholdSelector. The proposed model uses accuracy values to determine the optimal feature set, rather than computing it with statistical measures; it therefore serves as a good feature selection model for classification tasks.
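The baseline methods just mentioned are filter-style selectors that rank features by a statistic rather than by classifier accuracy. The contrast can be seen in a minimal sketch using scikit-learn analogues of two of the Spark selectors (a chi-squared filter and a variance filter); the dataset, `k`, and the variance threshold are illustrative assumptions.

```python
# Filter-style selection scores features by a statistic, not by the
# downstream classifier's accuracy. scikit-learn analogues of two Spark
# selectors are shown: SelectKBest(chi2) for ChiSqSelector and
# VarianceThreshold for VarianceThresholdSelector. Values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

X, y = load_breast_cancer(return_X_y=True)

# Chi-squared filter: keep the k features most dependent on the label.
chi2_selected = SelectKBest(chi2, k=10).fit(X, y).get_support(indices=True)

# Variance filter: drop near-constant features (label is never consulted).
var_selected = VarianceThreshold(threshold=0.01).fit(X).get_support(indices=True)

print("chi2 filter keeps features:", list(chi2_selected))
print("variance filter keeps", len(var_selected), "of", X.shape[1], "features")
```

Neither filter ever trains the classifier, which is why such methods are fast but can miss feature subsets that only pay off in combination, the case a wrapper method is designed to catch.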

Performance Analysis and Discussion
In the course of this investigation, a model for the selection of features based on a wrapper technique was adopted. The feature selection model used in previous studies is a sequential implementation of the wrapper technique, as illustrated in Figure 8. In the sequential technique, learning is conducted consecutively following feature subset generation, and if it does not pass the evaluation threshold, the feature subset generation is repeated; otherwise, the subset is chosen as the best feature set.
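The sequential wrapper loop described above — generate a subset, evaluate it, repeat until the evaluation threshold is met — can be sketched as follows. The threshold, the subset size, the iteration cap, and the random subset generator (standing in for the FCM-driven generation step) are all illustrative assumptions.

```python
# Sketch of the sequential wrapper technique: generate a feature subset,
# evaluate it with the learning algorithm, and repeat until a subset
# passes the evaluation threshold. The threshold, subset generator, and
# dataset are illustrative stand-ins for the paper's FCM-based components.
import random

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rng = random.Random(0)

def generate_subset(n_features, size):
    """Propose a random candidate subset (stand-in for the FCM-driven step)."""
    return sorted(rng.sample(range(n_features), size))

def evaluate(subset):
    """Learning step: cross-validated accuracy on the candidate columns."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, subset], y, cv=3).mean()

THRESHOLD = 0.93               # illustrative evaluation threshold
best_subset, best_score = None, 0.0
for _ in range(20):            # cap on the number of generate/evaluate rounds
    subset = generate_subset(X.shape[1], size=8)
    score = evaluate(subset)
    if score > best_score:
        best_subset, best_score = subset, score
    if score >= THRESHOLD:     # subset passes: chosen as the best feature set
        break

print(f"selected {len(best_subset)} features, accuracy {best_score:.3f}")
```

Each round blocks on a full train-and-evaluate cycle, which is exactly the bottleneck the distributed variant discussed next is meant to relieve.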

In this work, a distributed feature selection approach was adopted, as depicted in Figure 9. In the proposed model, a feature subset is generated and the efficiency of the selected feature subset is evaluated in a distributed manner by the learning algorithm. If the generated subset satisfies the evaluation threshold, it is selected as the optimal feature set; otherwise, a new feature subset is generated and the evaluation process continues. The dataset is partitioned into different chunks as part of the learning process of the feature selection model, as shown in Figure 10.

The performance of the proposed model was compared with existing distributed feature selection models, namely VectorSlicer, RFormula, ChiSqSelector, UnivariateFeatureSelector, and VarianceThresholdSelector. Table 2 demonstrates that, across the majority of datasets, the minimum number of features was acquired by using the proposed distributed FCM-based feature selection technique. The datasets for which the proposed algorithm selected the smallest number of features have been highlighted: for all but three datasets, the proposed model chose the smallest number of features. The effectiveness of the optimal feature set acquired using FCM-based feature selection was evaluated using five classification algorithms: Naïve Bayes, Decision Tree, Random Forest, Multilayer Perceptron, and Multinomial Logistic Regression. After FCM-based feature selection, the datasets used for evaluation yielded the accuracy values displayed in Table 3. The findings indicate that, compared to the other classification methods, the Random Forest approach most often yielded the highest accuracy scores for the considered datasets. The Random Forest method employs majority-vote prediction over an ensemble of individual decision trees, which reduces errors and leads to more accurate results. Hence, the Random Forest algorithm was used to compare the performance of the proposed model with existing feature selection algorithms. Figure 6 depicts the comparison of the classification accuracies obtained. It can be noted that, even though the number of features selected by the proposed model was not minimal for all datasets, the features selected formed the optimal feature set, since the proposed model produced the best accuracy values compared to the other models.
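The Spark RDD implementation is not reproduced here, but the shape of the distributed evaluation step can be emulated in plain Python: a pool of workers stands in for Spark executors, each worker scores the candidate subset on one partition of the data, and the partial accuracies are combined, mirroring a map/reduce over RDD partitions. The dataset, partition count, and candidate subset are illustrative assumptions.

```python
# Emulation of the distributed evaluation step: each worker (thread, here
# standing in for a Spark executor) scores the candidate subset on one
# partition of the held-out data, and the per-partition accuracies are
# averaged. Dataset, partition count, and subset are illustrative.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def score_on_partition(args):
    """Map step: score the candidate subset on one partition of the test set."""
    subset, part_idx, n_parts = args
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_train[:, subset], y_train)
    Xp = np.array_split(X_test, n_parts)[part_idx]
    yp = np.array_split(y_test, n_parts)[part_idx]
    return clf.score(Xp[:, subset], yp)

def distributed_accuracy(subset, n_parts=4):
    """Reduce step: average the per-partition accuracies (partitions are
    near-equal in size, so the mean approximates overall accuracy)."""
    tasks = [(subset, i, n_parts) for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        return sum(pool.map(score_on_partition, tasks)) / n_parts

subset = list(range(10))            # a hypothetical candidate subset
print(f"distributed accuracy: {distributed_accuracy(subset):.3f}")
```

In the actual Spark model the partitions live in distributed memory as RDDs and the reduce is performed by the cluster, which is what makes the repeated evaluate-and-regenerate loop tractable at big data scale.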

Conclusions
Enormous amounts of high-dimensional data are prevalent in numerous fields, including social media, e-commerce, bioinformatics, healthcare, transportation, and online education. Feature selection has been widely employed as a preprocessing phase to reduce the dimensionality of problems and increase classification accuracy. The need for such approaches has grown in recent years due to situations involving high numbers of input attributes and samples; in other words, today's big data boom has resulted in the challenge of big dimensionality.
This paper addressed the pressing need for a scalable feature selection algorithm with a distributed fuzzy cognitive map based feature selection method. The algorithm was tested on 15 benchmark datasets and compared with existing feature selection algorithms, including VectorSlicer, RFormula, ChiSqSelector, UnivariateFeatureSelector, and VarianceThresholdSelector. The efficiency of the optimal feature set obtained was evaluated on different classification algorithms, namely Naïve Bayes, Decision Tree, Random Forest, Multilayer Perceptron, and Logistic Regression. The results show that the Random Forest algorithm produced the most accurate results for most of the datasets considered. The average accuracy was above 90% when the Random Forest classifier was used together with the proposed feature selection method, in contrast to 85.6% without feature selection. The number of features in the optimal feature set was also considerably smaller for the proposed model than for the existing methods.