Effective Voting Ensemble of Homogenous Ensembling with Multiple Attribute-Selection Approaches for Improved Identification of Thyroid Disorder

Thyroid disease is characterized by abnormal development of glandular tissue on the periphery of the thyroid gland. It occurs when this gland produces an abnormally high or low level of hormones, with hyperthyroidism (overactive thyroid gland) and hypothyroidism (underactive thyroid gland) being the two most common types. The purpose of this work was to create an efficient homogeneous ensemble of ensembles in conjunction with several feature-selection methodologies for the improved detection of thyroid disorder. The dataset employed is based on real-time thyroid information obtained from the District Head Quarter (DHQ) teaching hospital, Dera Ghazi (DG) Khan, Pakistan. Following the necessary preprocessing steps, three attribute-selection strategies were used: Select From Model (SFM), Select K-Best (SKB), and Recursive Feature Elimination (RFE). Decision Tree (DT), Gradient Boosting (GB), Logistic Regression (LR), and Random Forest (RF) classifiers were used as feature estimators. The homogeneous ensembling stage applied bagging- and boosting-based classifiers, whose predictions were then combined by the voting ensemble using both soft and hard voting. Accuracy, sensitivity, mean square error, Hamming loss, and other performance assessment metrics were adopted. The experimental results indicate the applicability of the proposed strategy for improved thyroid ailment identification. All of the employed approaches achieved 100% accuracy with a small feature set. In terms of accuracy and computational cost, the presented findings outperformed similar benchmark models in this domain.


Introduction
The thyroid gland is located near the base of the neck and is responsible for secreting thyroid hormones, which play an important role in human metabolism [1]. When this gland is overactive, it secretes an excessive amount of hormone, a condition referred to as hyperthyroidism. In contrast, insufficient thyroid hormone secretion results in hypothyroidism. The thyroid gland produces two hormones: levothyroxine, also known as T4, and triiodothyronine, also referred to as T3 [2,3].
Hyperthyroidism is characterized by an abnormally high level of secretion. The human body's metabolism is quick, and a person may have symptoms such as rapid weight loss, irregular heartbeat, high blood pressure, and so on [4,5]. On the contrary, hypothyroidism is caused by a lack of hormone secretion, which may cause a person to experience sluggishness in metabolism, abrupt weight gain, slow heartbeat with a low pulse

• Researchers have achieved considerable success in detecting thyroid illnesses; however, it is advised to utilize several parameters to diagnose thyroid problems. More criteria necessitate more clinical testing for patients, which is both costly and time-consuming. As a result, predictive models must be constructed that use as few parameters as feasible in detecting the illnesses while saving both money and time for patients. When compared to prior studies, the dataset of this research contributes fewer, but crucial and effective, characteristics for better diagnosis of the disease;
• It is critical to clean the sample data before modeling to ensure that the data best reflect the situation. A dataset may contain missing values and extreme values that lie outside the anticipated range and differ from the rest of the data. The latter are known as outliers, and understanding and eliminating them may frequently enhance the performance of machine-learning models. Therefore, in this study, the early preprocessing includes the detection of missing values and outliers and their replacement with the mean values of the features concerned;
• The feature-selection process uses feature significance ratings based on the feature importance estimated from the dataset. The training dataset is used to choose features, and then the model is trained using the selected features and evaluated on the test set. Both datasets were subjected to the XGBoost (XGB) feature importance to acquire a clear picture of the attribute relevance before selection;
• Feature selection is a procedure that automatically picks the characteristics in a dataset that contribute the most to the output variables. The presence of irrelevant characteristics in the data might reduce the performance of many models. Feature selection before modeling can improve accuracy and also reduces training time and the likelihood of overfitting. In this study, we implemented three popular attribute-selection techniques: SFM, RFE, and SKB;
• The concept of a multilevel ensemble is introduced in this experimental work, where the predictions of the bagging and boosting ensemble classifiers are passed to a voting ensemble with soft and hard voting. This methodology obtained state-of-the-art results on both the proposed and open-source datasets. For performance evaluation, multiple metrics such as recall, Hamming loss, and precision have been used.

Literature Review
In the healthcare industry, data-mining techniques such as classification, segmentation, correlation, clustering, and regression may be used to detect diseases [13]. Every year, a significant number of people are hospitalized with thyroid disorders. As a result, obtaining an early and accurate diagnosis is becoming increasingly challenging for healthcare facilities. This literature review highlights several machine-learning (ML) techniques used to classify thyroid illness in various studies. Early detection of indications plays a crucial part in the effective diagnosis of various diseases, which requires the use of ML algorithms for accurate prediction. Mushtaq et al. [14] used the KNN algorithm for breast cancer classification. KNN relies on a measure of the distance between data points, and its performance depends on the K value, which is the number of adjacent entities considered. To discover an effective KNN, that research investigated KNN performance with multiple distance functions and K values. Researchers in many other domains have likewise used different ML techniques for better and more efficient detection and diagnosis of diseases.
The research in [15] employs two ML methods, SVM and RF, to identify thyroid conditions. The Thyroid Dataset from the University of California Irvine (UCI) was used for the investigation. Both methods were evaluated in terms of accuracy, recall, F-score, and precision. The SVM and RF models scored 91% and 89% accuracy, respectively, indicating that SVM outperformed RF in the identification of thyroid issues. ML classifiers were used to predict thyroid issues in [16]. Data preparation techniques were adopted to simplify the data so that algorithms could detect the risk of patients acquiring this disease. Machine learning is widely used for disease prediction; SVM, DT, LR, artificial neural network (ANN), and KNN are some of the approaches that scientists employ to predict whether a patient may acquire thyroid illness. A website was built to collect user input and provide informed estimates regarding the type of illness. Sonuc et al. [17] divided thyroid illness into three groups based on data from Iraqi citizens, some of whom had an overactive thyroid while others had hypothyroidism. SVM, DT, RF, NB, LR, KNN, linear discriminant analysis (LDA), and multilayer perceptron (MLP) were implemented to classify thyroid issues. The most accurate classifiers in descending order were RF, DT, NB, LR, KNN, and LDA, followed by MLP with 89 and 88% accuracy, respectively. The supervised learning approach was adopted in [18], where the Anaconda and Python platforms were used to implement algorithms for identifying the type of thyroid illness. The authors employed a variety of methods, including SVM, KNN, DT, naïve Bayes (NB), RF, and LR, among others. The results were plotted to evaluate how well LR matches up to RF in terms of accuracy. This technique makes a low-cost thyroid diagnostic report available to patients.
To identify thyroid texture, the researchers in [19] offer three ML-based methods: SVM, RF, and ANN. They generated 30 spectral energy-based attributes for these classifiers during training via autoregressive modeling on a signal variant of 2D thyroid ultrasound (US) images. Instead of using text-based descriptors, they employed image-based characteristics to represent thyroid tissues. When all three methods were used collectively, accuracy hovered around 90%.
Zhu et al. [20] proposed the use of an ANN to develop a model for distinguishing benign from malignant nodules and for enhancing the accuracy of US-based objective diagnosis. Key sonographic markers and statistically significant changes made up the input layer of the ANN, which was utilized to predict nodule malignancy. On ultrasonography, the size, structure, echogenicity, internal composition, nodules, and peripheral halo had a substantial association with malignant nodules. When used on the training cohort, the ANN accurately predicted 82.3% of thyroid cancer cases with a value of 0.818 for area under the curve (AUC) and an 84.5% accuracy rate. In the validation cohort, the method had an accuracy, sensitivity, and specificity of 83.1%, 83.8%, and 81.82%, respectively, with an AUC score of 0.828. Clinical datasets were used in [21] to compare SVM, NB, and DT classifiers. The SVM algorithm is among the most extensively used in ML. The researchers combined two feature-selection techniques to compare the model's performance. The filter technique was used first to pick the features, and the classifier's effectiveness was assessed using the wrapper approach. The Fisher Discriminant Ratio (FDR) value was used in binary classification to rank features based on significance. Three performance indicators were used to evaluate the addition of advanced features to the classification model. Sequential forward and sequential backward selection, two well-known systematic attribute-selection techniques, were utilized in the analysis in [22]. In nonlinear optimization problems, the evolutionary method is a popular strategy for picking features. The SVM was used to detect hypothyroidism. Thyroid disease was examined using two distinct types of data in this study. The first dataset was taken from the UCI repository, while the second dataset featured real data from the Imam Khomeini Hospital at the K. N. Toosi University of Technology's Intelligent System Lab.
To speed up CH diagnosis and therapy selection, a data-mining approach was applied by researchers. As part of this cross-sectional study [23], the authors deployed SVM, MLP, the Chi-Squared Automatic Interaction Detector (CHAID), and Iterative Dichotomiser-3 as classification algorithms. By combining these classification methods with bootstrap aggregating (bagging) and boosting procedures, the negative impacts of dataset imbalance on classification outcomes were minimized. When using SVM-Bagging, precision and specificity were both 100%, the recall was 73.33%, and the F-measure was 84.62%. The investigative findings by the authors in [24] revealed the DT attribute-partitioning criteria for thyroid disease detection, with which thyroid nodules may be classified effectively and efficiently. In this study, methodologies such as DT, SVM, and NB were used to make a comparative diagnosis of thyroid illness. The accuracy goal for this classification is 99.89%. Previous attempts at employing the DT had disappointing results. Data from the UCI repository was used by Geetha et al. [25] in their analysis. The Hybrid Differential Evolution (HDE) kernel-based Naïve algorithm reduced the dimensionality from the existing 21 features to 10 attributes before running the algorithm. The accuracy of the kernel-based NB classification method rose to 92% as a result. For a hybrid model, it is critical to have a strong knowledge base that can be leveraged to tackle difficult learning tasks such as clinical diagnosis and prognostication. A variety of ML methods and thyroid disease-prevention diagnostics were examined in [26]. Depending on a patient's medical history, algorithms such as SVM, KNN, and DT were employed to assess the risk of thyroid illness. A three-stage approach for treating thyroid illness was devised by another team of researchers led by Chen et al. in [27]. Fedushko et al. proposed a system based on big data and operational intelligence, including distinct machine-learning and preprocessing techniques for effective classification, in [28]. The (FS-PSO-SVM) CAD technique with particle swarm optimization performed better than the current methods and obtained a precision of 98.59% by utilizing 10-fold cross-validation. Dogantekin et al. classified thyroid illness with an accuracy of 91.86% using feature extraction, feature reduction, and classification phases with generalized discriminant analysis (GDA) and wavelet support vector machines (WSVM) [29]. Keleş et al. developed an expert system for thyroid disease diagnosis (ESTDD). The neuro-fuzzy classification (NEFCLASS) method was used to apply fuzzy rules, and the results showed 95.33% accuracy [30]. According to Ozyilmaz et al., using a variety of neural network approaches, including back-propagation-based MLP, the radial basis function, and the adaptive conic-section function neural network, thyroid diagnostic accuracy was shown to be 88% [31].

Materials and Methods
The research methodology is shown in Figure 1. Before beginning the analysis, it is crucial to display and visualize the data. Data preprocessing cleans the data, removes useless attributes, and handles missing values, which improves the data representation and the accuracy of the model. Next, we used an XGBoost classifier to visually represent the importance of attributes based on the F-score [32]. Furthermore, the three feature-selection techniques used, SFM, SKB, and RFE, are presented with their estimators in Figure 1.
The next step was the detection and removal of the outliers. It is essential to detect and locate the outliers after attribute selection, because the number of outliers present depends on the number of selected features. The data from the selected features are then normalized by using a scaling approach.
For this purpose, both standard and min-max scaling were implemented. Feature scaling brings the features onto a comparable range. Finally, the homogenous ensembles are applied: bagging (RF and Bagging Meta Estimator (BME)) and boosting (AdaBoost (AB) and XGB). After the classification, the predictions are passed to the voting ensemble, involving both soft and hard voting. Performance evaluation measures are then used to assess the implemented methodologies.
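As an illustration of the two scaling options named above, the following is a minimal NumPy sketch. The function names and the small array are hypothetical and are not taken from the study's code or data.

```python
import numpy as np

def standard_scale(X):
    """Center each column to zero mean and scale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X - mean) / std

def min_max_scale(X):
    """Map each column linearly onto the range [0, 1]."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # leave constant columns unscaled
    return (X - lo) / span

# Hypothetical measurements: rows are subjects, columns are two features.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_std = standard_scale(X)
X_mm = min_max_scale(X)
```

In practice the column statistics would be computed on the training split only and then reused for the test split, so that no information leaks from the test data.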

Dataset Description
This experimental research work focused on a dataset related to thyroid disease. The dataset was collected from a district headquarter hospital in the city of Dera Ghazi Khan, Punjab province, Pakistan. It was carefully evaluated and verified by two expert endocrinologists from a well-known hospital in Karachi, Pakistan [33]. The dataset contains 309 records, one per subject. Each person underwent ten different screening tests, which are represented as features, together with one target variable represented as 'Class'. This outcome variable is categorized into three distinct classes: 'Hypo' for hypothyroidism, 'Normal', and 'Hyper' for hyperthyroidism. There is a total of 13 missing values, represented as '?', in the 'T3' feature. Table 1 shows the details of this dataset. The descriptions of the output variable and categories are illustrated in Figure 2.

Data Preprocessing
The dataset used in this research study is in the form of a CSV file. It may therefore contain missing values, and it has a few useless columns such as the 'Sr. No.' and 'Reference IDs' of the patients. These attributes should be removed from the dataset, as they have no specific impact on the outcome variable 'Class' and can severely affect the performance of the models. The dataset also contains very few integer and real values, and most of the attribute details are in the form of strings and characters. Hence, it is difficult for the libraries to perform operations on these values directly. We convert these characters or strings into real numbers or integer values; for example, in 'Pregnancy' the value 'Yes' is represented as 1 and 'No' is denoted as 0. All the remaining features have been converted in a similar way. The 13 missing values in 'T3', represented by '?', were replaced with the mean value to obtain better performance. The entire data-cleaning process was conducted in the preprocessing step.
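The cleaning steps described above can be sketched with pandas as follows. The rows are invented for illustration and are not the hospital data; the column names follow the text, but their exact spelling in the real CSV file is an assumption.

```python
import pandas as pd

# Hypothetical rows mimicking the described layout; not the real dataset.
df = pd.DataFrame({
    "Sr. No.": [1, 2, 3, 4],
    "Reference IDs": ["A1", "A2", "A3", "A4"],
    "Pregnancy": ["Yes", "No", "No", "Yes"],
    "T3": ["1.0", "?", "3.0", "2.0"],
    "Class": ["Hypo", "Normal", "Hyper", "Normal"],
})

# Drop identifier columns that carry no diagnostic information.
df = df.drop(columns=["Sr. No.", "Reference IDs"])

# Encode the binary string feature as integers.
df["Pregnancy"] = df["Pregnancy"].map({"Yes": 1, "No": 0})

# Convert T3 to numbers ('?' becomes NaN), then impute the column mean.
df["T3"] = pd.to_numeric(df["T3"], errors="coerce")
df["T3"] = df["T3"].fillna(df["T3"].mean())
```

Here the single '?' entry is replaced by the mean of the three observed T3 values.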

XGBoost-Based Feature Importance by Using F-Score
Strategies that allocate a score to input features depending on how valuable they are at forecasting an outcome variable are known as feature importance. In a predictive modeling project, feature relevance scores play a significant role in providing information about the attributes and insight into the model. They lay the foundation for dimensionality reduction of high-dimensional data and for attribute selection, which can increase the efficiency and efficacy of a forecasting model on the problem. Statistical correlation scores, coefficients generated as part of linear and regression models, RF- and DT-oriented attribute scores, permutation-based scores, and F-score-based attribute importance are some of the most commonly used methods [32]. The XGB-based feature importance is applied through the SFM class [34], which takes a model and transforms a dataset into a subset containing the chosen features. A model that has already been trained on the complete training dataset can be used with this method. SFM then uses a threshold to decide which attributes to keep, and its transform() method applies this threshold so that the same features are selected for the training and test sets. In this work, XGB is first trained and then tested on the proposed dataset. The model is then wrapped in an SFM instance based on the feature importance determined from the training data. The training dataset is used to choose features, and the model is trained using the subset of features that have been selected. Finally, the model is evaluated on the test dataset, to which the same feature selection is applied. This technique is very helpful for a better diagnosis of thyroid disorder. Figure 3 illustrates the feature importance with the highest F-scores for the selected features by each attribute-selection technique.
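The train, select, retrain, and evaluate workflow described in this section can be sketched with scikit-learn as shown below. To keep the sketch dependent on scikit-learn only, GradientBoostingClassifier is swapped in for XGB; the synthetic data, the split ratio, and the "mean" threshold are likewise assumptions rather than the study's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 1) Fit on all features to obtain importance scores.
full_model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# 2) Wrap the already-fitted model; keep features scoring at least the mean.
selector = SelectFromModel(full_model, threshold="mean", prefit=True)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)  # same columns as the training subset

# 3) Retrain on the reduced feature set and evaluate on the held-out split.
reduced_model = GradientBoostingClassifier(random_state=0)
reduced_model.fit(X_train_sel, y_train)
accuracy = reduced_model.score(X_test_sel, y_test)
```

Because the selector is derived from the training split only, the test split plays no part in deciding which features are kept.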

Attribute Selection Approaches
The process of identifying the most reliable, nonredundant, and relevant characteristics for use in model development is known as feature selection. As the number and diversity of datasets expand, it is critical to reduce their size methodically. The primary objective of feature selection is to boost the effectiveness of a predictive model while lowering the computational cost of modeling. The details of each feature-selection method are given in Table 2.

Select From Model (SFM)
The SFM is a meta-transformer that may be used in conjunction with any estimator that assigns significance to each feature via a particular attribute (such as coef_ or feature_importances_) or via an importance extractor. If the importance of a feature is less than the specified threshold parameter, the feature is considered irrelevant and removed. In addition to providing the threshold numerically, there are built-in heuristics for calculating a threshold from a text input. For example, "median", "mean", and fractional multiples of these, such as "0.1*mean", are available heuristics. In conjunction with the threshold criterion, the max_features option may be used to restrict the number of selected features. The implementation of SFM has been performed by using the sklearn package [35]. The estimators used in this approach are LR [36], RF [37], DT [38], and GB [39].
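To make the threshold heuristics concrete, the sketch below counts how many features survive under three SFM settings, using LR as the estimator. The data are synthetic and the helper n_selected is hypothetical; only the scikit-learn calls themselves are library API.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=1)

def n_selected(**kwargs):
    """Fit SelectFromModel around LR and count the surviving features."""
    sfm = SelectFromModel(LogisticRegression(max_iter=1000), **kwargs)
    return int(sfm.fit(X, y).get_support().sum())

by_median = n_selected(threshold="median")
by_scaled_mean = n_selected(threshold="0.1*mean")
# An infinitely low threshold disables the cut-off, so only max_features applies.
top_two = n_selected(threshold=-float("inf"), max_features=2)
```

For LR the importance of a feature is the absolute value of its coefficient, so unscaled features with very different ranges would distort the ranking.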

Recursive Feature Elimination (RFE)
RFE is a feature-selection algorithm with a wrapper framework. This means that a distinct classification algorithm is provided and used at the method's core, wrapped by RFE, to assist in the feature-selection process. This contrasts with filter-oriented attribute selection, where each feature is scored and features are selected based on the highest or lowest scores. RFE is a wrapper-based method that internally employs filter-based feature selection. RFE finds a set of attributes by starting with all the features in the training sample and successively eliminating features until the target number is reached. The whole attribute-selection procedure is achieved by fitting the provided ML algorithm at the model's core, ranking features by significance, removing the least essential features, and refitting the model. This process is repeated until only the specified number of features remains. Features are rated using either the supplied ML model (e.g., some algorithms such as DT provide importance ratings) or a statistical technique. RFE is implemented in the sklearn ML package [35]. To use the RFE transformation, we first set up the class with the algorithm of choice, provided via the "estimator" parameter, and the number of attributes to pick, provided via the "n_features_to_select" parameter. In this experimental study, the core model used is DT and the estimators are the same as discussed in Section 2.4.1, namely LR, RF, DT, and GB.
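The elimination loop described above is available as the RFE class in scikit-learn. The following is a minimal sketch with a DT at the core; the synthetic data and the target of four features are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)

# Remove one feature per iteration until four remain.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=4, step=1)
rfe.fit(X, y)

selected = rfe.support_   # boolean mask over the 10 columns
ranking = rfe.ranking_    # 1 = kept; larger values were eliminated earlier
X_reduced = rfe.transform(X)
```

The ranking_ attribute records the order of elimination, which can help when deciding whether the chosen number of features was too aggressive.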

Univariate Feature Selection Based Select K-Best (SKB)
In this approach, statistical measures are used to identify the characteristics with the strongest link to the output variable. The SelectKBest class in the sklearn package is used with a series of statistical tests to pick a particular number of features. This research work selects the best characteristics as detailed in Table 2. The first scoring method was the chi-squared (chi2) test, which computes the chi-squared statistic between each non-negative feature and the class. The other scoring function implemented in SKB is f_classif, denoted as FCI, which calculates the ANOVA F-value for the supplied sample.
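A minimal sketch of SKB with both scoring functions is given below. The data are synthetic: one column is constructed to track the class label and the other three are noise, so both tests should single out the first column. The value k=1 is an assumption for the example only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
# Column 0 tracks the class label; columns 1-3 are non-negative noise.
X = np.column_stack([
    y * 5.0 + rng.random(200),
    rng.random(200),
    rng.random(200),
    rng.random(200),
])

chi2_selector = SelectKBest(score_func=chi2, k=1).fit(X, y)
anova_selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)

chi2_choice = chi2_selector.get_support(indices=True)
anova_choice = anova_selector.get_support(indices=True)
```

Note that chi2 requires non-negative inputs, so it should be applied before standard scaling or after min-max scaling.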

Automatic Outlier Detection and Removal using Isolation Forest (ISO)
It is critical to clean the data samples before modeling to guarantee that the observations accurately reflect the situation. A dataset may contain extreme values that lie beyond the anticipated range and are dissimilar to the rest of the data. These extreme values are known as outliers: unique observations that stand out from the others. Understanding and even eliminating these outlier values can help enhance ML modeling and model ability in general. Because of the unique characteristics of each dataset, there is no exact technique to describe and detect outliers in general. The common practice is to evaluate the raw data and determine whether a given observation is an anomaly or not. Statistical techniques can be used to detect occurrences that appear to be unusual or implausible based on the available data. The fitted model then determines which samples in the training set are outliers and which are inliers. Once the outliers have been eliminated from the training dataset, the model is fitted to the remaining instances and assessed on the complete test dataset. The ISO method, a tree-based outlier-detection method, is employed in this research. It is based on modeling regular data in such a way that anomalies, which are both few in number and distinct in feature space, are isolated [40]. Table 3 reports the outlier detection in the dataset for each feature-selection technique with the mean absolute error (MAE).
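The flag-and-remove step can be sketched with the IsolationForest class in scikit-learn, as below. The data are synthetic with two planted extreme points, and the contamination value is an assumed guess at the outlier fraction, not a setting reported in the study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extremes = np.array([[12.0, 12.0], [-11.0, 13.0]])  # planted outliers
X_train = np.vstack([inliers, extremes])

# contamination is an assumed estimate of the share of outliers.
iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(X_train)  # +1 = inlier, -1 = outlier

# Keep only the rows flagged as inliers for subsequent model fitting.
X_clean = X_train[labels == 1]
```

Only the training rows are filtered in this way; the test set is left complete, as described above.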

Homogenous Ensemble
Ensemble techniques in ML and data mining employ several learning algorithms to achieve higher prediction performance than each of the individual learning algorithms alone. A homogenous ensemble is a series of classification models of the same type, where each is constructed on a distinct sample of data [41]. The two crucial types of the homogenous ensemble are bagging and boosting, which have been implemented in this research as initial ensembles.

Bagging
Bagging is a method that improves accuracy by decreasing the variance of the predictions. As a result, overfitting, which is a major problem with many prediction models, is reduced. Homogeneous weak classifiers learn from the data in parallel, independently of one another, and their results are then combined by averaging. Because the weak base classifiers are merged to produce a single but powerful classification model, the approach is more reliable than using single models. The biggest problem with these models is that they are computationally expensive. When we train a model, we obtain a function that takes an input, gives an output, and is defined with respect to the training dataset, regardless of whether we are dealing with a regression or a classification problem. Because the training dataset is itself a sample that could have been drawn differently, the fitted model is also subject to variability.
The concept of bagging is straightforward: we want to build a model with reduced variance by "averaging" the predictions from multiple different models. However, in practice, we are unable to build entirely independent models because of the large amount of data this would require. To fit almost independent models, we rely on the good "approximate characteristics" of bootstrap samples. This starts with creating several bootstrap samples, each one acting as a separate and nearly independent dataset taken from the real distribution. For each of these samples, we may then train a weak learner, and eventually combine the learners such that their outputs are averaged, resulting in an ensemble classifier with reduced variance. Approximate independence and identical distribution are characteristics of bootstrap samples, and this is also true for the learned base models. The bagging classifiers used in this research are RF and the Bagging Meta Estimator (BME).
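As a rough illustration of the bootstrap-and-average idea described in this section, the sketch below builds a small bagged ensemble by hand. It is not the RF or BME configuration used in the study; the synthetic data, the number of models, and the unpruned DT base learner are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_models = 25
probas = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    probas.append(tree.predict_proba(X_test))

# Average the per-model class probabilities, then pick the likeliest class.
avg_proba = np.mean(probas, axis=0)
y_pred = avg_proba.argmax(axis=1)
bagged_accuracy = float((y_pred == y_test).mean())
```

Each tree sees a different resampled view of the training data, which is what makes the averaged prediction less sensitive to any single sample.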

Boosting
Boosting is one of the most commonly used and powerful ensemble techniques. It was developed for classification problems and was later extended to regression problems as well. In sequential approaches, the multiple weak models are no longer fitted independently of each other. Instead, the models are fitted iteratively, so that the training of the model at each stage depends on the models fitted in the previous stages. Boosting is the best known of these techniques, and it produces an ensemble model that is less biased than the weak learners that comprise it. Available boosting algorithms include AdaBoost (AB), XGBoost (XGB), Gradient Boosting Machine (GBM), and LightGBM. In boosting, the samples that earlier models predict incorrectly are analyzed, and the models' outcomes are combined to obtain a better prediction. Boosting thus demonstrates the fundamental ensemble concept of transforming weak classifiers into a stronger one. The boosting models used in this research are:
• AdaBoost (AB) [44];
• XGBoost (XGB) [45].
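To make the sequential dependence concrete, the sketch below shows one reweighting round of the classic discrete AdaBoost scheme for labels in {-1, +1}. It is an illustration only and is not taken from this study's code: the function name `adaboost_round` and the example predictions are hypothetical, and the sample weights are assumed to sum to one.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return the learner weight alpha and the new sample weights.

    Assumes labels in {-1, +1} and sample weights that sum to 1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are up-weighted, correct ones down-weighted.
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Hypothetical round: five samples, uniform weights, the last one misclassified.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round([0.2] * 5, y_true, y_pred)
print(alpha, new_weights)
```

After this round the misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.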

Voting Ensemble of Homogenous Ensemble
A voting ensemble (sometimes known as a "majority voting ensemble") is an ML ensemble model that combines the predictions of several other models. It is a strategy for increasing model performance, ideally outperforming any single model in the ensemble. It can be used for both classification and regression. For regression, the ensemble prediction is the average of the models' predictions. For classification, the votes for each label are added together, and the label with the highest number of votes is predicted.
A voting ensemble can be thought of as a metamodel, or a model of models. It may be used as a metamodel with any collection of already trained ML models, and the existing models need not be aware that they are being used in the ensemble. A voting ensemble is appropriate when two or more models perform well on a predictive modeling task. The ensemble models must generally agree on their predictions [46]. There are two ways to predict the majority vote for classification: hard voting and soft voting [47]. Details of both voting techniques are as follows.

Soft Voting Ensemble
Figure 4a illustrates the soft voting process. Soft voting sums the predicted probabilities or scores for each target class across the models and predicts the class label with the highest summed probability. Let the classifiers be $C_1, C_2, \ldots, C_n$, and let the probabilities produced by each classifier be $\mathrm{Prob}_n^{\max}$ and $\mathrm{Prob}_n^{\min}$. If the total number of classes is two, these classes are represented as Class 1 = 0 and Class 2 = 1. The weights assigned to the classifiers are denoted $W_1, W_2, \ldots, W_n$. The weighted probabilities for each target class are calculated first, and the averages of the target variable classes are then obtained from them.
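The soft voting rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `soft_vote` and the probability values are hypothetical, and the weighted average normalized by the sum of the weights is one common form of the rule that is assumed here.

```python
def soft_vote(probabilities, weights=None):
    """Weighted soft voting.

    probabilities: one list [p_class0, p_class1, ...] per classifier.
    weights: optional list W_1..W_n; equal weights are assumed if omitted.
    Returns (predicted label, averaged class probabilities).
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[k] for w, p in zip(weights, probabilities)) / total_weight
        for k in range(n_classes)
    ]
    # Ties resolve to the lowest class label (an assumption of this sketch).
    label = max(range(n_classes), key=lambda k: averaged[k])
    return label, averaged


# Hypothetical outputs of four classifiers (e.g., RF, BME, AB, XGB) for one sample.
probs = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55], [0.80, 0.20]]
label, averaged = soft_vote(probs)
print(label, averaged)
```

In this example two classifiers individually prefer class 1, yet the ensemble predicts class 0 because the other two are far more confident. Soft voting takes confidence into account in this way, whereas hard voting counts only the labels.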

Hard Voting Ensemble
Hard voting counts the predicted class labels from all models and predicts the class that receives the most votes. In other words, the hard voting classifier assigns to the input the mode of the predictions made by the individual classifiers. When the weights associated with the individual algorithms are identical, this reduces to simple majority voting. Consider again a total of $n$ classifiers, $C_1, C_2, \ldots, C_n$, whose predictions are denoted $P_0$ and $P_1$. Figure 4b illustrates the hard voting classification process.
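A hard voting rule of this kind can be sketched as below. The function name `hard_vote`, the optional weights, and the tie-breaking rule (lowest class label wins) are assumptions made for illustration rather than details given in this study.

```python
from collections import Counter


def hard_vote(predictions, weights=None):
    """Return the class label with the largest (optionally weighted) vote count."""
    if weights is None:
        weights = [1] * len(predictions)  # equal weights: plain majority voting
    tally = Counter()
    for label, weight in zip(predictions, weights):
        tally[label] += weight
    best = max(tally.values())
    # Assumed tie-break: choose the smallest label among the tied classes.
    return min(label for label, votes in tally.items() if votes == best)


# Hypothetical labels predicted by four classifiers for one sample.
print(hard_vote([1, 1, 0, 1]))
```

With equal weights the function returns the mode of the predictions. Passing unequal weights lets a trusted classifier outvote the others.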

Performance Assessment Metrics
Classification algorithms may be assessed in a variety of ways, and the metrics should be interpreted appropriately when comparing learning methods. Several metrics derived from the confusion matrix have been used to assess diagnostic tests for the classification of breast cancer [48,49] and human physiological conditions [50,51] using various ML classifiers. The confusion matrix includes four key terms: A = true positive (TP), B = true negative (TN), C = false positive (FP), and D = false negative (FN). TP indicates that the system predicts the positive class and the actual outcome is also positive. FP refers to cases in which the system predicts the positive class, but the actual outcome is negative. TN indicates that the system predicts the negative class and the actual outcome is also negative. FN refers to cases in which the system predicts the negative class when the actual outcome is positive.

Confusion Matrix-Based Metrics
The most significant and often used metric for assessing classifier performance is accuracy, which is calculated by dividing the number of correctly predicted samples by the total number of observations in the dataset.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
The true positive rate (TPR), or recall, is the ratio of correctly predicted positive samples to all actual positive samples.
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
Precision, or positive predictive value (PPV), is the proportion of relevant instances among the retrieved instances.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
The F-measure, also referred to as the F1-score, is the harmonic mean of precision and recall. A score of 1 indicates an excellent model, that is, one with few false positives and false negatives, whereas a value of 0 indicates poor performance.
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The Matthews correlation coefficient (MCC) was introduced by Brian W. Matthews in 1975. This coefficient represents the correlation between the observed and predicted classifications. MCC is determined from the confusion matrix: a value of +1 reflects perfect prediction, while a value of −1 indicates total disagreement between the predicted and true values.
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
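The confusion-matrix metrics above follow directly from the four counts. The sketch below uses the standard textbook definitions. The function name `confusion_metrics`, the convention of returning 0.0 when a denominator is zero, and the example counts are assumptions for illustration and are not results from this study.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, F1-score, and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "mcc": mcc}


# Hypothetical counts for a 100-sample test set.
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

A classifier with no false positives and no false negatives scores 1.0 on all five metrics.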

Statistical Test
Cohen's kappa is a statistical measure of the degree of agreement between two raters. It can also be used to gauge how well a classification model performs.
$$\kappa = \frac{p_0 - p_e}{1 - p_e}$$
where $p_0$ represents the overall model accuracy, and $p_e$ represents the agreement between the predicted and actual class values that would be expected by chance.
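For a binary problem, Cohen's kappa can be computed from the same four confusion-matrix counts. The sketch below assumes the standard definition, in which the chance agreement is obtained from the marginal frequencies of the predicted and actual classes. The function name `cohen_kappa` and the example counts are hypothetical.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a binary confusion matrix (assumes chance agreement < 1)."""
    total = tp + tn + fp + fn
    p0 = (tp + tn) / total  # observed agreement, i.e. accuracy
    # Chance agreement: both "positive" by chance plus both "negative" by chance.
    p_positive = ((tp + fp) / total) * ((tp + fn) / total)
    p_negative = ((tn + fn) / total) * ((tn + fp) / total)
    pe = p_positive + p_negative
    return (p0 - pe) / (1 - pe)


# Hypothetical counts for a 100-sample test set.
kappa = cohen_kappa(tp=40, tn=45, fp=5, fn=10)
print(kappa)
```

Here the accuracy is 0.85 but the chance agreement is 0.5, so kappa is lower than the accuracy. It equals 1 only when the predictions agree with the true labels on every sample.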

Loss and Error Finding
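Mean absolute error (MAE) measures how far the forecasted values are from the original values, averaged across the whole dataset.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|$$
Mean square error (MSE) reflects the difference between the actual and predicted values, calculated by averaging the squared differences across the whole dataset.
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$
where $Y_i$ is the original value, $\hat{Y}_i$ represents the predicted value, and $n$ is the number of samples.
The Hamming loss (HL) is the fraction of incorrectly predicted labels.
$$\mathrm{HL} = \frac{1}{N L}\sum_{i=1}^{N}\sum_{j=1}^{L}\mathbb{1}\left[Y_{i,j} \neq Z_{i,j}\right]$$
where $Y_{i,j}$ is the target, $Z_{i,j}$ denotes the predicted value, $N$ is the number of samples, and $L$ is the number of labels.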
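The three loss measures can be computed directly from paired lists of true and predicted values. The sketch below assumes single-label predictions (one label per sample) for the Hamming loss. The function names mirror common usage but are defined locally here, and the label vectors are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of labels predicted incorrectly (single-label case)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical binary labels: one of five predictions is wrong.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(mean_absolute_error(y_true, y_pred),
      mean_squared_error(y_true, y_pred),
      hamming_loss(y_true, y_pred))
```

For 0/1 labels all three measures coincide with the misclassification rate, so a classifier with 100% accuracy has zero MAE, MSE, and Hamming loss.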

Results and Discussion
The proposed methodology, a voting ensemble of homogenous ensembles combined with three feature-selection approaches and multiple estimators, is evaluated in this section. The experiments were performed in Python using a Jupyter Notebook with multiple ML libraries and packages. The data were split into 75% training and 25% testing sets, and the hyperparameters of the classifiers were tuned. Table 4 presents the accuracy of RF and BME, including their training and prediction times, for each attribute-selection technique with the estimator or function used. All the classifiers obtained 100% accuracy with all the estimators in the implemented feature-selection approaches, with one exception: the LR estimator in SFM attribute selection obtained 98.71% with the RF base learner. The lowest training and prediction time with 100% accuracy was attained by RFE feature selection with the DT estimator, only one selected feature, and the BME bagging model.
The performance of the boosting predictors AB and XGB is shown in Table 5. All the estimators with their feature-selection methods attained 100% accuracy, except the FCI function in the SKB method with the XGB classifier, which obtained 97.43% accuracy.
Table 6 presents the second stage of the ensemble, the voting ensemble with both soft and hard voting strategies. Although the computational cost of this stage is slightly higher than that of the first stage, in terms of accuracy and the other performance assessment measures the proposed method attained a state-of-the-art result, with zero error and loss and 100% precision, recall, MCC, kappa, and the other metrics. The voting ensemble of homogenous ensembles incurs a minor delay in computation because of the process of determining the calculated or assigned weights and averages for the bagging and boosting classifiers in soft and hard voting.
Figure 5 represents the confusion matrices for the soft voting ensemble of the bagging (RF, BME) and boosting (AB, XGB) predictors. Soft voting involves equal weights for each classifier. Figure 5a-d represents the confusion matrices with SFM feature selection, whereas Figure 5e,f exhibit the SKB, and RFE is included in Figure 5g-j, illustrating the performance of each implemented estimator or function.
Similarly, Figure 6 illustrates the hard voting ensemble. Figure 6a-d represents SFM, Figure 6e,f denotes SKB, and Figure 6g-j illustrates the confusion matrices for the RFE attribute-selection approach with the estimators.
As shown in Table 7, the method used in this research work was compared with existing studies on the same dataset. The results of the proposed study were obtained by utilizing homogenous (bagging and boosting) ensembles with multiple feature-selection techniques. The study aimed for greater accuracy and reduced training and prediction times. Multiple feature-selection techniques and outlier and anomaly detection were combined with initial ensemble classifiers based on bagging and boosting. The final prediction was made by the second stage of the ensemble process, voting (soft and hard). The proposed method thus combines complex algorithms and distinct strategies. This study attained the best results, with accuracy, recall, and F1-score of 100%, while using less training and prediction time than existing hybrid models. The comparison indicates that existing approaches are not only more expensive to implement but also require more time to train and validate results.

Conclusions
The early detection and diagnosis of disease are critical for human survival. Recognition and identification have become more precise and accurate due to the use of machine-learning algorithms. Thyroid disease is difficult to diagnose because its symptoms can be mistaken for those of other ailments. New features in the thyroid dataset have a positive impact on classifier performance, and the results show better accuracy than previous studies. This research work focused on the implementation of a voting ensemble of homogenous ensembles in combination with three separate attribute-selection techniques. The necessary preprocessing and the detection of outliers in the selected features were conducted before the classification process. The bagging and boosting ensembles each contribute two algorithms to the initial ensemble. The bagging ensembles comprise the random forest and bagging meta-estimator (BME) algorithms, whereas the boosting ensembles comprise AdaBoost and XGBoost. Among all implemented ensemble techniques, the BME shows better performance, achieving the best accuracy with less training and prediction time. This consistency in performance is independent of the total number of features selected for the dataset. In the second part of the classification, a voting ensemble with both hard and soft voting was implemented. Results show that all the feature-selection techniques, in combination with multiple estimators and ensemble techniques, attained the highest accuracy of 100% with a very low computational cost. Our proposed approach also obtained 100% results in terms of the other performance evaluation metrics used. In comparison with existing studies, our method achieved the best results on the thyroid illness dataset.