Utilizing Nearest-Neighbor Clustering for Addressing Imbalanced Datasets in Bioengineering

Imbalance classification is common in scenarios like fault diagnosis, intrusion detection, and medical diagnosis, where obtaining abnormal data is difficult. This article addresses a one-class problem, implementing and refining the One-Class Nearest-Neighbor (OCNN) algorithm. The original inter-quartile range mechanism is replaced with the K-means with outlier removal (KMOR) algorithm for efficient outlier identification in the target class. Parameters are optimized by treating these outliers as non-target-class samples. A new algorithm, the Location-based Nearest-Neighbor (LBNN) algorithm, clusters one-class training data using KMOR and calculates the farthest distance and percentile for each test data point to determine if it belongs to the target class. Experiments cover parameter studies, validation on eight standard imbalanced datasets from KEEL, and three applications on real medical imbalanced datasets. Results show superior performance in precision, recall, and G-means compared to traditional classification models, making it effective for handling imbalanced data challenges.


Introduction
The problem of one-class classification poses a unique challenge in machine learning, where obtaining data from the target class is relatively easy, but data from the non-target class are either extremely scarce or entirely absent.Identifying the data sample from the nontarget class is crucial, as exemplified in the Stroke-Poor dataset, where correctly identifying patients enables more proactive medical interventions by healthcare professionals.In practical scenarios like rare disease identification, target class data (non-rare disease cases) often dominate the dataset, while non-target-class data are challenging to acquire due to cost constraints or physiological characteristics.In such highly imbalanced [1] or one-class situations, building a reasonable model using traditional supervised learning algorithms becomes a formidable task.
This article introduces a method based on K-means to replace the inter-quartile range mechanism in the One-Class Nearest Neighbor (OCNN) algorithm.Additionally, we propose a novel strategy, the Location-based Nearest Neighbor (LBNN) algorithm, aiming to provide improved model performance with comparable time complexity.Experimental validation involves assessing algorithm performance enhancement using KEEL datasets [2] and comparing with traditional algorithms in real medical data experiments.The article's structure includes the research motivation, objectives, and an overview of the article's organization.It discusses the problem background, the relevant literature, OCNN mechanisms, and the LBNN strategy and details the experimental process, providing analysis and interpretation of results for different experiment types.The article concludes with a summary of the findings and suggests future research directions.We believe the key contribution with this new strategy would provide better performance results on predicting imbalanced applications, such as heart diseases, diabetes mellitus, or septicemia, compared with current bio-engineering algorithms.Here, we emphasize that the target class we mentioned above is the data sample that is easier to access, which means the non-target class is the minority sample, which is a non-accessible class.

One-Class Classification
The assumption underlying the one-class classification problem [3] is that during the training process, only one class is available, referred to as the target class, while the remaining classes are considered non-target classes.One-class classifiers leverage the distinctive characteristics of the target class to identify a boundary encompassing all target class data or the majority of it.In practical applications, such as wearable devices for individual electrocardiogram monitoring in precision healthcare (predicting conditions like high blood sugar or potassium levels), initially, data may be collected only from healthy patients.It is impractical to wait until a sufficient amount of data are gathered before initiating predictions, presenting a typical scenario for a one-class classification problem.
An alternative solution involves using public data to predict individual health conditions, but this approach has proven ineffective due to the inconsistent physiological characteristics among individuals (e.g., variations in electrocardiogram wavelength, peak positions, and heights).To overcome the limitations of traditional classification algorithms in such situations, one-class classification algorithms play a crucial role.
A suitable one-class classifier must exhibit strong generalization capabilities, maintaining a high recognition rate for non-target classes while avoiding overutilization of target class information to prevent overfitting.Proper handling of target class and outlier [4,5] values is essential to derive effective decision boundaries.The one-class classification problem can be mathematically expressed as follows: Here, d(z) is an unknown measurement of data z with respect to the target class group (e.g., distance, density), θ z is the threshold for d(z), and f (z) is a function determining whether z is accepted as the target class.

One-Class Support Vector Machine (OC-SVM)
The OC-SVM [6] is a type of unsupervised algorithm based on the SVM [7,8], which can be used for novelty detection and anomaly detection.The objective of the OC-SVM is to find a decision function or hyperplane, attempting to separate the target class data from the non-target-class data.As illustrated below in Figure 1, most of the training data are allocated to one region and assigned a value of +1, while data outside this region are assigned a value of −1.The OC-SVM utilizes a kernel function (typically Gaussian) to map the input data to a higher dimensional space and aims to find a minimal hyperplane.Given a set of target class training data x i ∈ R d , i = 1, . . ., N , we can represent the following quadratic programming expression: Here, N represents the total number of data points, v ∈ (0, 1) is used to determine the upper limit of outlier values, ξ i represents the slack variable for each data point, and Φ is the kernel function.

One-Class Nearest-Neighbor (OCNN) Algorithm
The OCNN algorithm can be classified into four types based on the number of nearest neighbors [9,10] chosen: 1.
Find the nearest neighbor of the test data in the target class, and then find the nearest neighbor of this nearest neighbor (11NN).

2.
Find the nearest neighbor of the test data in the target class, and then find the K-nearest neighbors of this nearest neighbor (1KNN).

3.
Find J-nearest neighbors of the test data in the target class, and then find the nearest neighbor of each of these J-nearest neighbors (J1NN).4.
Find J-nearest neighbors of the test data in the target class, and then find the K-nearest neighbors of each of these J-nearest neighbors (JKNN).
Figure 2 illustrates the four different OCNN methods.The black circles represent the target class data, and the red asterisk represents an unknown data point.To determine whether the unknown data point belongs to the target class, different numbers of nearest neighbors are selected based on the parameters J and K.After calculating the average distance of J-nearest neighbors and the average distance of K-nearest neighbors, the values are compared with the threshold θ.Taking JKNN as an example, the detailed process please referring to Algorithm 1 as follows : Algorithm 1 Pseudo-code of JKNN Calculate the distance from the J-nearest neighbors of the test data z to their respective K-nearest neighbors and compute the average.NN tr k (NN tr j (z)) represents all K-nearest neighbors of the J-nearest neighbors, expressed as:  In the OCNN algorithm, the threshold θ can be fixed at 1 or chosen arbitrarily.Here, we will discuss the relationship between 11NN under different threshold values θ and other OCNN methods.
In Figure 2a, when 11NN has a threshold θ set to 1, if D 1 > D 11 , the test data will be classified as a non-target class (outlier), even if D 1 is only slightly larger than D 11 .Intuitively, the distance (D 1 ) from non-target class (or outlier) to its nearest neighbor should be much larger than the distance (D 11 ) from the nearest neighbor to itself.This can be expressed mathematically as: When θ > 1, some data that were originally classified as a non-target class due to the rule D 1 > D 11 will be accepted as a target class.This situation aligns more with our intuition about outliers.Finding the optimal θ will be an important issue, and the optimal θ will change depending on the dataset and evaluation criteria.
Figure 2b is the 1KNN for non-target class data, and we represent its distance to the nearest neighbor as: where D 1i (K ≥ i ≥ 1) is the distance from the test data point to its i-th nearest neighbor, and we can observe that D 1i should increase as i increases.Expanding the inequality we obtain: Based on the derivations from the inequalities ( 6) and (7), we obtain: This demonstrates that the 1KNN method can produce similar effects to 11NN (θ > 1).J1NN: In contrast to 1KNN, we consider Jnearest neighbors for the test data but only consider one nearest neighbor for each of these neighbors.As shown in Figure 2c, similar to the derivation of 1KNN, we obtain: This proves that the J1NN method can produce similar effects to 11NN(θ < 1).JKNN: We calculate the average distance of J-nearest neighbors and their respective K-nearest neighbors.Based on the previous derivations of 1KNN and J1NN, we obtain: As the parameters J and K vary, we observe that these two parameters will offset each other's influences.When β α > 1 , JKNN is similar to 11NN (θ > 1), allowing it to accept more outliers as the target class.When β α < 1, JKNN is similar to 11NN (θ < 1), making the criteria more stringent, and more data will be considered as outliers.

One-Class Nearest Neighbor (OCNN) Parameter Optimization
Based on the above discussion, we understand the relationships between different types of OCNN classifiers.The settings of parameters J, K, and threshold θ will be an important topic.
Parameter optimization is a challenging issue in one-class classifiers because, in the training data, only data from the target class can be used, unlike the situation with multiclass data, where traditional classifiers utilize data from different classes to make decision boundaries.Here, we use some methods to identify outliers in the target class data for parameter optimization of the OCNN classifier.Regarding the selection of nearest neighbors and their distance calculations, we can identify the following issues faced by different OCNN classifiers: Firstly, we assume the target class as negative data and the non-target class as positive data.

•
False Negatives: In real-world datasets, noise samples may be generated due to human errors (incorrect labeling, operational negligence, etc.).The OCNN classifier described earlier cannot detect this phenomenon.When target class samples exhibit a tight configuration, noise samples far from the cluster will lead to unknown non-targetclass data being incorrectly classified as target class data.

•
False Positives: If we do not find an appropriate decision threshold θ after removing noise samples from the dataset, the OCNN classifier will identify many test data as non-target-class data.Another situation leading to false positives occurs when the target class in the training data cannot demonstrate sufficient representativeness.
Yin et al. [11] mentioned that in one-class classification problems, designing an error detection system while simultaneously reducing false negatives and false positives is a difficult task due to the lack of non-target class data.Generally, one-class classifiers are sensitive to parameter settings [12].A common approach to optimizing parameters for one-class classifiers is to use generated synthetic samples.Ref. [13] attempts to assume the distribution of the non-target class using artificially generated samples.However, generating artificial data has some issues, requires in-depth knowledge of the domain, and may lead to overfitting.
In this article, we describe how to optimize the parameters of OCNN classifiers using one-class data using the outliers identified by the above methods, and we consider them as proxies for the non-target class and use cross-validation to optimize parameters J, K, and θ.

K-Means with Outlier Removal (KMOR) Algorithm
KMOR [14] is an algorithm based on K-means that can simultaneously detect outliers and perform clustering [15][16][17].Traditional K-means algorithms can experience drastic changes in clustering results due to the presence of outliers, as illustrated in Figure 3 below.Circles represent normal data, while asterisks represent outliers.Setting the number of clusters to two, the left portion shows clustering results without considering the presence of outliers, while the right portion demonstrates clustering results that account for outliers, resulting in a more asymmetric clustering outcome.To address the aforementioned issue, KMOR introduces the concept of the K + 1 cluster, where data identified as outliers are assigned to the K + 1 cluster.Additionally, these outliers are independently treated in the objective function.The objective function of KMOR is defined as follows: In the equation, U represents the membership of all data points in the clusters, Z denotes the cluster centers, and u i,j = 1 if a data point x i belongs to the jth cluster.n 0 restricts the maximum number of data points identified as outliers in the entire dataset.
D(U, Z) can be expressed as follows: The parameter γ > 0 represents the weight of the average distance from all nonoutliers to their respective clusters.γ and n 0 control the number of outliers in the dataset.Updating cluster centers and cluster assignments differs slightly from the original K-means.The rules are as follows: 1.
Rule 1: Calculate the distance from each data point to all cluster centers.If the distance from the data point to the cluster center is less than D(U, Z), assign it to the cluster with the shortest distance; otherwise, assign it to cluster K + 1.

2.
Rule 2: Update the cluster centers for clusters 1 to K by averaging the data points in each cluster.Data points classified as cluster K + 1 do not participate in the calculation.
The detailed process please referring to Algorithm 2 as follows: Algorithm 2 Algorithm flow for KMOR  We initially employ outlier detection methods (such as IQR or KMOR) to identify outliers in the target class of training data.These outliers are then treated as non-targetclass instances, forming a binary dataset.Subsequently, using the K-fold cross-validation method, we divide the dataset into K subsets, with one subset reserved for testing and the remaining (K − 1) subsets for training.We evaluate the performance of each subset for various combinations of J and K values in JKNN.Based on the results obtained, we store them in a two-dimensional matrix indexed by J and K. Finally, we select the indices (representing J and K values) corresponding to the highest average performance.

Optimal Parameter θ
After identifying outliers in the training dataset and reorganizing it into a binary dataset, we employ cross-validation to determine, for each validation data point, the distance to its nearest neighbor and the distance between that neighbor and its own nearest neighbor (11NN).Dividing these distances yields a test threshold value, θ.After computing this for all folds, we obtain a θ vector of length N. We then compare these thresholds, calculating an array representing the performance of G-means.The index of the best G-means value in this array corresponds to the optimal threshold value, θ.

Location-Based Nearest Neighbor (LBNN) Algorithm
We understand that, whether it's 11NN or JKNN, when an unknown data point comes in, they both need to perform two rounds of nearest neighbor searches on the entire training dataset.The first-round searches for J-nearest neighbors and the second round searches for J.K-nearest neighbors result in a time complexity of O(2dn + J × 2dn) → O((J + 1) × 2dn), where the part of searching for nearest neighbors of nearest neighbors can be anticipated to be mostly within adjacent blocks.In other words, the unknown data point and these nearest neighbors are mainly compared locally (Figure 4), without considering the overall distribution characteristics of the data.This may affect the final performance of the model.Therefore, we propose a clustering-based nearest neighbor search strategy, LBNN, to compare with 11NN and JKNN.Our LBNN strategy initially applies KMOR [18] clustering to the training data, setting a percentile Q, 0 ≤ Q ≤ 1.For an unknown data point u i , i ∈ {1, 2, . . .n}, we find one nearest neighbor P i,c , c ∈ {1, 2, . . .k} from each cluster.These nearest neighbors are considered reference points for their respective clusters.Finally, we calculate the distance L i,c between the reference point P i,c and the other data points in the same cluster.If the distance d i,c from the unknown data to any cluster reference point is less than the percentile Q of its distance L i,c , we classify the unknown data as the target class.compute distance between P and other data in With the rapid development of modern technology, the improvement in the performance of hardware and software, and the widespread application of the Internet of Things, data are generated at an unprecedented rate.This includes high-definition videos, images, text, audio, and data obtained from the rise of social relationships and the Internet of Things.Such data often possesses features with multiple dimensions, presenting a challenging task for accurate data analysis and decision making.Feature selection can effectively handle multidimensional data and enhance learning efficiency, a notion that has been proven in both theory and practice.
Feature selection refers to the process of obtaining a subset of features from the original features based on certain criteria.The feature selection criteria gather relevant features of the dataset.It plays a crucial role in reducing the computational cost of data processing by eliminating unnecessary and irrelevant features.Feature selection is considered a preprocessing step for data and learning algorithms, where good feature selection results can improve model accuracy and reduce training time.
In this article, the real-world medical data we used contain a large number of features (64 in total).To enhance the performance of various algorithm models, we employed a stepwise feature selection method called "Stepwise".The concept is, in each round, to select only one feature at a time, retaining the combination of features with the best performance or continuing until a specified number of features to retain is achieved.Assuming a dataset has ten features, and our evaluation metric is the area under the receiver operating characteristic (ROC) curve, which is the area under the ROC curve (AUC), the detailed process is as follows: 1.
In the first round of feature selection, we select one feature at a time for model training.
After testing all features, we retain the feature with the best AUC performance.

2.
Similar to the first step, we choose one feature at a time from the remaining nine features, but this time, we include the feature retained from the first round.In the end, we obtain the two features with the best performance.

3.
We repeat the first and second steps until the specified number of retained features is achieved.

Results
This section provides an overview of the experimental environment, including details on the dataset utilized and the configuration of relevant parameters.

Execution Environment
The experiments were conducted in a controlled setting to ensure consistency and reproducibility.The hardware device is manufactured by HP who's headquarter is located in Palo Alto, CA.United States.The hardware and software specifications are outlined below: The specified parameters and evaluation metrics serve as the foundation for assessing the performance of KMOR-integrated algorithms on the KEEL dataset.The rigorous parameter tuning and metric selection aim to provide a comprehensive evaluation framework for the conducted experiments.The algorithm-specific parameter configuration for the KEEL dataset is as below Table 1: Table 1.Relevant parameter settings for algorithms using KMOR on KEEL dataset.

Number of rounds 10
Nu for One-Class SVM 0.5 Gamma for One-Class SVM 1/d (number of features) KEEL is a software tool (Version Release 3.0 and the released date: 9 April 2018) that assesses evolutionary algorithms for data mining problems such as regression, classification, clustering, pattern mining, and more.KEEL provides a repository of preprocessed datasets for classification problems, including imbalanced datasets.Imbalanced datasets are a special case for classification problems where the class distribution is not uniform among the classes.Typically, they are composed of two classes: the majority (negative) class and the minority (positive) class.We leveraged eight standard imbalanced datasets from the KEEL dataset repository collected from various domains, all comprising two classes.The datasets such as glass2 and glass4 are for glass classification, with '2' in glass2 indicating the second class as positive and the rest as negative, forming a one-versus-rest two-class dataset.Similarly, datasets like glass4, Yeast4, and ecoli4 follow the same concept.The Yeast series and ecoli4 datasets are for biological applications (protein classification).The segment0 dataset is for outdoor object image classification, with features consisting of various pixel information for the images.On the other hand, the pageblocks0 dataset is for document classification, with features comprising the layout information of the documents.
Regarding the real medical datasets, we obtained three types of medical data from Dr. Tsai, who is one of our authors working in Chang Gung Memorial Hospital; those datasets could be obtained from Dr. Tsai's email address, tcmnor@cgmh.org.tw.
ROSC (return of spontaneous circulation) indicates that an emergency patient had no breathing or heartbeat upon admission.For such patients, we have two different outcomes for prediction.ROSC-CPC12 predicts whether a patient will have an excellent prognosis after 12 months (able to live independently) after discharge, the minority samples indicate the alive patients, and the majority samples indicate the patients not able to survive; ROSC-30DayS predicts whether a patient will survive 30 days after discharge, in which the minority samples indicate those patients survived after 30 days, and the majority samples indicate those patients that did not survive.Stroke-Poor is used to predict whether a stroke patient will have another severe stroke after discharge, where the minority samples mean the patients will not have another stroke, and the majority samples indicate the patients will have another stroke.The features of these medical datasets are all derived from the patients' physiological test results (blood sugar, blood pressure, etc.) and the use of various drugs and treatment methods.
In the below Tables 2 and 3, we listed the dataset structure, including features, samples, minority samples, majority samples, and imbalanced ratio.

Experimental Framework
The datasets utilized in this article consist exclusively of binary classifications.In the training process of the One-Class Nearest Neighbor (OCNN) algorithm, one of the classes is designated as the target class.Initially, the dataset is divided into training and testing sets using cross-validation.Subsequently, the training data are subjected to the methodology outlined in Section 2, distinguishing them into target class data and outliers.The outliers, treated as non-target-class data, are employed for parameter optimization.The final evaluation of the model is conducted using the testing data, generating various performance metrics.The experimental framework is illustrated in Figure 5.In contrast to the OCNN training architecture, our approach, LBNN, initiates with clustering the one-class training data using KMOR.A percentile, denoted as Q, is set during this process.The experimental framework is illustrated in Figure 6.
From the above experimental result from Tables 4-8, we can tell the improved OCNN and LBNN methods exhibit better performance on most datasets.Although the LBNN method has poorer performance in terms of TPR, it compensates by significantly improving TNR, resulting in more stable variations between TPR and TNR and hence better G-means performance.In a few datasets, the LBNN method has an AUC less than 0.6, which may be due to insufficient representativeness of one-class data, overlapping phenomena, or unprocessed features.

Real Medical Dataset Results
In the Tables 9-11 below, we list the evaluation metric results for three real datasets from a local hospital.We selected 3 traditional methods for our LBNN comparison which including Logistic Regression, Support Vector Machine, Random Forest [19].From the above result with the ROSC-CPC12 dataset, here we can see the LBNN method obtains better results compared with the three other traditional methods except for accuracy and specificity.The recall for the LBNN model improved significantly.In contrast, the LBNN method underperformed the traditional model in terms of accuracy and specificity.The higher recall of the LBNN model indicates a more stringent evaluation of unknown data, even at the cost of sacrificing accuracy for the target class.For the ROSC-30Days dataset, the results of the LBNN method are all better than the three other traditional methods, and the insight of the results should indicate that with lower imbalance ratio data, the LBNN method obtains significantly improved results compared with the ROSC-CPC12 dataset results.From the above Stroke-Poor results, the LBNN method was able to obtain better performance on precision and recall, which is more important as this means fewer falsepositive and fewer false-negative cases.For the Stroke-Poor example, it means there were fewer misdiagnosed patients.Therefore, we can believe the LBNN method obtained enhanced performance for those three real medical datasets.

Discussion
For KEEL dataset experiments, based on the experimental results, our enhanced OCNN method and the proposed LBNN method exhibit superior performance across most datasets.While the LBNN method may slightly lag behind the other methods in terms of its true-positive rate (TPR), its substantial improvement in its true-negative rate (TNR) contributes to a more stable overall performance.The variations in TNR and TPR are relatively steady.Consequently, the LBNN method outperforms most methods when evaluated against G-means.In terms of the area under the curve (AUC), a few datasets show values below 0.6, possibly due to inherent high overlap in the dataset or insufficient representation of class characteristics.Comparing G-means metrics, the LBNN method demonstrates superior stability.
For the real dataset experiments, in investigating the ROSC-CPC12 dataset, given its substantial imbalance, we employed traditional model preprocessing through data sampling, such like oversampleing [20][21][22][23] and undersampleing [15,24,25] methods.The results revealed a notable improvement in recall but at the expense of precision and specificity.According to experimental findings, the LBNN method exhibited the best performance in terms of precision, recall, and G-means across all medical datasets.However, in terms of accuracy and specificity, the LBNN method generally lagged behind the other traditional algorithms.This is attributed to the LBNN method's use of a stricter standard to enhance re-call, sacrificing the precision of the target class and subsequently reducing overall accuracy.Nevertheless, as described by medical professionals, in healthcare applications, emphasis is placed on precision and recall as crucial metrics (typically requiring a recall above 0.8).This implies that the LBNN method is better suited to effectively identify individuals truly afflicted with a disease, aligning more closely with the practical requirements of medical applications.Future research directions may involve further optimizing the LBNN method to enhance its performance on other metrics and further applying integrated approaches to strike a balance between accuracy and recall.

Conclusions
In practical applications, the collection of industrial and medical data often encounters challenges related to one-class or imbalanced scenarios.Traditional algorithms typically require data sampling, weighting adjustments, or cost-sensitive learning to achieve reasonable results aligned with specific context needs.However, these methods have drawbacks, such as information loss in sampling, potential overfitting with added minority class data, and questionable interpretability when generating synthetic samples.Weight adjustments and cost-sensitive methods focus on ensuring accurate classification of the minority class, leading to decreased recognition rates for the majority class.
Addressing imbalanced data, we approach the problem from a one-class perspective.In the medical data experiments, we used the majority of data from the majority class for training, with the remaining majority class data and all minority class data mixed for testing.Experimental results indicate that the LBNN method outperforms other traditional algorithms in terms of accuracy, recall, and G-means.The LBNN method can also be applied to personal wearable devices in advanced healthcare.In this application, where initially collected patient data are predominantly normal, requiring patients to wear the device until sufficient data are collected for both classes is impractical.The LBNN method provides a viable solution during the transition period of single-class or extremely imbalanced data.
While the LBNN method exhibits good accuracy and recall for the non-target class, it sacrifices some specificity.Future considerations may involve integrating other traditional algorithms to reinforce the accuracy of the target class.In terms of computational complexity, the search for the nearest neighbor is the most time-consuming step.Exploring approximate nearest-neighbor methods could be attempted to observe the trade-off between algorithm acceleration and accuracy changes.However, a challenge remains in the application of the OCNN and LBNN methods in edge computing.They lack the capability of Online Learning, common in some neural network models, to update internal model parameters at a faster rate when new data arrive.In the OCNN and LBNN models, every new data point requires retraining the entire dataset, leading to time-intensive predictions.

Data Availability Statement:
The datasets we got from KEEL is open source and can be found from below link https://sci2s.ugr.es/keel/imbalanced.php (accessed on 19 February 2024) and the real medical datasets are not available due to privacy restriction.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
consider the test data z as the target class; otherwise, consider it as a non-target class.

Figure 2 .
Figure 2. Four kinds of OCNN.The red five-pointed star symbol represents a new data sample, (a) represents 11NN, the new data sample distance to the closest data point is D1 and the point to it's the closest data point is D 11 , (b) represents 1KNN, in this case K = 3, (c) represents J1NN, in this case, J = 3, (d) represents JKNN, J = 3, K = 2

Figure 3 .
Figure 3. K-means and KMOR illustration.The five-pointed star means the outliers.

Algorithm 3 8 :
Our LBNN strategy has a search time complexity of O(kdn + kdn) → O(2kdn), where k is the number of clusters.The LBNN process is illustrated in Algorithm 3. Pseudo-code of LBNN 1: Input: Target training data (D), number of cluster (k) percentile rank (Q), testing data (T) 2: Output: an result array R (prediction of T) 3: Apply KMOR clustering for Target training data (D), then we get k clusters in D 4: N ← number of testing data (T) 5: Foreach i ∈ 1, . .., Foreach i ∈ 1, . .., N do 9: foreach i ∈ 1, . .., k do 10: find nearest neighbor P of T[i] in cluster j and 11: record this distance d 12:
Accept or reject the test data z as target class data.3: Calculate the distance from the test data z to the J-nearest neighbors and compute the average.NN tr j (z) represents the J-nearest neighbors of z, expressed as: 1: Input: N target class data points with d dimensions, test data z, nearest neighbor parameters J, K, and threshold θ. 2: Output: Means with Outlier Removal (KMOR) technique, the setting of the cluster number k is a critical consideration.We employ the KMOR technique to perform clustering on the data, selecting the cluster number k based on the minimum value obtained from the objective function.

Table 11 .
Evaluation metrics for Stroke-Poor.