# A New K-Nearest Neighbors Classifier for Big Data Based on Efficient Data Pruning

## Abstract

## 1. Introduction

- Taking into account two new factors (cluster’s density and spread shape) in order to prune the training dataset more efficiently; and
- Implementing the proposed method and evaluating its classification accuracy and time cost in comparison to KNN, LC-KNN, and RFKNN methods.

## 2. Related Work

## 3. The Proposed KNN Algorithm for Big Data

_{1}), most of its nearest neighbors are located in the other cluster. Consequently, using cluster 2’s training samples helps the KNN to find better neighbors and probably yield a more accurate classification.

**Algorithm 1.** Choosing the proper cluster.

Input: a test sample ($t$) to be classified by KNN and $m$ clusters of data

Output: the data cluster that is used for finding the $k$ nearest neighbors

1. Begin
2. for $i = 1$ to $m$
3. Find the Euclidean distance between $t$ and cluster center $i$ ($d^{i}$)
4. end for
5. for $i = 1$ to $m$
6. if $d^{i} \le \alpha \times d^{j} \; \forall j \ne i$
7. return $i$
8. for each cluster $i$ for which $d^{i} \le \alpha \times d^{ave}$
9. Calculate the distance defined by Equation (2) ($D^{i}$)
10. for each cluster $i$ for which $d^{i} \le \alpha \times d^{ave}$
11. if $D^{i} \le \beta \times D^{j} \; \forall j \ne i$
12. return $i$
13. for each cluster $i$ for which $D^{i} \le \beta \times D^{ave}$
14. Calculate the density of the cluster using Equation (3) ($dens^{i}$)
15. Choose the cluster with maximum density
16. End
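The three-stage filtering in Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. Equations (2) and (3) are not reproduced in this text, so they are passed in as the hypothetical callables `shape_distance` and `density`. The sketch also assumes that $d^{ave}$ is the mean center distance over all clusters, that $D^{ave}$ is the mean of $D^{i}$ over the clusters that pass the first filter, and that an empty candidate set falls back to the previous stage; none of these details are stated in Algorithm 1.

```python
import math
from typing import Callable, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def choose_cluster(
    t: Point,
    centers: Sequence[Point],
    shape_distance: Callable[[Point, int], float],  # hypothetical stand-in for Equation (2)
    density: Callable[[int], float],                # hypothetical stand-in for Equation (3)
    alpha: float = 0.5,
    beta: float = 0.7,
) -> int:
    """Return the index of the cluster in which to search for neighbors."""
    m = len(centers)

    # Steps 2-4: distance from the test sample to every cluster center.
    d = [euclidean(t, c) for c in centers]

    # Steps 5-7: one center is clearly closer than all the others.
    for i in range(m):
        if all(d[i] <= alpha * d[j] for j in range(m) if j != i):
            return i

    # Steps 8-9: keep clusters whose center is close relative to the mean.
    d_ave = sum(d) / m  # assumption: mean over all clusters
    candidates = [i for i in range(m) if d[i] <= alpha * d_ave]
    if not candidates:
        # Assumption (not in Algorithm 1): fall back to the nearest center.
        return min(range(m), key=lambda i: d[i])
    D = {i: shape_distance(t, i) for i in candidates}

    # Steps 10-12: one candidate is clearly closer under Equation (2).
    for i in candidates:
        if all(D[i] <= beta * D[j] for j in candidates if j != i):
            return i

    # Steps 13-15: among the remaining candidates, take the densest one.
    D_ave = sum(D.values()) / len(D)  # assumption: mean over candidates
    finalists = [i for i in candidates if D[i] <= beta * D_ave]
    if not finalists:
        finalists = candidates  # assumption (not in Algorithm 1)
    return max(finalists, key=density)
```

The defaults `alpha=0.5` and `beta=0.7` follow the values that Figures 3 and 4 report as performing better on average.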

**Algorithm 2.** The proposed KNN for big data.

Input: a large amount of data and a set of test samples to be classified (the data space has $n$ dimensions)

Output: the estimated class of each test sample

1. Begin
2. Divide the data into $m$ separate clusters using the k-means algorithm
3. For each cluster $i = 1 \dots m$
4. Calculate the size of the cluster (number of data samples)
5. Calculate $d_{j}^{i}$ $(j = 1 \dots n)$
6. For each test sample
7. Choose the proper cluster of data by using Algorithm 1
8. Employ the KNN algorithm on the selected cluster to find the estimated class of the test sample
9. End
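The overall flow of Algorithm 2 (cluster once, then run KNN only inside the cluster selected for each test sample) can be sketched in plain Python as follows. This is an illustrative outline under stated assumptions, not the authors' code: `kmeans` is a basic Lloyd-style implementation, the default selector simply picks the nearest center as a stand-in for Algorithm 1, and the per-dimension quantities $d_{j}^{i}$ of step 5 are omitted because their definition is not given in this text. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))


def kmeans(points, m, iters=20, seed=0):
    """Basic Lloyd iteration; returns (centers, label of each point)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(list(points), m)]
    for _ in range(iters):
        labels = [nearest_center(p, centers) for p in points]
        for i in range(m):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels


def knn_predict(t, X, y, k):
    """Majority vote among the k training samples closest to t."""
    nearest = sorted(range(len(X)), key=lambda i: euclidean(t, X[i]))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]


def classify(train_X, train_y, test_X, m, k, choose=nearest_center, seed=0):
    """Step 2: cluster the training data; steps 6-8: per-sample pruned KNN."""
    centers, labels = kmeans(train_X, m, seed=seed)
    clusters = [[i for i, lab in enumerate(labels) if lab == c] for c in range(m)]
    sizes = [len(idx) for idx in clusters]  # step 4: cluster sizes
    predictions = []
    for t in test_X:
        c = choose(t, centers)  # stand-in for Algorithm 1
        # Assumption: if the chosen cluster is empty, search all training data.
        idx = clusters[c] if sizes[c] else list(range(len(train_X)))
        predictions.append(
            knn_predict(t, [train_X[i] for i in idx], [train_y[i] for i in idx], k)
        )
    return predictions
```

Because each query searches a single cluster rather than the full training set, the neighbor search cost per test sample shrinks roughly in proportion to the cluster size, which is the source of the time savings the method targets.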

## 4. Experimental Results

#### 4.1. The Characteristic of the Datasets

#### 4.2. Performance Evaluation with Different Values of the m Parameter

#### 4.3. The Effect of $\alpha $ and $\beta $ Parameters’ Values on the Accuracy of the Proposed Method

#### 4.4. Classification Accuracy and Time Cost Comparisons

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

**Figure 1.** Inefficiency of the distance to the cluster's center as a single criterion for choosing the proper cluster.

**Figure 3.** Classification accuracy of the proposed approach for different values of the $\alpha$ parameter. As can be seen, the proposed approach performs better on average at a value of 0.5.

**Figure 4.** Classification accuracy of the proposed approach for different values of the $\beta$ parameter. As can be seen, the proposed approach performs better on average at a value of 0.7.

**Figure 5.** Classification accuracy of the proposed approach (PA) and LC-KNN for different $k$ values. As can be seen, higher values of the $k$ parameter generally lead to lower classification accuracy for both the proposed approach and LC-KNN.

Dataset Name | Number of Instances | Number of Attributes | Number of Class Labels |
---|---|---|---|
USPS | 7291 | 256 | 10 |
MNIST | 60,000 | 780 | 10 |
GISETTE | 13,500 | 5000 | 2 |
LETTER | 20,000 | 16 | 26 |
PENDIGITS | 10,992 | 16 | 10 |
SATIMAGE | 6430 | 36 | 6 |
ADNC | 427 | 93 | 2 |
psMCI | 242 | 93 | 2 |
MCINC | 509 | 93 | 2 |

**Table 2.** Classification accuracy (mean of 10 runs) of the proposed approach and the LC-KNN algorithm at different values of the $m$ parameter.

m | Method | USPS | MNIST | GISETTE | LETTER | PENDIGITS | SATIMAGE | ADNC | psMCI | MCINC |
---|---|---|---|---|---|---|---|---|---|---|
10 | LC-KNN | 0.9355 | 0.8389 | 0.9526 | 0.9495 | 0.9721 | 0.8883 | 0.7667 | 0.5833 | 0.6159 |
10 | Proposed approach | 0.9501 | 0.8691 | 0.9647 | 0.9484 | 0.9798 | 0.9162 | 0.7711 | 0.6165 | 0.6198 |
15 | LC-KNN | 0.9338 | 0.8364 | 0.9494 | 0.9469 | 0.9711 | 0.9468 | 0.7500 | 0.6042 | 0.5633 |
15 | Proposed approach | 0.9512 | 0.8680 | 0.9623 | 0.9467 | 0.9790 | 0.9374 | 0.7680 | 0.6427 | 0.5802 |
20 | LC-KNN | 0.9300 | 0.8353 | 0.9411 | 0.9451 | 0.9700 | 0.8884 | 0.7143 | 0.6500 | 0.6500 |
20 | Proposed approach | 0.9495 | 0.8675 | 0.9608 | 0.9457 | 0.9781 | 0.9206 | 0.7628 | 0.7013 | 0.6833 |
25 | LC-KNN | 0.9284 | 0.8338 | 0.9321 | 0.9423 | 0.9687 | 0.9421 | 0.7071 | 0.6417 | 0.5746 |
25 | Proposed approach | 0.9482 | 0.8669 | 0.9567 | 0.9448 | 0.9775 | 0.9456 | 0.7601 | 0.7025 | 0.6147 |
30 | LC-KNN | 0.9275 | 0.8313 | 0.9192 | 0.9403 | 0.9683 | 0.8878 | 0.7190 | 0.6125 | 0.5984 |
30 | Proposed approach | 0.9475 | 0.8658 | 0.9513 | 0.9439 | 0.9761 | 0.9258 | 0.7608 | 0.6784 | 0.6421 |

**Table 3.** Classification accuracy and time cost of the proposed approach in comparison to the other related works on different datasets.

Dataset | KNN Accuracy | KNN Time | LC-KNN Accuracy | LC-KNN Time | RFKNN Accuracy | RFKNN Time | Proposed Approach Accuracy | Proposed Approach Time |
---|---|---|---|---|---|---|---|---|
USPS | 0.9503 | 44.8120 | 0.9300 | 4.9874 | 0.9471 | 7.3458 | 0.9495 | 5.1213 |
MNIST | 0.8768 | 35.0211 | 0.8353 | 4.6309 | 0.8534 | 6.9142 | 0.8675 | 4.8757 |
GISETTE | 0.9660 | 296.4012 | 0.9411 | 37.5111 | 0.9631 | 51.2430 | 0.9608 | 40.6560 |
LETTER | 0.9518 | 26.3548 | 0.9451 | 4.3528 | 0.9489 | 6.5956 | 0.9457 | 4.7016 |
PENDIGITS | 0.9793 | 9.6935 | 0.9700 | 3.2756 | 0.9772 | 5.0158 | 0.9781 | 3.5584 |
SATIMAGE | 0.9315 | 4.7499 | 0.8884 | 1.7377 | 0.9281 | 2.7885 | 0.9206 | 1.9511 |
ADNC | 0.7906 | 0.0473 | 0.7143 | 0.0450 | 0.7709 | 0.0459 | 0.7628 | 0.0453 |
psMCI | 0.7195 | 0.0240 | 0.6500 | 0.0234 | 0.6964 | 0.0236 | 0.7013 | 0.0236 |
MCINC | 0.7201 | 0.0766 | 0.6500 | 0.0690 | 0.6916 | 0.0713 | 0.6833 | 0.0694 |
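As a reading aid for Table 3, the short sketch below derives two quantities from the reported numbers: the speedup of the proposed approach over plain KNN (ratio of time costs) and the accuracy given up in exchange. The values are copied from the KNN and Proposed Approach columns of the table; the function name `summarize` is hypothetical, and the time unit is not stated in the table, so only the unit-free ratio is computed.

```python
# (dataset, KNN accuracy, KNN time, proposed accuracy, proposed time), from Table 3
TABLE_3 = [
    ("USPS", 0.9503, 44.8120, 0.9495, 5.1213),
    ("MNIST", 0.8768, 35.0211, 0.8675, 4.8757),
    ("GISETTE", 0.9660, 296.4012, 0.9608, 40.6560),
    ("LETTER", 0.9518, 26.3548, 0.9457, 4.7016),
    ("PENDIGITS", 0.9793, 9.6935, 0.9781, 3.5584),
    ("SATIMAGE", 0.9315, 4.7499, 0.9206, 1.9511),
    ("ADNC", 0.7906, 0.0473, 0.7628, 0.0453),
    ("psMCI", 0.7195, 0.0240, 0.7013, 0.0236),
    ("MCINC", 0.7201, 0.0766, 0.6833, 0.0694),
]


def summarize(rows):
    """Map dataset -> (speedup over KNN, accuracy lost relative to KNN)."""
    return {
        name: (knn_time / pa_time, knn_acc - pa_acc)
        for name, knn_acc, knn_time, pa_acc, pa_time in rows
    }


if __name__ == "__main__":
    for name, (speedup, gap) in summarize(TABLE_3).items():
        print(f"{name:10s} speedup x{speedup:.2f}  accuracy gap {gap:.4f}")
```

On the larger datasets the ratio is well above 1, while on the three smallest datasets (ADNC, psMCI, MCINC) the time costs of all methods are nearly identical, so the ratio stays close to 1.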

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

