Improved k-Means Clustering Algorithm for Big Data Based on Distributed SmartphoneNeural Engine Processor

: Clustering is one of the most signiﬁcant applications in the big data ﬁeld. However, using the clustering technique with big data requires an ample amount of processing power and resources due to the complexity and resulting increment in the clustering time. Therefore, many techniques have been implemented to improve the performance of the clustering algorithms, especially for k-means clustering. In this paper, the neural-processor-based k-means clustering technique is proposed to cluster big data by accumulating the advantage of dedicated machine learning processors of mobile devices. The solution was designed to be run with a single-instruction machine processor that exists in the mobile device’s processor. Running the k-means clustering in a distributed scheme run based on mobile machine learning efﬁciently can handle the big data clustering over the network. The results showed that using a neural engine processor on a mobile smartphone device can maximize the speed of the clustering algorithm, which shows an improvement in the performance of the cluttering up to two-times faster compared with traditional laptop/desktop processors. Furthermore, the number of iterations that are required to obtain (k) clusters was improved up to two-times faster than parallel and distributed k-means.


Introduction
Currently, we are in a data flood era, as proven by the massive amounts of continuously generated data at unprecedented and ever-increasing scales. In the recent decade, machine learning techniques have become increasingly popular in a wide range of large and complex data-intensive applications, such as astronomy, as well as medicine, biology, and other sciences [1]. These strategies offer potential options for extracting hidden information from the data. However, as the era of big data approaches, the growth of dataset collection in such a large and complex way makes it difficult to deal with it using conventional learning methods, as the learning process for traditional datasets is not designed for and may not work well with large amounts of data. Most classical machine learning algorithms are built to process data that are loaded into memory [2], which is no longer true in the big data context.
The usefulness of the massive volumes of data can be achieved only if the meaning of those data is guaranteed, and proper information can lead to the right path. The information-gathering process from huge unstructured or semi-structured data is the socalled clustering technique. Clustering is a technique of grouping elements based on the similarity of their characteristics and returns those elements as clusters. Thousands of clustering algorithms have been published based on this concept, and k-means is one of the most used. k-means is widely used with a wide range of applications due to its simplicity of implementation and its effectiveness. In the literature, different modifications have been proposed for improving the performance and efficiency of the k-means clustering algorithm.
Big data analytics can extract useful information from numerous amounts of data generated by a variety of sources [3]. Although computer systems and Internet technologies 1.
Maximizing the performance of the k-means algorithm by running it on the dedicated neural engine processor of smart mobile devices by editing the code and steps of the kmeans algorithm to run on the single-instruction-based machine with an ARM-based processor; 2.
Dividing the k-means clustering technique into two parts. The first part, which contains the main tasks of the k-means algorithm, runs on a neural engine processor, while the second part, which contains the general tasks of the k-means technique, runs on the mobile general-purpose processor; 3.
Using a distributed Spark server, multiple mobile devices can process the k-means algorithm many times in a distributed network and give the optimal result to the server to select the optimal number of (k).
This paper contains seven sections. The Section 1 discusses the importance of big data and clustering algorithms in real-world applications and especially focuses on the main characteristics and limitations when working with big data. The big data clustering algorithms and main techniques are discussed in Section 2. Section 3 presents the main advantages and problems in the current big data platform. Section 4 presents the recent research in big data clustering. Section 5 presents the proposed solution. Section 6 discusses the implementation and results of the proposed solution. Section 7 is dedicated to the conclusions.

Big Data Clustering
Clustering is the technique of splitting data objects into a set of clusters based on the similarities and differences of the objects within datasets. It is considered as one of the important methods to recognize the set of objects based on their similarities and generate a pattern from a dataset that is unlabeled (unsupervised). Typically, the technique of big data clustering can be categorized into two types [10]: single-machine clustering and multiple-machine clustering techniques, as is illustrated in Figure 1. The single-machine clustering technique is designed to cluster the data on a single device without the ability to take advantage of a gridded or distributed high-performance system. However, the multiple-machine clustering technique is much faster and more adaptable and can handle the new big data challenges much better [11]. Therefore, in this article, the main focus was on multi-machine clustering.

Multi-Machine Clustering
Multi-machine clustering is divided into three main types based on the mechanism of taking advantage of multi-machine performance. The multi-machine term is used for a single or many devices that have a multi-processing ability. Typically, the multi-machine technique is divided into two main categories, which are:

1.
Parallel clustering: In general, parallel computing is used to handle many tasks simultaneously, which is typically designed to handle heavy tasks. Therefore, as clustering is considered a heavy task, parallel computing is used with it. In this section, some distributed clustering and parallel processing techniques that can deal with big data are examined. The parallel clustering splits the data partitions that are distributed on different machines [12]. This improves the scalability and speeds up the clustering [13]; 2.
MapReduce-based clustering: This technique works with large volumes of data by partitioning tasks to be distributed and executed on a large number of servers. Such a technique decomposes the tasks (Map) into smaller tasks that can be dispatched. Finally, it collects and consolidates the results (Reduce) [14]. Figure 2 describes the function of the MapReduce technique. In this work, special attention was given to partitioned-based clustering algorithms. This technique family has been proven in terms of efficiency and clustering performance, and this is due to its ability to detect noisy objects and find clusters of unusual shapes.

k-Means Clustering
k-means clustering seeks to group n objects into k clusters based on the nearest mean value group, serving as a prototype of the cluster. The iterative refinement technique is the most common technique used with the standard k-means algorithm. However, this technique is the so-called native k-means to differentiate from the much faster alternative solutions that have already been developed. Each clustered object should be compared and calculated based on the nearest centroid points, which are random initialized points to represent the cluster center.
k-means clustering has three main issues that most of the recent solutions have been trying to tackle, which are [15,16]: (P1) pre-clustering requirements, which represent the number of clusters and answer the following question "Do the data have clusters? If yes, how many (k)?"; (P2) finding the k clusters of the data; (P3) post-clustering validation, which represents the validity and accuracy of the clustering process.

Big Data Platform
A summary of the most-used platforms for big data is given in this section. In general, there are two types of platforms, which are: horizontal scaling and vertical scaling, as shown in Figure 3.
The horizontal scaling platforms work based on scaling the processing performance by increasing the number of devices in the clustering system. This technique includes MapReduce, peer-to-peer networks, and Spark. On the other hand, the vertical scaling principle is to scale the processing power for clustering of the device itself (vertically). This technique includes a graphics processing unit, field programmable gate array (FPGA), and multi-core CPU. Each platform has several specific design clustering algorithms, and some related techniques are discussed in this paper.

Spark Platform
Apache Spark is a data-intensive application framework that is designed to process big data and can be executed on commodity clusters [17]. The main difference between the Spark framework and the competition such as MapReduce is that it loads only the useful dataset into the memory, which enables iterative jobs to run queries on big datasets. Such a technique can reduce the execution time significantly.
There are three fundamental aspects presented in Spark, which are [18]: resilient distributed datasets (RDDs), parallel operations, and shared variables. RDDs are a sharedmachine-based collection of objects that can be restored in the case of loss. They are reusable in multiple parallel MapReduce jobs, as they can be stored in memory. Secondly, the tasks can be executed in parallel with RDDs by reducing and collecting for each operation. The last aspect consists of broadcast variables and accumulators.
Practically, the main advantages of Spark are its flexibility, ease of use, and that the program does not need any abstraction. The high performance of Spark enables it to deal with data in a real-time streaming module and by using distributed workers, which is caching the result partially in memory. Moreover, it is very efficient such that it outperforms the Hadoop MapReduce framework [19] while preserving the fault tolerance and scalability of MapReduce, which is up to 10-times faster for interactive machine learning workloads [12].

Related Work
A variety of scalable machine learning algorithms have been developed in recent years to address various difficulties in big data analytics, with the majority of them designed to cluster data before processing.
In 2017, Reference [20] used a data science and engineering solution based for fast k-means clustering of big data. The article addressed the k-means problems with the risks associated with the random selection of the k centroid, the equal circular cluster, and the complexity of k-means, which is very high. The solution used a heuristic prototype-based algorithm together with k-means clustering to improve the efficiency and scalability of the clustering process.
In 2018, Reference [21] studied the performance of the k-nearest neighbor classifier (KNN) and k-means clustering for predicting diagnostic accuracy. The proposed solution aimed to improve the performance of the diagnosis and classification process for the machine learning repository prepared by the University of Wisconsin Hospital. The performance of KNN was compared with k-means clustering using the same dataset from the hospital.
In 2018, Reference [22] worked on the improvement of k-means clustering using the density canopy. The density canopy technique has been used for preprocessing procedures before running the k-means clustering as canopy results in the initial clustering centers. The proposed solution was tested and simulated with the well-known dataset from the UCI machine learning repository and a simulated dataset with noise. The simulation results showed an improvement in the performance of the k-means algorithm by up to 30.7% in terms of accuracy on the UCI dataset and up to 44.3% with the simulated data compared with the traditional k-means algorithm. However, the accuracy improvement went down to only 6% in some cases.
In 2019, Reference [23] proposed a novel and efficient clustering technique for big data clustering in Hadoop. The proposed solution tried to solve the efficiency of the k-means clustering algorithm by using a new hybrid algorithm on the basis of the precision, recall, F-measure. The performance of the solution was evaluated regarding the execution time and accuracy using the National Climatic Data Center (NCDC) dataset, which contains data from over 20,000 stations around the world. However, the analysis of the data result was not clear, and the execution time was not tested for a multi-k value.
In 2019, Reference [24] worked on the importance of the mixed data approach by addressing the limitations of most clustering algorithm solutions, which are focused on either numerical or categorical data only. The authors implemented the clustering technique on a real-world mixed-type dataset to validate the clustering with such data types. The results showed that using clustering techniques for mixed data in general via a discretization procedure may result in the loss of essential information.
In 2020, Reference [25] proposed an improvement of the k-means clustering algorithm for big data. The proposed solution aimed to improve the efficiency and accuracy of the data clustering. However, the experiment used a private artificial dataset with up to 50,000 records to measure the clustering performance compared with simple k-means. The results showed an improvement in the clustering speed of up to 50%.
In 2020, Reference [26] proposed a new solution to improve the k-means clustering for big data using the Hadoop parallel framework. The improvement was in the clustering process by dividing and merging clusters according to the distance between them. The points that did not belong to a cluster were divided to be clustered with the nearest clusters. Such a method should improve the efficiency of k-means clustering. The KDD99 dataset was used to evaluate the performance of the solution with shared memory space and a parallel Hadoop platform. The simulation results showed an improvement in the accuracy by up to 10%.
In 2021, Reference [27] worked on the improvement of k-means clustering for big data. The improved version of the k-means algorithm was applied with acceptable precision rates to big data with lower processing loads. Using this technique, the distance between points was used along with their variations with the last two iterations. Furthermore, those points were compared together with the research index cluster radius. The results showed an improvement in clustering by up to 41.8% in the base case. Many real datasets were used in the experiment, such as the Concrete, Abalone, Facebook, and CASP datasets. Moreover, the processing power requirement of the solution was high, which was performed with 12 GB of RAM and a multi-core Core i7 processor.
In 2021, Reference [28] analyzed the performance of the simple k-means algorithm and improved it with the parallel k-means technique to optimize the clustering of a dataset. The solution was designed to take advantage of multi-core machines to improve the clustering performance. The experiment used an education sector dataset with up to 10 times of evaluation to calculate the time elapsed and the number of iterations for the clustering. The overall performance was improved up to three-times faster than the simple k-means algorithm. Table 1 summarizes the related works that made an improvement of k-means clustering to work with big datasets.

Proposed Solution
The rapid growth of the data volume has led to the expansion of the use of clustering algorithms. However, the effectiveness of the traditional techniques is not always high because of the high computational load. The basic idea of the proposed algorithm is to take advantage of the high performance of smart mobile devices in machine learning through an artificial intelligence processor dedicated to machine learning algorithms. The main goal is to speed up the performance of the k-means algorithm to work better and give higher performance by taking advantage of the technology mentioned above. Due to the simplicity, efficiency, and adaptability of the k-means algorithm, it is considered the most popular clustering method. The cluster efficiency should be always maintained when involving a large dataset, while the intercluster and intracluster distances between data objects in a dataset should be considered. A new distributed clustering based on the k-means algorithm is suggested.
The proposed algorithm distributes a dataset equally to the smartphone device, which is connected to a cloud server. Then, it runs the clustering algorithm, taking advantage of the machine-learning-engine-dedicated processors in modern mobile devices. The goal is to preserve the dataset characteristics while improving the clustering performance. The new solution can handle the three issues in the k-means algorithm, which are: the number of k, optimal centers, and best clustering performance. The centers are calculated for each cluster on each device and compared later with the other clustering results that come from the devices in the network.
Algorithm 1 shows the main steps of the processing on the server's side. The Spark framework starts preparing the dataset to be sent to the clients, then begins receiving the requests from the clients. The server keeps listening to the clients, which request the content or send the results. The best result from the received results is selected. Each dataset is marked with the best random centroid points.

Require: Big dataset
Input: Prepare dataset for clients Start: Init database to receive results while f ile = IncomeHTTP − Stopped do Save: centroidpoints from clients end while Spark: select best result Save: best centroid points Output: centroid points for the dataset Figure 4 describes the complete layout for the network of the proposed solution. First, the big dataset is in the cloud server and managed by the cloud management server, which is responsible for the data flow, results, and optimal solution selection. Besides that, the Spark server system also works with the cloud management system to select the optimal solutions that come from the mobile devices in the network. Finally, the system requires a mobile device with a dedicated ML engine connected to the cloud server. The mobile device starts downloading the dataset from the server, then processes the data with the random centered point of the k-means algorithm. With the high performance of dedicated ML processors and taking advantage of the multi-core system of ML processors, mobile devices can cluster data at a high speed. Each mobile device in the network returns the random center point and clustering accuracy to the server, as shown in Figure 5.

Proposed Solution Processing
Parallel neural engine k-means clustering consists of three main steps, which are: In the first step, the mobile device's main processor divides the dataset into subdatasets. All neural processors within the mobile device system process these sub-datasets with an initial centroid and a specific number of clusters. Each processor calculates clusters, and then, the main processor collects the results to be prepared and sent to the server. This process continues until there is no change in the clusters. Each partition of the dataset has its centroid points; therefore: Send initial centroids: C1, C2, C3, C4, . . . , Cn and sub-datasets to all connected processors receive clusters and centroids from all the processors. The server recalculates the distance amongst all these centroid points later to detect the best performance from all results that are collected.
The number of incoming results to the server depends on the number of connected devices. Therefore and due to the vast amount of modern mobile devices (phones, tablets, etc.), many results are collected by the server. Such an amount of results requires a selection algorithm to select the best results of clustering, and as a result, the optimal center points can be expected in the future without requiring sending the data again to the mobile devices.
As described in Algorithm 2, the mobile device sends a request to the server to retrieve the dataset, then it initializes the centroid points with a random location.

Algorithm 2 k-means clustering on mobile devices.
Require: Request dataset file from the server Input: random centroid points Start: clustering points while f ile = end do Calculate: meanvalue end while Output: clustered points The cluster initialization requires partitioning all the dataset into k number of subdatasets where the initial clusters can be calculated with the following equation: initial cluster = f loor(n/k) where n is the total number of sub-data partitioning and k is the number of clusters. The mobile device starts clustering the objects with a loop until the end of the dataset file. The output of the process is the clustered points with the initial centroid points. The distance between centroid points is calculated on the server's side to determine the best result from the collected results, and this is achieved by applying the following equation: where C i is the initial centroid points and D is the data item, which should be less than the (n) value. Figure 6 summarizes the data processing on the mobile device. The data are requested from the server by the mobile device, then they are partitioned to be processed by the neural engine and general-purpose processor. The general-purpose processor is responsible for the general calculation, such as the initialization of the centroid points and calculating the variant of each cluster. The neural engine processor calculates the rest of the operations and tasks with the k-means algorithm. All these tasks run in parallel for both the neural and general-purpose processors.
The proposed solution does not directly affect the clustering precision because it does not edit the clustering precision part in the standard k-means clustering. The solution focuses on improving the performance of the standard k-means clustering by editing the way that k-means runs and clusters the objects as shown in Figure 6. It splits the clustering processes for both the neural and general-purpose processors; therefore, the core of the kmeans clustering itself is not affected. However, the precision is improved in different ways. Running k-means clustering with high performance and very low consumption regarding both the energy and system resources enables running the clustering process many times to obtain different results. The results collected from the mobile devices connected to the server are evaluated to select the best amongst them.

Complexity Analysis
k-means algorithms are comprised of two phases, which are computing the initial clusters after dividing the dataset into k equal parts and then calculating the arithmetic means where the second phase is to assign the item to an appropriate cluster. O(n) is the time complexity of the first phase, as it requires partitioning and finding a fixed period. On the other hand, the second phase has O(nkt) complexity, where n is the number of items in the data, k is the number of clusters, and t is the number of iterations that are performed to cluster all items.
Thus, the total time complexity of the k-means clustering algorithm is equal to adding the maximum of O(n) to O(nkt), that is: so the overall time complexity is O(nkt).

Analysis of the Experiment Results
The comparison between the proposed solution and the recent k-means algorithm improvements was based on the following terms: • Number of iterations, which represents the number of iteration processes that are required to cluster the dataset into (k), the number of clusters; • Cluster quality, which represents the performance of the clustering in terms of the accuracy of the clustering; • Elapsed time: this term is calculating the time that is required to cluster the dataset into (k) clusters.
Each of the above terms was calculated for each algorithm and compared with the proposed solution to present the quality and validity of the proposed solution.
The main results of the proposed solution are shown with all the above performance measurements and with two main big datasets, which are: the Google Play Store dataset and the KDD99 dataset.
The results obtained were compared with the k-means algorithm. The experiments showed the performance of the proposed algorithm for big data. The proposed solution was developed with ARM-based code with a dedicated machine learning code for the smartphone ML engines. For the iOS device, the Swift programming language was used, whereas the Kotlin programming language was used for the Android smartphone. For the server's side, the Python programming language with the Spark server was used with the Fedora Linux operating system. The k-means algorithm was run on a desktop-based processor (Intel Core i5 3.5 GHz with four cores) and the Unix-based operating system without an ML engine, then with mobile devices with ML engines. The mobile devices that were used in the experiment were the iPhone 11 Pro Max with an ML engine, which contains 16 cores dedicated to the machine learning process, and the Samsung S22, which includes a neural processing unit (NPU), as the Android smartphone with a 16 bit floatingpoint number (FP16). The experiments were divided into two categories: single-core neural engine performance and multi-core neural engine performance with mobile ARM cores.
Several datasets were used in the experiments to validate the proposed solution in many situations and dataset properties. All the dataset characteristics are given in Table 2.  [26] 71 MB 9,000,000

Neural Engine Performance
First, the k-means algorithm was evaluated with a single-core neural engine processor without using the mobile ARM processor. This means all the operations were examined with only neural cores. Table 1 shows the result of the experiment, which clearly shows the high performance of the mobile processors when running a machine learning algorithm compared with the non-ML processor.
The performance as shown in Table 3 was about twice as fast with a dedicated ML processor when processing the 11 million records of the Google Play Store dataset. The next experiment was performed on the education sector dataset, and the mobile processors showed a performance up to 10-times faster than the desktop processor. The second experiment ran the k-means algorithm on many devices connected to the server. This part of the experiment required testing with the real-world server; therefore, a testing cloud server was built. The result as shown in Figure 3 describes the effects of the number of mobile devices on the number of results in minutes by taking the time that was required to send and receive the requests from the devices. To validate the proposed solution compared with the recently advanced k-means, the number of iterations was calculated with different numbers of (k). Table 4, the number of processors used to perform the computations for each series run was verified. The number of clusters in the k-means algorithm was specified to generate efficient results for different variations of the clusters. Clustering is an issue that mainly depends on the data used and the problems considered. The proposed algorithms showed a significant increment in the efficiency of clustering in terms of execution. In Table 5, several iterations are fixed in the case of the parallel neural k-means algorithm using the education sector dataset, i.e., for k = 4, 5, 6, 7, but this kept changing from one run to another in the case of the parallel k-means clustering algorithm with multiple running times.   Figure 7 shows the overall performance of the proposed solution compared with simple k-means clustering and parallel k-means clustering using the education sector dataset. It is noticeable that the proposed solution outperformed the compared algorithms in terms of the number of iterations. The number of iterations decreased significantly due to the efficiency and performance of the dedicated neural engine processor. The neural engine processor was designed with ARM-v8-based technology, and this enabled executing each task with a single 64 bit instruction with up to 16 cores [29]. Furthermore, with the minimal energy consumption using the single-instruction technique, it enabled running all processor cores to execute tasks in parallel at the same time efficiently [30]. Such a technique lets the tasks in the clustering run on dedicated neural cores and as a result decreases the iterations number, which is required to run over all objects.

As shown in
In Table 6, the number of iterations is fixed in the case of the parallel neural k-means algorithm using the Google Play Store dataset, i.e., for k = 4, 5, 6, 7, but kept changing from one run to another in the case of the parallel k-means clustering algorithm with multiple running times. Figure 8 shows the overall performance of the proposed solution compared with simple k-means clustering and parallel k-means clustering using the Google Play Store dataset. It is noticeable that the proposed solution outperformed the compared algorithms in terms of the number of iterations by up to four-times better and up to ten-times better than simple k-means clustering.

Time Elapsed Performance
Calculating the time elapsed for the proposed solution is an important term as it presents the overall performance in terms of time. The performance of the solution compared with standard k-means (simple k-means) and parallel k-means was calculated, and the results are shown in Table 7 for the education sector dataset. To measure the performance of the solution compared with standard k-means (simple k-means) and parallel k-means with a big dataset, the elapsed time was calculated, and the results are shown in Table 8 for the Google Play Store dataset. The next experiments were performed over another big dataset and compared with an advanced k-means clustering algorithm [26]. To achieve a fair comparison, the KDD99 data that was used in [26] was used in the experiment with the same settings and row count for clustering. The results are shown in Table 9, giving the comparison amongst the proposed solution, single-machine clustering, and Hadoop platform clustering. It describes clearly the efficiency and high performance of the proposed solution compared with the advanced k-means clustering. The time consumption for the proposed solution was up to four-times faster than the single-machine and Hadoop platform clustering in the overall comparison.

Multiple Cores and Multiple Processors
In the second part of the experiment, the clustering algorithm was divided into two sections. First, the normal operations and tasks were run on mobile ARM processors and cores to take advantage of the high performance of the cores and preserve the energy of the device at the same time. Second, the k-means algorithm itself was run specifically on the neural engine cores of the mobile processor. Figure 9 shows that increasing the number of cores besides the neural engine cores significantly increased the performance of the overall clustering by up to two-times faster than using only neural engine cores for the overall system.

Conclusions
Processing big data efficiently requires using mobile-processor-based clustering. Many related clustering techniques were discussed in this paper to provide an overview of the clustering technique. k-means, which is a clustering-based algorithm, was used in this paper to implement the proposed technique. It can work efficiently with numerical data better than categorical data. Running k-means clustering on a machine-learning-based processor can significantly improve the performance and efficiency of the clustering. Using a distributed system can effectively handle the problem of running k-means clustering to find the optimal centroid points. Moreover, the number of iterations of clustering can be increased without affecting the speed of the overall system.