Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark

At the dawn of the 10V or big data era, there are a considerable number of sources such as smart phones, IoT devices, social media, smart city sensors, as well as the health care system, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing. Concerning the latter, new frameworks have been developed including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms and as a consequence many algorithmic techniques have been developed tailored for these platforms. This article relies extensively, in two ways, on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a vast number of classifiers is applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets. Specifically, the singular value decomposition of the data matrix first determines a set of transformed attributes which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar if not better level of the metrics of accuracy, recall, and F1. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. The experiments based on the same Spark cluster indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.


Introduction
In the big data era, the major drivers behind redefining application paradigms are the need for efficient and reliable distributed computing frameworks as well as the ability to handle large data volumes collected from fields as diverse as smart cities, digital health care, high resolution navigation systems, and digital cultural heritage. In addition, there is a need for programming languages which take into account, in an abstract and programmer friendly manner, both the defining characteristics of these frameworks and the recent lessons about defensive programming, code readability and maintenance, and version control. Such is the case of the Apache Spark framework, a significant part of the Hadoop ecosystem along with Kafka, which is used to build streaming platforms, and Hive, which aims to manage data in relational databases, in conjunction with Scala. The latter is a multi-paradigm (functional and object-oriented) programming language with a strong static type system running on top of the JVM ecosystem. Moreover, PySpark, namely the Spark Python API, has enjoyed remarkable success since it provides Python functionality for Spark Resilient Distributed Datasets (RDDs), the main Spark data abstraction type.
Among machine learning (ML) tasks, classification stands out as one of the most computationally intensive ones. This can be attributed not only to the massive data volume driving the advanced ML models but also to the time required to fetch a specific batch with dynamically formed or even ad hoc queries involving thousands of rows. Moreover, in the big data era, even elementary operations, such as a single inner product or a hash mapping, may require additional time because of the number of operands involved and their distribution across the shared file system. In addition, the iterative training process of most classifiers results in a high number of computations. In some very rare cases, even evaluating a loss function may take a significant portion of the total computations. In order to improve performance, feature selection or transformation can take place before the actual classification. Although this preprocessing step requires time, the resulting feature set may contain considerably fewer attributes, which may further represent the various classes in an easily separable way. In turn, this leads to a more efficient classification process in terms of both accuracy and total computation time.
The primary research objective of this article is twofold. Firstly, the two-step feature selection and classification methodology is described. As a preprocessing step, the singular value decomposition (SVD) has been selected as it efficiently identifies eigenfeatures hidden in massive datasets. As stated in our previous work, learning new data features while preserving old data features can be considered as one of the most important goals of incremental learning methods [1]. Secondly, the results of the proposed methodology applied to a considerable number of real and synthetic datasets using the Apache Spark MLlib, a machine learning library optimized for distributed systems which provides among others the SVD functionality, are discussed. Several classification models, such as decision trees, random forest, and logistic regression, have been investigated and their performance in terms of precision, recall, and the F1 metric, as the dataset size varies, has been recorded. As a secondary objective, the specifics of the Spark system, along with the PySpark and the SparkQL modules developed in order to perform the two-step classification, are outlined.
The principal motivation behind this article is to show the potential of preprocessing in a typical data science pipeline. Additionally, the distributed nature of Apache Spark makes the preprocessing step considerably easier, so that valuable insight into the attribute space can be gained with comparatively low complexity.
The remainder of this work is structured as follows. Section 2 briefly overviews the relevant scientific literature and various cloud computing methodologies. Section 3 presents the classification algorithms available in MLlib and used in the proposed approach along with the proposed methodology. Section 4 discusses the training steps, the analysis cases as well as the two datasets used. Moreover, Section 5 describes the evaluation experiments conducted and comments on the results. Finally, Section 6 recapitulates the main conclusions and draws directions for future work. Matrices are represented by boldface uppercase letters and vectors by boldface lowercase letters, whereas ordinary lowercase letters are reserved for scalars.

Data Mining and Two-Step Classification
The field of data mining came into existence relatively recently with the stated objective of systemizing the methodologies for extracting hidden patterns or other knowledge of interest from massive datasets. For instance, data mining provides the tools for discovering latent correlations between features, thus allowing feature transformation and dimensionality reduction [2]. Both are crucial in the extract-transform-load (ETL) cycle in databases. As stated in [3], knowledge discovery has a wide range of applications, with marketing, finance, and fraud detection being only some of them. The process of knowledge discovery is structured in several stages, the first of which is feature selection, as presented in Figure 1. The preprocessing and transformation steps follow and lead to the main stage of data mining, where a suitable algorithm or an ad hoc version of it extracts latent information in a form appropriate for future use [4]. Data mining consists of various tasks, most notably visualization, cleansing, missing value completion, outlier discovery, clustering, dimensionality reduction, and classification. In its most basic form, the latter consists of assigning labels taken from a finite set to algorithmic objects, for instance graphs, vectors, or countable sequences, based on a small number of known label-object pairings. A methodological approach to classification stems from the common engineering intuition of dividing hard optimization problems into simpler subtasks in order to obtain a solution. The classification techniques based on this approach are many and diverse. In the Gauss-Dantzig selector, linear programming performs very sparse model selection under certain inequality constraints and then ordinary least squares estimates the values of the model coefficients [5,6].
Another prime example is the generic scheme proposed in [7] for clustering objects stored in GIS databases, where in each clustering iteration there is a step dependent only on non-spatial attributes followed by a step where the location-based operations are executed. In addition, in [8], a methodology for deduplicating records in databases in an unsupervised way is described, where synthetic high quality records are created based on the underlying domain and are subsequently fed to a high performing classifier such as a support vector machine (SVM). In [9], an expectation-maximization (EM)-like clustering algorithm for fuzzy data based on belief functions is implemented and compared to one-step fuzzy classifiers. Finally, an implicit two-step spatiosocial clustering of trilingual tweets is proposed in [10], where a genetic algorithm takes into account both linguistic similarity and spatial proximity models. Other applications include limiting vulnerability to cyber attacks [11] and improving healthcare services [12].

Distributed Computing
The need for parallel and later for distributed processing and storage systems has been apparent since at least the first days of computers. Concerning the latter, after a long evolutionary course and with the rapid scaling of the Internet in the late 1990s and early 2000s, large storage systems were developed by virtually every major Internet company, since conventional RDBMS solutions proved inadequate for huge non-tabular data volumes [13]. However, in the beginning, these systems were relatively specialized. Google Dremel [14] constitutes a striking example of a distributed system for querying large datasets, while Google Pregel [15] is a system designed to cope with linked knowledge [16,17]. In the early 2010s, NoSQL databases, like Cassandra and HBase, arose in order to deal with non-tabular data types in a distributed manner, scaling out according to the BASE guideline set instead of scaling up according to the traditional ACID principles.
Regarding processing large datasets, Apache Spark [18], an integral part of the Hadoop ecosystem introduced in 2009 [19], is perhaps one of the most well-known platforms for massive distributed computing. Unlike Hadoop, which is based on the MapReduce computing paradigm, Spark is based on the directed acyclic graph (DAG) paradigm. In the latter, the computation flow must be arranged in a way which eliminates non-local loops so that intermediate results, which can be any primitive data type but also SQL tables, graphs, or tensors depending on the modules installed, can freely flow in the form of Resilient Distributed Datasets (RDDs), an abstraction for a fault-tolerant collection of elements distributed across many nodes. Alternatively, based on dual graph properties, the computations can be moved and executed on large data stores in order to avoid costly data copies. In contrast, the MapReduce paradigm [20] consists of three stages: map, shuffle, and reduce. In the map stage, the data are split into a number of chunks and each chunk is sent to a mapper which executes the Map() function. Its result is a set of <key, value> pairs which, during the shuffle stage, are grouped by key, and subsequently each key is fed into the proper reducer executing the Reduce() function. The result of the reduce phase may be fed to a new MapReduce job. The main advantage of DAG over MapReduce is that intermediate results do not have to be flushed to the disk, since DAG allows local iterative computations in memory, which can have a positive impact on performance [21]. Based on the success of Spark, other data mining platforms seek to exploit its potential. One example is Distributed Weka Spark [22], which is based on Weka [23] and implemented on top of Spark.
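The three MapReduce stages described above can be illustrated with a minimal single-machine sketch. It is not Hadoop code: the word-count task and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are assumptions chosen only to make the data flow visible.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map(): emit a <key, value> pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce(): combine all values that share a key."""
    return key, sum(values)


def map_reduce(chunks):
    """Run the three stages in sequence on a list of text chunks."""
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Two chunks stand in for the data splits sent to two mappers.
counts = map_reduce(["spark hadoop spark", "hadoop hive"])
```

In a real cluster each stage would run on different nodes and the output of `map_reduce` could feed a further job, which is where writing intermediate results to disk becomes costly.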
Because of the inherent complexity in terms of memory, computation, and communication of ML problems, these problems are prime applications for distributed systems. In [24], a kd-tree was implemented on Hadoop, while in [25], a fast parallel k-means clustering algorithm was developed based on MapReduce. Another example is Pegasus, a big graph mining tool built on MapReduce and introduced in [26]. What is more, other platforms for distributed computing can be considered. dist-keras is a version of the keras TensorFlow front-end supporting a number of distributed iterative optimization schemes, such as AdaGrad and Adam. H2O is a platform for large scale descriptive statistics and statistical ML methods including ℓ1 or lasso regularization and coordinate descent. Elephas aims at massively parallel neural networks of various configurations, robust numerical training procedures, as well as extensive convergence checks. Complementary to distributed computing, local parallel computing with GPUs or TPUs has also been applied to Deep Learning (DL). TensorFlow constitutes an open source low-level framework which was originally designed by Google to simulate brain circuits [27]. It relies on GPU computing and organizes its computations on graphs populated in sessions. Known front-ends include theano and the abovementioned keras. Keras has been used in a framework in order to estimate the probability that the next mention of any account will be to a verified account or simply when a tweet will be directed towards a verified account [28], which is an important metric of digital influence. Finally, PyTorch is an independent ecosystem for ML in Python for tasks including computer vision and NLP along with neural network training.
Furthermore, distributed algorithms exist for most, if not all, data mining tasks. For instance, the back error propagation (BEP) algorithm used to train feedforward neural networks is a distributed version of the gradient descent algorithm [29]. The scheme for hierarchical routing under energy conservation constraints in ad hoc networks presented in [30] and the density-based iterative method of [31] can be considered as basic examples of distributed clustering. Distributed classification for fully cooperative agents is described in [32]. Distributed string matching has been proposed in [33] as a means for understanding various types of human motion from body sensor readings. In addition, consensus methods have been considered for distributed signal classification in [34]. Moreover, distributed signal processing algorithms, such as those presented in [35] for signal classification in low power node sensor networks and in [36] for classifying acoustic objects in sensor arrays of variable topology based on sound propagation features, can serve as preprocessing steps for data mining methods. Monte Carlo simulations for Hadoop are described in [37].
Finally, Spark contains MLlib, a specialized distributed ML library, which also makes use of the cloud. MLlib contains efficient and scalable implementations of algorithms for classification, clustering, regression and collaborative filtering as well as APIs for Java, Scala and Python [38]. A sentiment analysis tool for binary and ternary classification of the emotional content of tweets based on Spark was proposed in [39]. Moreover, [40] is a survey of ML algorithms over Spark with distributed hash table (DHT) structures with a benchmark stored in Cassandra. Ultimately, a novel scheme based on Bloom filters in Spark for exploiting hashtags and emoticons in addition to natural language processing techniques inside large tweet collections in order to evaluate their sentiment polarity is introduced in [41].

Proposed Method
In this section, we discuss the classification algorithms utilized in our experiments as well as the proposed methodology. The focus lies on the relationship between the dataset size and the computation time needed to perform classification as well as on the relationship between the dataset size and the evolution of the metrics. The results of our work are presented in Tables 1-14. Recall, precision, and the F1 metric were used as the evaluation metrics of the different algorithms. In addition, the time needed to train each classifier is presented, as well as information about the analysis cases. However, before presenting the various classifiers used in the experiments, the important characteristics of 10V data will be discussed.

10V Data
Big data is a rather generic term which may actually be misleading in a number of significant cases, since the extraction of non-trivial knowledge from such data is complicated not only by sheer size but also by other factors such as the lack of structure or random patterns of missing values. In order to describe this kind of data, the term 10V data has been coined. The latter summarizes the following top data properties:
• Volume: The volume of 10V data greatly exceeds the main memory capacity and, therefore, the data have to be moved to secondary memory and possibly spread across a distributed storage system such as NFS, Minio, or HDFS. Thus, new computational strategies about moving the computations and not the data have to be developed.
• Velocity: This factor refers to the rate at which data are generated or refreshed. The data update rate depends on the application and ranges from milliseconds to hours. The larger the time scale, the bigger the need for a sizeable buffer zone.
• Variety: Collected 10V data may well be structured in various formats, semistructured, or unstructured according to the collection policies. It is not unlikely that the same dataset contains graphs, images, sound clips, maps, video, and text as well as raw measurements in binary blobs.
• Variability: Large datasets are bound to contain missing values and outliers, both of which need to be addressed with cleansing and anomaly discovery methods as appropriate. The latter are crucial in improving data quality and facilitating the construction of more efficient pipelines.
• Volatility: This factor refers to the useful data life span. Before the advent of 10V data it was not uncommon to store everything in data warehouses. Now there is pressure for more selective strategies. Choosing which entries or which transformed attributes to keep in long term storage is a major topic in data science.
• Visualization: Humans tend to understand better the "big picture" inherent in the processed and refined knowledge because of the way the brain operates. Thus, information visualization may well be the key to successfully conveying a message. Although lately the importance of storytelling techniques has gained traction, visualization remains the best and easiest way to describe sizeable amounts of knowledge.
• Vulnerability: Each and every computing device nowadays is a potential security threat. The systems of a data processing pipeline are no exception, especially if they collect measurements from the outside world.
• Veracity: Creating ad hoc queries and small scale tests easily is an important parameter in data quality. Given the unstructured nature of 10V data, the development of ad hoc queries is paramount as invaluable insight can be gained from them.
• Validity: This factor refers to how relevant the dataset is to the questions which are to be answered. This includes among others how data are sampled and collected, how frequently they are updated, how the collection methodology influences data integrity, and what transforms are appropriate.
• Value: The design of a specialized ML pipeline or the implementation of a generic one is an expensive action. Therefore, there should be at least some evidence that the knowledge contained in the raw data can be of tremendous benefit.

Decision Trees
Decision trees constitute a classification algorithm which can be represented as a tree structure [42]. The internal nodes of the tree represent features of the dataset and, depending on each feature's value, we navigate through the tree structure. This procedure continues until a leaf is reached, as a leaf represents an output class. The root of the tree is chosen as the feature which best partitions the training dataset by minimizing a loss function. This procedure is recursively executed on each subtree, each representing a dataset partition, until the training data are divided into subsets of the same class.
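The choice of the feature that best partitions the training data can be sketched as follows. The loss function is not specified above, so this minimal example assumes the Gini impurity, a common choice; the helper names `gini` and `best_split` and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure subset)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature index, threshold, loss) minimizing weighted Gini."""
    best = (None, None, float("inf"))
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must produce two non-empty partitions
            loss = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if loss < best[2]:
                best = (f, threshold, loss)
    return best


# Toy data: feature 0 separates the two classes, feature 1 does not.
rows = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
feature, threshold, loss = best_split(rows, labels)
```

Applying `best_split` recursively to the two partitions, and stopping when a partition contains a single class, yields the tree described in the text.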

Random Forest
Random forest is considered a generalization of the decision tree [43] as it consists of a set of decision trees. In order to classify a new input, we insert it in each tree of the forest, each of which "votes" for an output class. The class that receives the most votes is the output class that the random forest returns. To construct each tree, a sample of cases is drawn with replacement from the original data, one sample per tree of the forest, and subsequently a subset of the features of that sample is utilized in order to grow the tree.
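The sampling and voting steps described above can be sketched in a few lines. The three lambda rules below are hypothetical stand-ins for trained trees, and the sample size equal to the dataset size is an assumption (the usual bootstrap convention).

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) cases from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def forest_predict(trees, x):
    """Let every tree vote and return the class with the most votes."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: three simple threshold rules.
trees = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.2),
]

prediction = forest_predict(trees, [0.9, 0.2])
sample = bootstrap_sample(list(range(10)), random.Random(0))
```

For the input `[0.9, 0.2]` only the first rule votes for class 1, so the majority vote returns class 0.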

Logistic Regression
Logistic regression is a regression model that can be utilized when the dependent variable (i.e., the output class) is categorical. It uses a logistic function to express the relationship between the dependent and the independent (i.e., the features of each class) variables. Apache Spark supports both binomial and multinomial logistic regression; for example, assuming we have K output classes, one class is chosen as the pivot, subsequently K − 1 models are created, and the class with the largest probability among the K − 1 models is considered as the result.
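The pivot construction can be made concrete with a small sketch. It follows one common formulation, assumed here, in which the K − 1 linear models give the log-odds of each class against the pivot and the pivot probability is the remaining mass; the coefficients are hypothetical and not taken from any trained model.

```python
import math


def pivot_probabilities(models, x):
    """Class probabilities from K-1 (weights, bias) models; the pivot is last."""
    scores = [sum(w * v for w, v in zip(weights, x)) + bias
              for weights, bias in models]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)  # probability of the pivot class
    return probs


def predict(models, x):
    """Return the index of the most probable class."""
    probs = pivot_probabilities(models, x)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical coefficients for K = 3 classes (two models plus the pivot).
models = [([2.0, -1.0], 0.0), ([-1.0, 2.0], 0.0)]
probs = pivot_probabilities(models, [1.0, 0.0])
label = predict(models, [1.0, 0.0])
```

In this formulation the pivot class wins whenever all K − 1 scores are strongly negative, as happens for the input `[-3.0, -3.0]`.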

Gradient-boosted Trees
Gradient-boosted regression trees use tree averaging for ML purposes [44]. The difference is that, although they are based on tree averaging, small trees are used instead of many fully grown trees in order to avoid overfitting. Each new tree that is added attempts to minimize the current remaining regression error. Despite the fact that this classifier was initially proposed for regression purposes, it can predict the average output class and also minimize the squared loss.
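The idea that each new small tree fits the remaining error can be sketched with one-split trees (stumps) on a single feature. This is a simplified illustration under the squared loss, not the MLlib implementation; the learning rate, the number of rounds, and the toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=50, learning_rate=0.5):
    """Add stumps one by one, each fitted to the current residual error."""
    stumps = []
    predictions = [0.0] * len(ys)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: sum(learning_rate * s(x) for s in stumps)


# Toy binary targets encoded as 0/1 on a single feature.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.0, 0.0, 1.0, 1.0]
model = boost(xs, ys)
```

Thresholding the output of `model` at 0.5 turns the regression estimate into a class label for this binary toy problem.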

Multilayer Perceptron
A multilayer perceptron constitutes an artificial neural network model used to map a set of input vectors to a set of known classes. It comprises layers of nodes, where each node is fully connected through weighted connections to the nodes of the next layer. The multilayer perceptron can be trained for a given dataset with the back-propagation method, which adjusts the weights of the connections between each node and the nodes of the next layer in order to minimize the output error.

One-Vs-Rest
The one-vs-rest (or one-vs-all) classifier uses a base classifier which can efficiently perform binary classification. It trains a single classifier per output class by considering the portion of the data that belong to that class as positive and the rest as negative. In order to determine the label of the output class, it uses the classifier with the highest confidence score.
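The relabeling and the final decision rule can be sketched as follows. The base learner `fit_centroid_scorer` is a deliberately simple hypothetical stand-in (it scores by distance to the centroid of the positive examples); any binary classifier that returns a confidence score could take its place.

```python
def train_one_vs_rest(rows, labels, fit_binary):
    """Fit one binary scorer per class: that class positive, the rest negative."""
    return {
        cls: fit_binary(rows, [1 if y == cls else 0 for y in labels])
        for cls in sorted(set(labels))
    }


def predict_one_vs_rest(scorers, x):
    """Return the class whose scorer reports the highest confidence."""
    return max(scorers, key=lambda cls: scorers[cls](x))


def fit_centroid_scorer(rows, binary_labels):
    """Hypothetical base learner: confidence is the negative squared
    distance to the centroid of the positive examples."""
    positives = [row for row, y in zip(rows, binary_labels) if y == 1]
    centroid = [sum(col) / len(positives) for col in zip(*positives)]
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, centroid))


rows = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [0.0, 5.0], [0.1, 5.2]]
labels = ["a", "a", "b", "b", "c", "c"]
scorers = train_one_vs_rest(rows, labels, fit_centroid_scorer)
predicted = predict_one_vs_rest(scorers, [4.8, 5.1])
```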

Feature Selection
Feature selection is a challenging topic in classification since the set of features of minimum cardinality which also accurately describes a given dataset, in the sense that adding more features to that set does not appreciably improve a pre-specified classification performance metric, is rarely known in advance. On the other hand, too many attributes may in fact impede the classifier. SVD is a linear algebraic technique which relies on the factorization of the original data matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with m observations (rows) and n attributes (columns) in order to achieve dimensionality reduction as in Equation (1):

$$\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^{T} = \sum_{k=1}^{r} \sigma_{k,k} \, \mathbf{u}_k \mathbf{v}_k^{T} \qquad (1)$$

In Equation (1), $\mathbf{U}$ and $\mathbf{V}$ are matrices with orthonormal columns containing the respective bases for two distinct r-dimensional linear spaces that comprise eigenfeatures. The nature of these spaces depends directly on the underlying domain. For instance, in the document retrieval case, SVD is another name for latent semantic indexing (LSI). In the latter, $\mathbf{U}$ and $\mathbf{V}$ are the eigendocument and eigenterm spaces respectively, which combined yield the original document-term matrix. $\boldsymbol{\Sigma}$ is always a diagonal matrix with strictly positive diagonal entries $\sigma_{k,k}$, $1 \le k \le r$, and dictates how the two feature spaces are composed. As can be seen from the right form of Equation (1), this coupling is straightforward as only the k-th column $\mathbf{u}_k$ of $\mathbf{U}$ and the k-th row $\mathbf{v}_k^{T}$ of $\mathbf{V}^{T}$ are combined in an outer product scaled by $\sigma_{k,k}$. A key aspect of SVD is that r is directly discovered from $\mathbf{A}$ and, as a result, it belongs to the unsupervised ML methods.
SVD works best when the attributes are linearly connected. In that case, the resulting attribute set is typically smaller and yet captures most of the essence of the original attribute set. When there are non-linear connections in the feature set, then the SVD yields the orthogonal projection of the best attribute set to the observation data.
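As a hedged illustration of how the decomposition can yield a smaller attribute set, one common construction keeps only the k largest singular values, with k at most the rank r of the data matrix, and projects the observations onto the corresponding right singular vectors. Here $\mathbf{U}_k$ and $\mathbf{V}_k$ denote the first k columns of the singular vector matrices and $\boldsymbol{\Sigma}_k$ the leading k × k block of the diagonal factor; this notation and the choice of k are assumptions for the sketch, not necessarily the exact procedure followed in the experiments.

```latex
\mathbf{A}_k = \sum_{j=1}^{k} \sigma_{j,j} \, \mathbf{u}_j \mathbf{v}_j^{T},
\qquad
\mathbf{Z} = \mathbf{A} \mathbf{V}_k = \mathbf{U}_k \boldsymbol{\Sigma}_k \in \mathbb{R}^{m \times k},
\qquad
\left\| \mathbf{A} - \mathbf{A}_k \right\|_F^2 = \sum_{j=k+1}^{r} \sigma_{j,j}^{2}
```

Each row of $\mathbf{Z}$ is an observation expressed in k transformed attributes, and the last identity shows that the information discarded by the truncation is exactly the energy of the omitted singular values.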
As stated in the MLlib documentation on dimensionality reduction with the RDD-based API, spark.mllib provides an efficient SVD implementation for row oriented matrices in the RowMatrix class. According to the same documentation, two strategies can be followed:

• When n is smaller than 100 or when k, the number of singular values requested, exceeds n/2, the Gramian matrix $\mathbf{A}^{T}\mathbf{A}$ is computed and its top eigenvectors are subsequently computed locally at the Spark driver.
• Otherwise, the Gramian matrix is computed in a distributed way and its top eigenvectors are again locally computed at the driver as in the previous case.

Complexities
Ordinarily, the complexity of a full SVD execution in serial environments is considered to be exceedingly large, albeit polynomial, namely $O(n^3)$, where n is the longest dimension of $\mathbf{A}$. This is attributed to the fact that every right eigenvector of both the Gramian matrix and its transpose has to be computed in order for the factors $\mathbf{U}$ and $\mathbf{V}$ to be computed. Nevertheless, in our experiments, only the singular values, namely the diagonal entries $\sigma_{k,k}$, need to be computed in order to assess which feature is essential. This reduces the full SVD problem to the computation of the eigenvalues of the Gramian matrix or those of its transpose, whichever is shorter in dimensions. In turn, this can be reduced to computing the eigenvalues of an equivalent companion matrix. Due to the special structure of the latter, its eigenvalues can be computed in a quadratic number of steps.
Moreover, in a distributed system such as Spark, each node can undertake the computation of a chunk of eigenvalues with much lower complexity. This can be accomplished in a number of ways. For instance, given a set of orthonormal vectors as a starting point, the power method can be used in order to compute each eigenvalue. The upper bound for this method is pseudolinear in the size of the Gramian matrix or its transpose, but it may require more communication. On the other hand, the companion matrix method may require more operations but is cheaper in terms of communication. At any rate, the SVD implementation of MLlib is quite efficient and takes full advantage of the underlying distributed system.
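The link between the eigenvalues of the Gramian matrix and the singular values can be illustrated with a serial sketch of the power method for the largest singular value only. It is not the MLlib implementation: the all-ones starting vector, the fixed iteration count, and the helper names `gramian` and `top_singular_value` are assumptions.

```python
import math


def gramian(a):
    """Gramian A^T A of a matrix given as a list of rows."""
    cols = list(zip(*a))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]


def top_singular_value(a, iterations=200):
    """Largest singular value of A via power iteration on A^T A."""
    g = gramian(a)
    v = [1.0] * len(g)  # assumed start; must not be orthogonal to the top eigenvector
    for _ in range(iterations):
        w = [sum(gij * vj for gij, vj in zip(row, v)) for row in g]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient v^T G v approximates the dominant eigenvalue.
    gv = [sum(gij * vj for gij, vj in zip(row, v)) for row in g]
    eigenvalue = sum(x * y for x, y in zip(v, gv))
    # Singular values of A are square roots of the eigenvalues of A^T A.
    return math.sqrt(eigenvalue)


# The singular values of this toy matrix are 3 and 2.
a = [[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
sigma_1 = top_singular_value(a)
```

Subsequent singular values could be obtained by deflation, that is, by removing the component found so far and repeating the iteration, which is where the orthonormal starting vectors mentioned above come into play.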

Implementation
Our approach follows the proposal of [3], as presented in Section 2. However, since this procedure is dataset specific, we will discuss it separately for each dataset. First, we need to introduce the framework on which the computation took place. The overall architecture of the proposed system is depicted in Figure 2, taking into account the corresponding modules of our approach. Specifically, a preprocessing step is applied and subsequently the classification procedure is employed. Furthermore, Figure 3 takes a deeper look at the system by illustrating the Spark stack from an ML perspective. It can be seen that there are many specialized libraries providing functionalities, especially when there is no obvious implementation given the DAG model of Spark. Therefore, the development of data science applications and pipelines is greatly simplified. The complexity metric chosen was the wallclock time. Although in distributed systems there is inevitably network traffic involved in the computation, the Hadoop ecosystem is strongly built around the principle that the computations may be moved freely but data rarely are. Moreover, the local intensive computations are CPU bound. Thus, the wallclock time is a relatively safe indicator of the amount of time required.

Databricks
The creators of Apache Spark have also founded Databricks with the aim of providing researchers with a Web-based platform where they can store and analyse their data with Spark and perform analysis ranging from ad hoc queries to complete data pipelines. In addition, the platform provides computing clusters sized according to the needs of each case. Databricks comes in several different editions, among them the community one, which we have used in our experiments. This specific version offers researchers a mini cluster with 6 GB of RAM and also cloud storage. As the programming language for our implementations, Python (PySpark) was chosen. Another feature of Databricks is the dataframe (DF), an expansion of the RDD, which is a distributed collection of data but, unlike the RDD, organizes data in a tabular form. Furthermore, a DF provides certain optimizations in order to achieve faster processing. Of course, it is possible to transform a DF to an RDD and vice versa.

Higgs Dataset
The first dataset used in our experiments is the artificial Higgs dataset [45], which is provided by UCI (http://archive.ics.uci.edu/ml/datasets/HIGGS). This dataset was created using Monte Carlo simulations and contains 11M rows of data with 28 attributes, where 21 of them correspond to kinematic properties measured by accelerators and the last seven are functions of the previous ones. The purpose of this dataset is to provide researchers with tools to distinguish whether a collision could produce Higgs bosons.
The first task we performed was to check if the dataset needed any preprocessing. Since it was artificial, it only needed to be ensured that no null entries were present. Then, we performed two classification analyses. The first one contained all the 28 attributes whereas the second contained the first 21 attributes. There are a couple of things to consider before proceeding.
First, we chose not to perform an analysis on the remaining seven attributes alone, since they are so few that during the two-step classification process, where only some of them would be picked in order to perform the classification analysis, a considerable amount of data would be lost and the results would be unreliable. Second, we did not use any special rule during the splitting of the data. In order to perform the analyses, we split the original data into smaller segments as presented in Section 5. Finally, it has to be noted that this dataset is used for binary classification.

PAMAP Dataset
The second dataset in our experiments is the realistic PAMAP dataset [46], also provided by UCI (http://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring). It contains 2.8M rows of data from 8 subjects and 12 different activities. The devices used to obtain those measurements were inertial measurement units on the ankle, chest, and hands of the subject as well as a heart rate monitor. There were missing entries, mainly because of readings lost from the wireless sensors and a problematic hardware setup. In addition, the sampling frequency of the heart rate monitor was lower than that of the inertial measurement units. All those missing values were indicated as NaN.
Our first attempt to clean up the data by dropping all the rows containing at least one NaN value resulted in dropping 2.6M rows of data. To overcome this obstacle, we replaced all the intermediate NaN values of the heart rate monitor with the last known value. After that replacement, the dataset contained only 13K rows with one or more NaNs, which were ignored in the analysis. Additionally, rows with saturated accelerometer or invalid orientation readings were excluded too, reaching a total of 18K discarded rows. Apart from that, some additional preprocessing took place: the dataset contained a column with the timestamp at which each reading took place. That feature was also excluded as no time-history related analysis was conducted.
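The replacement of intermediate NaN values with the last known value, followed by the removal of the remaining incomplete rows, can be sketched in plain Python. The toy readings and the two-column layout are hypothetical; the actual preprocessing was carried out on the full dataset in Spark.

```python
import math


def forward_fill(values):
    """Replace each NaN with the last known value; leading NaNs stay NaN."""
    filled, last = [], math.nan
    for value in values:
        if not math.isnan(value):
            last = value
        filled.append(last)
    return filled


def drop_incomplete(rows):
    """Keep only the rows without any NaN entry."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]


nan = math.nan
# Hypothetical toy readings: the heart rate is sampled less often than
# the accelerometer, so most of its entries are initially missing.
heart_rate = [nan, 98.0, nan, nan, 101.0, nan]
acceleration = [0.1, 0.2, nan, 0.4, 0.5, 0.6]

rows = list(zip(forward_fill(heart_rate), acceleration))
clean = drop_incomplete(rows)
```

Only the first row (no heart rate known yet) and the third row (missing accelerometer value) are discarded, which mirrors how the fill step reduces the number of dropped rows.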
In order to perform the analyses, we split the original data into segments of 5K, 10K, 25K, 50K, 75K, 100K, and 200K rows. However, in order for the results to be comparable, we had to keep the class proportions the same across all experiments. Thus, we ensured that in each subset the fraction of observations belonging to a specific class is the same.
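One way to build subsets with identical class fractions is stratified sampling, sketched below. The helper `stratified_subset` and the toy labels are assumptions made for illustration; the exact procedure used in the experiments is not described beyond the requirement stated above.

```python
import random
from collections import Counter, defaultdict

def stratified_subset(labels, subset_size, seed=0):
    """Return row indices of a subset whose class fractions match `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label, indices in sorted(by_class.items()):
        # Quota proportional to the class share in the full dataset.
        # Rounding may make the total differ slightly from subset_size.
        quota = round(subset_size * len(indices) / len(labels))
        chosen.extend(rng.sample(indices, quota))
    return sorted(chosen)

# Hypothetical toy labels with class shares of 60%, 30%, and 10%.
labels = ["walking"] * 600 + ["sitting"] * 300 + ["running"] * 100
subset = stratified_subset(labels, subset_size=50)
counts = Counter(labels[i] for i in subset)
```

Calling the helper with different values of `subset_size` yields segments whose class shares agree up to rounding, which is what makes the results comparable across segment sizes.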

Analysis Cases
For each segment of both datasets, a series of classification analyses was performed.
• Decision trees were optimized based on the depth of the trees, with values ranging between 2 and 30.
• Random forest was optimized in terms of the number of trees in the forest, with values ranging between 1 and 60 in steps of two and a maximum tree height equal to 10.
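The parameter ranges above can be written down explicitly, as in the following sketch of a one-parameter search. It assumes that the range from 1 to 60 with a step of two means 1, 3, ..., 59. The scoring function `toy_f1` is a placeholder invented for this example and stands in for training a classifier and measuring its F1 metric.

```python
def best_parameter(candidates, score):
    """Return (best_value, best_score) for the candidate maximizing `score`."""
    best_value, best_score = None, float("-inf")
    for value in candidates:
        current = score(value)
        if current > best_score:
            best_value, best_score = value, current
    return best_value, best_score

# Candidate values as read from the text (assumed interpretation).
tree_depths = list(range(2, 31))       # 2, 3, ..., 30
forest_sizes = list(range(1, 61, 2))   # 1, 3, ..., 59

def toy_f1(depth):
    """Placeholder score peaking at depth 12; not a measured result."""
    return 1.0 - abs(depth - 12) / 100.0

best_depth, best_f1 = best_parameter(tree_depths, toy_f1)
```

In an actual experiment, `score` would fit the model on the training part of a segment and evaluate it on held-out rows.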

Evaluation
The results of our work are presented in Tables 1-15. The metrics of recall, precision, and F1 evaluate the performance of each classifier. In addition, the computation time necessary to train each classifier is presented, as well as information about the optimal classifier parameters.
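For reference, the three metrics can be computed from the counts of true positives, false positives, and false negatives, as in the sketch below for the binary case. The toy label vectors are invented. How the metrics were averaged over classes in the multiclass experiments is not stated here, so the sketch covers a single positive class only.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Invented example: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = binary_metrics(y_true, y_pred)
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.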

Analysis with all Features
The results for the decision tree classifier are presented in Table 1. As expected, as the subset size grew, the metrics improved, but the resulting tree also became bigger and more complex, with more nodes and paths, and it took more time to evaluate. However, it is worth noting that the relation between the dataset size and the computation time was not linear. For instance, a dataset 20 times bigger (from 5K rows to 100K rows) required less than twice the time of the small dataset, whereas for a dataset twice the size of the original, the computation time was almost the same.
In the case of random forest, the entries of Table 2 show that, as with the decision tree, the results improved as the dataset size grew. Additionally, the number of trees in the forest was approximately the same in almost every examined case. The sole exception was the dataset of 200K rows, in which case the forest had fewer trees and the trees themselves had fewer nodes.
In the case of the binomial logistic regression of Table 3, since no optimization was performed, the computation time was low. In addition, as expected, the metrics grew approximately with the dataset size.
Next, in Table 4, the results for the gradient boosted trees classifier are presented. As before, the bigger the dataset, the better the results and the higher the complexity of the returned tree.
Finally, the results of the multilayer perceptron classifier are presented in Table 5, where the bigger the dataset, the more accurate the classifications. The complexity of the neural network did not grow much as the dataset size increased. In fact, the smallest dataset returned the most hidden layers without yielding high metric values; an F1 score of 60.6% was the second lowest, after logistic regression.
To sum up the results, the gradient boosted trees classifier showed the best performance, achieving an F1 score of 72.5%, but it also had the highest training time. The binomial logistic regressor showed the poorest results, with a highest F1 score of 64.8%, but it was the fastest. It should be noted, though, that in terms of computational time this is not a fair comparison, since for all the other classifiers an optimization over a parameter was performed. Regarding the irregularities in the results, in the sense that smaller datasets sometimes gave better results than larger ones, these are probably due to overfitting, caused by the fact that we randomly split the dataset into subsets of a given number of rows.

Analysis with Limited Number of Features
Next, we repeated the same tests with the same settings, limiting the number of features that we used as input to both one-step and two-step classification experiments to the first 21 features.
The results of the decision tree classifier are presented in Table 6. As before, the larger the dataset, the better the classifier results and the more complex the resulting tree. Some observed irregularities might have been caused by bias in the subsets, since rows were selected at random. Specifically, the tree in the case of 5K rows was larger than the tree in the case of 10K rows for both the one-step and the two-step classifier, although in the latter to a lesser degree. In this case, the proposed two-step classifier performed the same as the one-step (normal) classification method in terms of tree complexity and F1 metric (about 2% better results in the best cases), although it required about 32% more time to calculate the classification model. Again, the computation time did not scale linearly with the dataset size.
In Table 7, the results of the random forest classifier are presented. Again, as the dataset grew, the results improved and the resulting model became more complex in terms of the total number of nodes. The one-step and two-step classification methods performed the same in terms of the F1 metric. The model of the two-step classifier was a bit simpler, but it required more time to compute (about 5% more in the case of 200K rows).
Results from the binomial logistic regression classifier are depicted in Table 8. The one-step and two-step classifiers performed almost equally in terms of the F1 metric, while the computation time was still low. In addition, as already observed, the bigger the dataset, the more accurate the classifier.
In Table 9, the results of the gradient boosted trees classifier are depicted. Again, the two classification methods performed almost equally in terms of the F1 metric, but like the random forest, the two-step method required more time for evaluating the model. As expected, as the dataset was getting bigger, the classification performance improved.
Finally, in Table 10, the results of the multilayer perceptron classifier are presented. As expected, the results improved as the dataset size got larger. Moreover, in terms of performance, the two-step classification method provided approximately 5% less accurate results (in terms of F1 metric), but required about 15% less time to calculate the model.

Discussion over Higgs Dataset
To sum up the results in the binomial classification case, the proposed two-step classification method produced, in most cases, about 5% less accurate results. Regarding the time needed to calculate the model, the results are not clear-cut. There were cases where the proposed method performed better than the one-step (normal) classification method, such as the multilayer perceptron, for which it took 15% less time, but there were also cases where it took more time, such as the decision tree classifier. There, with the limited number of features, it required about 32% more time to calculate the model, whereas there was only a minor increase in the metrics, e.g., 2%. It must be noted, though, that this outcome could be dataset specific, as the proposed method did not work as well as it did in the case with all features taken into account.
Before proceeding, we should note that, as a general rule of thumb, when all the features were taken into account, the classifiers performed better in terms of the F1 metric but required more time to compute the classification models. To be more specific, in the case of the one-step classification method, limiting the features cost about 7-10% in terms of the F1 metric in almost all cases, while in terms of computational time the following percentages were observed (for 200K rows). The multilayer perceptron required 13% less time to compute the model, the gradient boosted tree classifier took 15% less time, and the decision tree 20% less time, whereas random forest was almost a tie (it required 3% more time).
Examining the two-step classifier, in the case of decision trees we lost about 5% in the F1 metric, while in the case of random forest the loss was 6%. In addition, the loss was 4% for logistic regression and 5.5% for the gradient boosted tree classifier, while the multilayer perceptron had the highest loss of 12%. In terms of computational time, the results were almost tied (about 1% to 3% difference) for all but the multilayer perceptron classifier, which took 13% less time to compute. There is no point in comparing the computational time of the logistic regression classifier, since it is very low.

PAMAP Dataset
The second dataset that we examined was PAMAP, which involves multinomial classification. We performed a series of classification analyses, as with the Higgs dataset, with the exception of using the one-vs-rest classifier instead of the gradient boosted trees classifier, as the latter does not support multinomial classification.
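The one-vs-rest scheme reduces a multiclass problem to one binary problem per class and predicts the class whose binary model gives the highest score. The sketch below illustrates only this reduction. The article uses logistic regression as the base classifier, whereas here a simple centroid-based scorer is swapped in as an assumption to keep the example short and deterministic. All function names and the toy points are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_binary(points, is_positive):
    """Stand-in binary model: centroids of the positive and negative rows."""
    positives = [p for p, flag in zip(points, is_positive) if flag]
    negatives = [p for p, flag in zip(points, is_positive) if not flag]
    return centroid(positives), centroid(negatives)

def binary_score(model, point):
    """Higher when `point` is closer to the positive centroid."""
    positive_centroid, negative_centroid = model
    return (squared_distance(point, negative_centroid)
            - squared_distance(point, positive_centroid))

def train_one_vs_rest(points, labels):
    """Train one binary model per class: that class against all others."""
    return {cls: train_binary(points, [label == cls for label in labels])
            for cls in sorted(set(labels))}

def predict_one_vs_rest(models, point):
    """Pick the class whose binary model scores the point highest."""
    return max(models, key=lambda cls: binary_score(models[cls], point))

# Hypothetical toy data: three well separated activity clusters.
points = [(0, 0), (1, 0), (0, 1), (6, 0), (7, 0), (6, 1), (0, 6), (1, 6), (0, 7)]
labels = ["lying"] * 3 + ["walking"] * 3 + ["cycling"] * 3

models = train_one_vs_rest(points, labels)
predictions = [predict_one_vs_rest(models, p) for p in points]
```

Since one binary model is trained per class, the training cost of the scheme grows with the number of classes.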
In Table 11, the results of the decision tree classifier are presented. It can be immediately seen that the overall scores were higher than those obtained on the Higgs dataset. As before, the larger the dataset, the better the classifier results and the more complex the resulting tree. It is worth noting that in this case the height of the tree reached the highest possible value in all cases, yet its complexity still increased, since the number of nodes kept growing. In addition, the computational time, as expected, kept increasing along with the size of the dataset, but not in a linear manner; when the size of the dataset increased by 40 times (from 5K to 200K rows), the time needed nearly doubled. Comparing the proposed two-step classifier with the one-step classifier, we observe that the former performed almost the same as the one-step (normal) classification method in terms of the F1 metric, with the resulting tree being less complex. In addition, it required about 10-15% less time to produce the classification model.
Next, in Table 12, the results of the random forest classifier are presented. Again, as the dataset grew in size, the results improved and the resulting model became more complex in terms of the total number of nodes, although the values of the metrics were high in almost all cases. The one- and two-step classification methods performed almost the same in terms of the F1 metric (about 0.5% less accurate results for the two-step method), with the model of the two-step classifier being a bit simpler. In addition, in all but the 200K rows case, the two-step model required about 5% less time to compute (in the case of 200K rows, it required 0.5% extra time). As the dataset size grew, so did the three metrics, except in the case of 10K rows. Overall, this classifier produced slightly lower metrics than the decision tree classifier. Furthermore, its training time was considerably higher than that of the decision tree. When the size of the dataset increased by 40 times, almost 3 times more time was needed to train the classifier.
The next classifier is the multinomial logistic regression, whose results are depicted in Table 13. Again, as with the Higgs dataset, no parameter optimization was performed. With few exceptions, the metrics increased with the dataset size, although some fluctuation existed. The metric scores were lower than those of the decision tree and random forest classifiers. As for the scaling of time with the size of the dataset, for a 20 times bigger dataset (from 10K to 200K rows), the classifier's training time increased by 7 times. When comparing the two classification methods, the results were almost the same, with a 4% drop in the F1 metric in the case of the two-step classifier, which was also about 10% faster.
Table 14 depicts the results of the multilayer perceptron classifier. Concretely, the results improved as the dataset grew larger. The one- and two-step classification methods provided almost the same results, both in the F1 metric and in the time needed to produce the model.
Finally, the last classifier examined is one-vs-rest, presented in Table 15. As the base classifier, we chose logistic regression. The performance of this classifier was rather poor, both in the time needed to train it and in the metric scores. Examining again the F1 metric, it achieved 77%, while the logistic regression achieved about 80%. Considering time scaling with regard to the dataset size, for a 20 times bigger dataset, the classifier's training time increased by 3 times.
To sum up, in the case of multinomial classification, the proposed two-step classifier performed almost identically to the one-step classification method in terms of the F1 metric. However, in nearly all cases, the two-step classifier was faster and the produced models, as in the random forest and decision tree cases, were simpler. This is an indication that the preprocessing step successfully identified a smaller attribute set which captures the essence of the larger one, as was also the case with the Higgs dataset.
Overall, it can be argued that the best choice for the PAMAP dataset is the decision tree classifier, since it achieved better metric scores than all other classifiers and its training time scaled well with the dataset size. As discussed for the Higgs dataset as well, logistic regression was again the fastest to train, although the comparison is not fair since there was no parameter optimization.
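The time scaling figures quoted for PAMAP can be compared on a common footing through an empirical exponent. The sketch assumes that training time grows like a power of the dataset size, which is an assumption of this example and not a claim of the article. The ratios are the rounded values quoted in the text ("nearly doubled" read as 2, "almost 3 times more" read as a ratio of about 3), so the resulting exponents are only indicative.

```python
import math

def scaling_exponent(size_ratio, time_ratio):
    """Exponent a such that time_ratio == size_ratio ** a (power-law model)."""
    return math.log(time_ratio) / math.log(size_ratio)

# (dataset size ratio, training time ratio), rounded figures from the text.
reported = {
    "decision tree": (40, 2),
    "random forest": (40, 3),
    "multinomial logistic regression": (20, 7),
    "one-vs-rest": (20, 3),
}

exponents = {name: scaling_exponent(size, time)
             for name, (size, time) in reported.items()}

# An exponent of 1 would mean linear scaling; below 1 is sublinear.
sublinear = {name: a < 1.0 for name, a in exponents.items()}
```

Under this reading, the exponents are roughly 0.19, 0.30, 0.65, and 0.37 respectively, that is, all four classifiers scaled sublinearly on the cluster used, with the decision tree scaling most favorably.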

Conclusions and Future Work
This article focuses on two topics directly related to distributed ML, namely the performance of one- vs. two-step classification in terms of precision, recall, and the F1 metric, as well as the relationships of the dataset size with these metrics and with the total computation time. The two-step classification as a methodological framework is rooted in the engineering approach of dividing a complex task into simpler ones. So far the framework has yielded a number of important techniques such as the Gauss-Dantzig model selector and the EM algorithms for parameter estimation. The proposed architecture has the following general form:
• SVD has been initially applied to the original data matrix in order to obtain a low dimensional representation with a smaller attribute set.
• The classifier has been subsequently applied to the new attribute set in order to obtain the final labellings.
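The two steps can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions and not the distributed implementation used in the experiments. The leading right singular vectors are approximated by power iteration with deflation on the Gram matrix, and a nearest-centroid rule is swapped in for the MLlib classifiers. All names and the synthetic data are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leading_right_singular_vectors(data, k, iterations=200, seed=0):
    """Approximate the k leading right singular vectors of `data`.

    Power iteration with deflation on the Gram matrix data^T data.
    Assumes the data matrix has rank at least k.
    """
    rng = random.Random(seed)
    n_features = len(data[0])
    gram = [[sum(row[i] * row[j] for row in data) for j in range(n_features)]
            for i in range(n_features)]
    found = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for _ in range(iterations):
            w = [dot(gram_row, v) for gram_row in gram]
            for u in found:  # deflation: remove directions already found
                c = dot(w, u)
                w = [a - c * b for a, b in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            v = [a / norm for a in w]
        found.append(v)
    return found

def project(data, vectors):
    """Step 1: express each row in the reduced attribute set."""
    return [[dot(row, v) for v in vectors] for row in data]

def fit_centroids(points, labels):
    """Step 2 (stand-in classifier): one centroid per class."""
    grouped = {}
    for point, label in zip(points, labels):
        grouped.setdefault(label, []).append(point)
    return {label: [sum(c) / len(rows) for c in zip(*rows)]
            for label, rows in grouped.items()}

def predict(centroids, point):
    return min(centroids, key=lambda label: sum(
        (a - b) ** 2 for a, b in zip(point, centroids[label])))

# Synthetic data: 4 attributes, 2 classes, noise standard deviation 0.3.
rng = random.Random(1)
data, labels = [], []
for _ in range(50):
    data.append([5 + rng.gauss(0, 0.3), 5 + rng.gauss(0, 0.3),
                 rng.gauss(0, 0.3), rng.gauss(0, 0.3)])
    labels.append(0)
    data.append([rng.gauss(0, 0.3), rng.gauss(0, 0.3),
                 3 + rng.gauss(0, 0.3), 3 + rng.gauss(0, 0.3)])
    labels.append(1)

vectors = leading_right_singular_vectors(data, k=2)
reduced = project(data, vectors)          # 4 attributes -> 2 attributes
centroids = fit_centroids(reduced, labels)
predicted = [predict(centroids, point) for point in reduced]
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
```

The classifier only ever sees the reduced attributes, which is where the savings in model complexity are expected to originate. Note that the accuracy here is measured on the training rows of a synthetic example and says nothing about the reported results.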
In order to test our approach, we examined two different datasets, one binary and one multiclass, and we recorded the performance of various classification algorithms in a distributed environment. Each classification method had strengths and limitations, depending on the dataset. For example, on the Higgs dataset the decision tree was mediocre, but on PAMAP it outperformed all the other classification methods.
From this work, certain conclusions can be drawn. First and foremost, it shows the potential of Spark to apply well-known ML operations to big data seamlessly, without knowing how the dataset is distributed across HDFS. Second, it should be noted that it is not always apparent how an ML algorithm, especially an iterative one, can be implemented in the Spark DAG paradigm. This demonstrates the important contribution of MLlib, as it provides a number of important classification algorithms. Third, from an algorithmic perspective, it shows that modular architectures based on preprocessing can be more efficient, especially in a distributed environment, and thus the computational resources given to the preprocessing step are well spent. This is especially true for the SVD, which is typically considered expensive in traditional computing systems for large datasets. Fourth, sophisticated algorithms add significantly to the performance of a system. The wallclock time required to complete classification has been used as the complexity benchmark for reasons stated in Section 4.
As for future work, more datasets can serve as performance benchmarks of the proposed classification method. More tests will yield a better understanding of the optimal combinations of feature set size and classifier. Of course, the cluster on which we ran the classification methods is another key aspect of cloud computing in general, so checking the same or similar algorithms and datasets on different clusters may prove fundamental and reveal latent knowledge.

Funding: This article is part of Project 451, a long term research initiative whose primary objective is the development of novel, scalable, numerically stable, and interpretable tensor analytics.