Universal Reconfigurable Hardware Accelerator for Sparse Machine Learning Predictive Models

Abstract: This study presents a universal reconfigurable hardware accelerator for efficient processing of sparse decision trees, artificial neural networks and support vector machines. The main idea is to develop a hardware accelerator that can directly process sparse machine learning models, resulting in shorter inference times and lower power consumption compared to existing solutions. To the author's best knowledge, this is the first hardware accelerator of this type. Additionally, this is the first accelerator that is capable of processing sparse machine learning models of different types. Besides the hardware accelerator itself, algorithms for the induction of sparse decision trees and for the pruning of support vector machines and artificial neural networks are presented. Such sparse machine learning classifiers are attractive since they require significantly less memory for storing model parameters. This results in reduced data movement between the accelerator and the DRAM memory, as well as a reduced number of operations required to process input instances, leading to faster and more energy-efficient processing. This could be of significant interest in edge-based applications, with severely constrained memory, computation resources and power consumption. The performance of the algorithms and the developed hardware accelerator is demonstrated using standard benchmark datasets from the UCI Machine Learning Repository. The results of the experimental study reveal that the proposed algorithms and the presented hardware accelerator are superior when compared to some of the existing solutions. Throughput is increased up to 2 times for decision trees, 2.3 times for support vector machines and 38 times for artificial neural networks. When the processing latency is considered, the maximum performance improvement is even higher: up to a 4.4 times reduction for decision trees, an 84.1 times reduction for support vector machines and a 22.2 times reduction for artificial neural networks. Finally, since it is capable of supporting sparse classifiers, the usage of the proposed hardware accelerator leads to a significant reduction in energy spent on DRAM data transfers: 50.16% for decision trees, 93.65% for support vector machines and as much as 93.75% for artificial neural networks.


Introduction
Until recent discoveries of convolutional neural networks and the other deep learning architectures, multi layer perceptron (MLP) artificial neural networks (ANNs), decision trees (DTs) and support vector machines (SVMs) were recognized as the most popular predictive models in the field of machine learning (ML). Although CNNs have replaced ANNs, DTs and SVMs in the fields of computer vision and natural language processing, ANNs, DTs and SVMs are still among the most widely used predictive models in the field of data mining [1][2][3]. The implementation is based on the fact that the DOT product of two multi-dimensional vectors, as the core data processing operation, is common for all three supported ML models. The same authors extended their solution to homogeneous and heterogeneous ensemble classifiers in [34].
Another way to improve classifier throughput performance is to compress and reduce the size of the predictive model, by using different sparsification approaches. Sparsification techniques have been mainly explored in the field of ANNs and convolutional neural networks (CNNs) [35][36][37][38][39][40] and have resulted in the significant reduction in model size. Compression has been significantly less explored in the fields of DTs and SVMs. Authors in [41,42] recognized the benefits of minimizing the number of non-zero hyperplane coefficients in oblique DTs. However, the focus in these studies was a feature/attribute selection and detection of irrelevant and noisy features and not the reduction in model size or inference process acceleration. Compression of the SVM size by using a smart selection of support vectors during the training was presented in [43], while in [44] authors proposed the complementary idea for the SVM size reduction through the removal of attributes from support vectors.
The benefits of having a sparse predictive model are fully exploited only when the model is executed on a hardware accelerator that can process sparsified models. In the field of CNN accelerators this has been heavily exploited, resulting in numerous proposed solutions [45][46][47][48][49]. Surprisingly, in the field of DTs, SVMs and ANNs, only a handful of hardware accelerators capable of directly processing sparse models have been proposed [44,50], despite the obvious benefits of accelerating sparse ML predictive models. A hardware accelerator for sparse oblique DTs was presented in [50], where it was reported that oblique DT sparsification led to both instance processing speedup and memory reduction. A hardware accelerator for sparse SVMs was proposed in [44], with similar benefits regarding the improvements in inference speed.
To the best of our knowledge, there is no published result concerned with the development of a hardware accelerator capable of accelerating different sparse ML model types, like DTs, SVMs and ANNs. This could be of a great interest for the applications relying on using hybrid-classifier systems, for example, [51][52][53][54][55][56].
In this study, we present the Sparse Reconfigurable Machine Learning Classifier (SRMLC), an application-specific hardware accelerator for efficient processing of sparsified decision trees, support vector machines and artificial neural networks. The SRMLC is based on the implementation proposed in [33], where the underlying core operation is MAC (multiply and accumulate); however, it is optimized to support the acceleration of sparse predictive models in which the majority of model weights are set to zero. Consequently, without any performance degradation in terms of classifier accuracy, the SRMLC processing latency is significantly reduced, as a result of skipping numerous MAC operations with zero-valued operands. Compared to previously published results in [33], there are five major contributions of our approach:
1. Sparsification: our design is the first universal reconfigurable machine learning classifier accelerator which is optimized to support sparse data representations and which benefits from such sparse data manipulations.
2. Scalability: one of the major design goals during the development of the SRMLC architecture was better scalability on FPGA platforms. This is feasible because one MAC unit, the basic building block of the SRMLC, uses one DSP, one BRAM and 300 LUTs, whereas the scalability of the previously published RMLC [33] was constrained by the use of too many BRAM blocks per DSP.
3. Improved throughput: as a result of the classifier sparsification, the SRMLC has significantly improved throughput for DTs and SVMs. Regarding MLP ANNs, throughput is improved because a single layer can be assigned to multiple MAC units for processing, which is not possible in the architecture proposed in [33].
4. Reduced processing latency: the SRMLC considerably reduces the instance processing latency, as it allows multiple MAC blocks to be used for processing a single classification instance. The latency is reduced for all supported classifier types.
5. Energy efficiency: since the SRMLC uses a sparse data representation, it suffers much less from the well-known issue of power-hungry data transfers between the DRAM and the accelerator. Improved energy efficiency is a consequence of the significantly reduced amount of data that needs to be transferred between the external DRAM and the accelerator core.
To the best of our knowledge, the SRMLC is the first reported hardware accelerator for sparse classifiers of this kind. In order to demonstrate functionality of the SRMLC, in this study we will also present algorithms for the training of sparse ANNs, SVMs and DTs and translate the compressed trained models into the sparse binary format, which can be directly handled by the SRMLC.
The remainder of this study is organized as follows. In Section 2, we present training algorithms for sparse classifiers which can be used to obtain predictive models that are significantly reduced in size. Three algorithms are presented: one for sparse ANN training, another for sparse DT induction and the third for sparse SVM training. In Section 3, a universal reconfigurable hardware accelerator for sparse classifiers is introduced. The proposed hardware accelerator benefits from the sparsity in predictive models and performs faster classifications by computing only multiplication operations with non-zero operands. In Section 4, we report the experimental results of benchmarking the SRMLC architecture using datasets from the UCI Machine Learning Repository. The conclusion and final remarks are given in Section 5.

Training of Sparse Predictive Models
In order to provide a sparse representation that will be efficiently processed by the SRMLC, a classifier's training process is modified by removing model parameters, according to the desired level of compression. However, the reduction in predictive model size has to take into account the resulting model accuracy drop. Usually, in the available literature, 1% of the absolute accuracy drop is acceptable when sparsifying the predictive model, so we have used this as a reference during our training process: model sparsification stops when the absolute accuracy of the sparse model is more than 1% below the absolute accuracy of a non-sparsified model. The same approach is used for all three classifiers, even though the training process itself is significantly different for each classifier type.

Pruning of ANN Model during Training
First, we present a pruning algorithm for the ANN model, which is used during the ANN training phase in order to obtain a sparse ANN model.
An ANN can be considered as a weighted directed graph, where nodes are artificial neurons connected by directed weighted edges. Recurrent ANNs allow feedback connections, while feed-forward ANNs do not. In the multilayer perceptron (MLP), a widely used type of feed-forward ANN, individual neurons are organized into layers and the only connections allowed are those between adjacent layers of the network. Neurons are connected in a feed-forward manner, with no connections between neurons of the same layer and no feedback connections between the layers. The structure of the MLP ANN is shown in Figure 1. Three types of layers exist in an MLP ANN: input, hidden and output. The input layer is composed of N input neurons, where N is the number of problem attributes. The output layer calculates the output values of the MLP ANN, while all layers between the input and the output are considered hidden. Each hidden layer consists of an arbitrary number of neurons. Each (let us say k-th) neuron in a hidden or the output layer calculates its output as:
$$y_k = f\left(\sum_{i=1}^{N} w_{ki} x_i + b_k\right)$$
where $x = (x_1, x_2, \ldots, x_N)$ is the input vector for the neuron, containing N neuron outputs from the previous ANN layer, $w_k = (w_{k1}, w_{k2}, \ldots, w_{kN})$ is a weight vector and $b_k$ is a scalar, usually called the offset. For neurons in the input layer, x is a vector holding the attribute values of the input instance. As mentioned above, for the neurons within hidden and output layers, x is the vector composed of the outputs of the neurons in the previous layer of the network. The weight vector for each layer has the same length as x for that particular layer. The function $f : \mathbb{R} \to \mathbb{R}$ is called the activation function and it can be either a linear or a nonlinear real function.
Many different activation functions can be used in MLP ANNs, for example, the hyperbolic tangent and sigmoid functions:
$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
$$f(x) = \frac{1}{1 + e^{-x}}$$
In order to understand the ANN pruning process, let us observe a hidden or the output layer with M neurons and with an N-dimensional input x. Then, the output of this layer can be written as
$$Y_{1 \times M} = f\left(x_{1 \times N} W_{N \times M} + B_{1 \times M}\right)$$
where $x_{1 \times N} = (x_1, x_2, \ldots, x_N)$ is the input of the layer, $W_{N \times M}$ is the matrix of layer weights, in which column k holds the weight vector of the k-th neuron, and $B_{1 \times M} = (b_1, b_2, \ldots, b_M)$ is a vector consisting of the corresponding layer's neuron offsets. The function f is applied to each element separately. Finally, $Y_{1 \times M} = (y_1, y_2, \ldots, y_M)$ is an output vector consisting of the outputs of all neurons in the given layer. The training of the ANN is the process by which a dataset consisting of $N_{inst}$ problem instances is applied to the ANN. During the training, the network parameters W and B (for each layer) are fine-tuned so that the ANN output, for each given input instance, is equal to the expected output.
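The layer computation described above can be illustrated with a short Python sketch. It assumes that the layer output has the form Y = f(xW + B), with W stored as an N × M nested list whose entry W[i][k] is the weight from input i to neuron k; the function names are hypothetical and chosen only for this illustration. Skipping zero-valued weights in software mirrors, at a conceptual level only, the multiplications that a sparse-aware accelerator can avoid.

```python
import math
from typing import Callable, List, Sequence


def sigmoid(v: float) -> float:
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-v))


def layer_output(
    x: Sequence[float],
    W: Sequence[Sequence[float]],
    B: Sequence[float],
    f: Callable[[float], float] = math.tanh,
) -> List[float]:
    """Compute f(xW + B) for one layer (assumed layout: W[i][k])."""
    outputs = []
    for k in range(len(B)):
        acc = B[k]
        for i, x_i in enumerate(x):
            w = W[i][k]
            if w != 0.0:  # pruned weight: the multiplication is skipped
                acc += x_i * w
        outputs.append(f(acc))
    return outputs


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 0.5]]
    print(layer_output([1.0, 2.0], W, [-1.0, -1.0], f=sigmoid))
```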
The ANN pruning algorithm is run once the ANN training is complete. The goal of the pruning algorithm is to determine which network weights are the least significant and to set the corresponding elements of matrix W to zero, for each ANN layer. The algorithm iteratively repeats this procedure, as long as the accuracy drop of the reduced model is acceptable. As a result, the pruned ANN has the majority of weights set to zero, allowing the SRMLC to skip all computations where these zero-valued weights are multiplicands in multiply operations. In order to prune the ANN, the pruning algorithm starts with a low pruning factor MIN_PRUN_FACT (for example 10%) and increases the pruning factor by PRUN_FACT_INC_STEP (for example 5%) in each iteration.
At the beginning, Algorithm 1 initializes the current pruning factor and evaluates the non-pruned trained ANN model. At line 5, a main loop starts and repeats as long as the accuracy of a pruned model is not degraded. Algorithm creates an empty array abs_W_array of the same size as W_array and populates it with the absolute values from corresponding elements in W_array, at lines 7-10. W_array, which is an input to the algorithm, represents all connections within the MLP ANN and hence holds W matrices from Equation (5) for all layers in the given MLP ANN. At line 11, a required number of zero elements is determined based on the current pruning factor, number of ANN layers and matrix W size. At lines 13-34, num_zero_elem minimum elements in abs_W_array are found and corresponding elements in temp_W_array are set to zero. In order to avoid selection of the same element several times, min_elem_set is used to store previously selected elements. Prior to updating current_acc at line 37, we retrain the MLP ANN with modified weights in order to increase the classifying accuracy, at line 36. Please note that temp_W_array is used for pruning, while W_array is only updated once we are confident that the accuracy of pruned model is still acceptable, at the beginning of a new iteration. If W_array was updated directly, the returned value from the algorithm would not be the correct one.
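The iterative procedure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 1: all layer weights are flattened into one list, `evaluate` and `retrain` are hypothetical callbacks supplied by the caller (the retraining step is assumed to keep pruned weights at zero), the pruning factor is expressed in whole percent, and the 1% tolerance follows the stopping rule stated at the beginning of this section.

```python
from typing import Callable, List

MIN_PRUN_PERC = 10        # starting pruning factor in percent (example value)
PRUN_PERC_INC_STEP = 5    # increase per iteration in percent (example value)
MAX_ACC_DROP = 0.01       # tolerated absolute accuracy drop (1%)


def prune_smallest(weights: List[float], prune_perc: int) -> List[float]:
    """Return a copy with the smallest-magnitude prune_perc% of weights zeroed."""
    num_zero = prune_perc * len(weights) // 100
    # Indices ordered by absolute value; already-zero weights come first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_zero]:
        pruned[i] = 0.0
    return pruned


def iterative_prune(
    weights: List[float],
    evaluate: Callable[[List[float]], float],
    retrain: Callable[[List[float]], List[float]],
) -> List[float]:
    """Raise the pruning factor until the accuracy drop becomes unacceptable."""
    baseline_acc = evaluate(weights)
    accepted = list(weights)
    prune_perc = MIN_PRUN_PERC
    while prune_perc <= 100:
        # Work on a temporary copy; 'accepted' changes only after the check.
        candidate = retrain(prune_smallest(accepted, prune_perc))
        if baseline_acc - evaluate(candidate) > MAX_ACC_DROP:
            break
        accepted = candidate
        prune_perc += PRUN_PERC_INC_STEP
    return accepted
```

Keeping the last accepted weights separate from the candidate plays the same role as the distinction between W_array and temp_W_array in the text: the returned model is always the last one whose accuracy was verified.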

DT Model Sparsification during Induction
For a given classification problem, represented by a set of n numerical attributes $A_i$, i = 1, 2, ..., n, axis-parallel DTs compare a single attribute $A_i$ of the test instance against the corresponding threshold $a_i$ and check whether $A_i > a_i$. Such a test is performed in each DT node. Inducing (training) an axis-parallel (or orthogonal) DT means assigning an attribute to each DT node, which fixes the order of comparisons, and determining the threshold value $a_i$ required for each comparison. Oblique DTs are a generalization of axis-parallel DTs that allows multiple attributes to be tested in each DT node. As a result, oblique DTs are usually smaller in size, while providing higher classification accuracy, when compared to the corresponding axis-parallel DTs. In oblique DTs, multivariate testing is expressed with the following formula:
$$\sum_{i=1}^{n} a_i A_i + a_{n+1} > 0 \qquad (6)$$
In Equation (6), the coefficients $a_i$, i = 1, ..., n + 1, are called the hyperplane coefficients. Finding the optimal oblique DT is proven to be an NP-complete problem [57], so many oblique DT induction algorithms use some kind of heuristic in order to find sub-optimal hyperplane coefficients, some of them being [58,59]. The HereBoy evolutionary algorithm [60] was used in [61] to solve the hard oblique DT induction problem, while the authors in [50] extended the algorithm in order to support the induction of sparse oblique DTs. In order to obtain sparse DTs with high accuracy, which will be processed by the SRMLC, we decided to use a similar recursive evolutionary algorithm for the induction of sparse DTs in this study.
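To make the node test concrete, the following Python sketch evaluates one oblique DT node. It assumes the test takes the form "weighted sum of attributes plus the last coefficient is greater than zero", with the n + 1 hyperplane coefficients passed as a single sequence; the function name and the returned multiplication count are illustrative additions. The count shows why zero-valued coefficients matter: each one removes a multiplication.

```python
from typing import Sequence, Tuple


def oblique_test(coeffs: Sequence[float], attrs: Sequence[float]) -> Tuple[bool, int]:
    """Evaluate sum(a_i * A_i) + a_{n+1} > 0, skipping zero coefficients.

    coeffs holds n + 1 values (the last one is the free term a_{n+1});
    attrs holds the n attribute values of the instance.
    Returns (test outcome, number of multiplications performed).
    """
    *weights, free_term = coeffs
    total = free_term
    mults = 0
    for a, value in zip(weights, attrs):
        if a != 0.0:  # sparsified coefficient: no multiplication needed
            total += a * value
            mults += 1
    return total > 0, mults


if __name__ == "__main__":
    # Sparse oblique node: only attributes 2 and 4 take part in the test.
    print(oblique_test([0.0, 2.0, 0.0, -1.0, 0.5], [3.0, 1.0, 7.0, 4.0]))
    # Axis-parallel special case A_2 > 0.5: a single non-zero coefficient.
    print(oblique_test([0.0, 1.0, 0.0, 0.0, -0.5], [3.0, 1.0, 7.0, 4.0]))
```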
Algorithm 2 builds a maximally sparsified DT which has an acceptable accuracy drop compared to the non-sparsified DT. At lines 1-2, a non-sparsified DT is built using Algorithm 3 and its accuracy is evaluated on a validation subset. Using the same Algorithm 3, at lines 5-10, sparsified DTs are built, each time with an increased sparsification factor (starting from MIN_SPARS_FACT and increased by SPARS_INC_FACT in each iteration), until the evaluated accuracy drops below the tolerated threshold. Algorithm 3, used by Algorithm 2, is a recursive algorithm which builds a sparse oblique DT based on a target sparsification factor (the required percentage of zero-valued coefficients in the output DT), provided as an input parameter. The other input to the algorithm is a set of input instances, where each instance contains the instance attributes and the output class. At line 1, a new node is created. Lines 2-4 represent the terminating condition of the recursive algorithm: once all input instances belong to the same class, the node root is marked as a leaf and the corresponding label is added. If this is not the case, the most appropriate label is assigned to this node before entering the loop at lines 7-22. In this loop, after finding a sub-optimal hyperplane position by using the HereBoy algorithm, the required number of hyperplane coefficients, calculated from sparse_coef_perc, is set to zero. This is done in the loop at lines 11-19 where, one after the other, each coefficient in a copy of hyperplane_coefs is set to zero and the resulting fitness is calculated. The coefficient which has the smallest impact on the fitness is considered the least important and is set to zero after the complete set of coefficients has been investigated (lines 20-21). This is repeated until the percentage of zero-valued coefficients reaches sparse_coef_perc.
When completed, at lines 23 and 24, the input instance set input_instances is split into two disjoint subsets based on the position of the instances relative to the hyperplane. The two subsets are then used as inputs for the recursive calls of Algorithm 3 (lines 25-26), where the right and the left subtrees of the node root are built in the same manner. Once a complete DT is built, the root node representing the tree is returned from the algorithm. Next, we explain in more detail the evolutionary algorithm Find_best_hyperplane_pos, which is invoked at line 8.
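The coefficient elimination loop described above can be sketched in Python as follows. This is a minimal sketch under assumptions: `fitness` is a hypothetical callback that scores a coefficient list (higher is better), "smallest impact" is interpreted as the highest fitness remaining after zeroing a coefficient, and all coefficients passed in, including the free term, are treated as candidates. None of these details are confirmed by the text.

```python
from typing import Callable, List


def sparsify_hyperplane(
    coeffs: List[float],
    fitness: Callable[[List[float]], float],
    sparse_coef_perc: int,
) -> List[float]:
    """Greedily zero the least important coefficients until the target is met."""
    coeffs = list(coeffs)  # work on a copy, leave the caller's list intact
    target_zeros = sparse_coef_perc * len(coeffs) // 100
    while sum(1 for c in coeffs if c == 0.0) < target_zeros:
        best_idx, best_fit = None, float("-inf")
        for i, c in enumerate(coeffs):
            if c == 0.0:
                continue
            trial = list(coeffs)
            trial[i] = 0.0          # tentatively remove coefficient i
            trial_fit = fitness(trial)
            if trial_fit > best_fit:
                best_idx, best_fit = i, trial_fit
        if best_idx is None:        # nothing left to zero
            break
        coeffs[best_idx] = 0.0      # commit the least harmful removal
    return coeffs
```

Each pass evaluates the fitness once per remaining non-zero coefficient, so the cost grows with the product of the number of coefficients and the number of coefficients to remove.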
The input for Algorithm 4 is a set of input instances. First, it creates initial hyperplane coefficients, ensuring that not all instances from input_instances are located on the same side of the hyperplane. The algorithm then iterates through a predefined number of generations, mutating the hyperplane coefficients with a certain probability in each generation. Mutation here refers to a random bit flip in the binary representation of the hyperplane coefficients. If the new, mutated hyperplane has a better fitness than the old one, it proceeds to the next generation as the best individual. At the end, after NUM_GENERATIONS generations, the best individual is returned as the output of the algorithm.
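The mutate-and-keep-the-best loop described above can be sketched in Python. This is a heavily simplified illustration, not the HereBoy algorithm of [60] or Algorithm 4 itself: coefficients are assumed to be 16-bit two's complement integers, one bit is flipped per mutation, the mutation probability is fixed, and `fitness` is a hypothetical callback (higher is better). The seeded generator is used only to make the sketch reproducible.

```python
import random
from typing import Callable, List

COEF_BITS = 16  # assumed fixed-point width of one coefficient


def flip_random_bit(coeffs: List[int], rng: random.Random) -> List[int]:
    """Return a copy of coeffs with one randomly chosen bit flipped."""
    mutated = list(coeffs)
    idx = rng.randrange(len(mutated))
    bit = rng.randrange(COEF_BITS)
    mask = (1 << COEF_BITS) - 1
    raw = (mutated[idx] & mask) ^ (1 << bit)  # two's complement bit pattern
    # Convert the bit pattern back to a signed integer.
    mutated[idx] = raw - (1 << COEF_BITS) if raw >> (COEF_BITS - 1) else raw
    return mutated


def find_best_hyperplane_pos(
    initial: List[int],
    fitness: Callable[[List[int]], float],
    num_generations: int,
    mutation_prob: float,
    seed: int = 0,
) -> List[int]:
    """Keep a mutated hyperplane only when its fitness improves."""
    rng = random.Random(seed)
    best, best_fit = list(initial), fitness(initial)
    for _ in range(num_generations):
        if rng.random() >= mutation_prob:
            continue  # no mutation in this generation
        candidate = flip_random_bit(best, rng)
        cand_fit = fitness(candidate)
        if cand_fit > best_fit:
            best, best_fit = candidate, cand_fit
    return best
```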
Algorithm 5 is used to calculate the fitness in both Algorithms 3 and 4. Algorithm 5 first finds k, the total number of classes in the given classification problem. Then, the total number of input instances, N, is determined as the length of the input array input_instances which is used in the current DT node. $N_i$ is determined as the number of training instances, out of the total of N instances, that belong to class i, i = 1, ..., k. Finally, $N_{1i}$ is calculated as the number of instances belonging to class i, i = 1, ..., k, that are located above the hyperplane represented by hyperplane_coefficients. Algorithm 5 returns a value 0 ≤ fitness ≤ 1.

5: $N_{1i}$ ← number of training instances that belong to class i and are located above the hyperplane defined by hyperplane_coefficients
6: end for

Attribute Sparse Training of SVM
Similar to other supervised machine learning algorithms, SVMs are first trained during the learning phase, followed by the predicting phase, in which SVMs are used to classify input instances. During the learning phase, a training set TS with m instances is used:
$$TS = \{(x_i, y_i) \mid x_i \in \mathbb{R}^n,\ y_i \in \{-1, +1\},\ i = 1, \ldots, m\}$$
The goal of SVM training is to design a hyperplane which separates the "positive" input instances from the "negative" ones, while trying to maximize the margin between the hyperplane and the closest instances on both sides. The hyperplane equation is
$$w \cdot x + b = 0$$
Even though the hyperplane is designed with the constraint to maximize the distance from the closest input instances, in general, this condition cannot be fulfilled for all input instances. In order to solve this, the SVM algorithm allows incorrect classification of some of the instances from the training set. The problem of finding an optimal separating hyperplane can be defined formally as the constrained quadratic programming (CQP) problem:
$$\min_{w, b, \varepsilon}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \varepsilon_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \varepsilon_i,\ \ \varepsilon_i \ge 0,\ \ i = 1, \ldots, m$$
In the optimization problem given above, the first part, $\frac{1}{2}\|w\|^2$, positions the hyperplane so that the margin is maximized, while the second part, $C \sum_{i=1}^{m} \varepsilon_i$, relocates the hyperplane in order to minimize the number of misclassified training set instances. The parameter C defines the trade-off between those two competing conditions. By using Lagrange multipliers, the original CQP problem is transformed into its dual QP form, which is easier to solve:
$$\min_{\alpha}\ \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j w_{ij} - \sum_{i=1}^{m} \alpha_i \quad \text{subject to} \quad 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{m} y_i \alpha_i = 0$$
In the dual QP problem definition above, W is a symmetric positive semidefinite m × m matrix. Besides the input instance labels $y_i$, $y_j$, each element $w_{ij}$ of W is calculated by using a kernel function K, that is, $w_{ij} = y_i y_j K(x_i, x_j)$. Some of the most popular kernel functions are the linear, polynomial, radial basis function (RBF) and sigmoid kernels. An efficient algorithm for solving the QP problem, called Sequential Minimal Optimization (SMO), is proposed in [62]. Once the training phase is completed, the majority of the Lagrange multipliers $\alpha_i$ will be zero, and the only non-zero multipliers will be the ones corresponding to the so-called support vectors.
These are the input instances that are located closest to the hyperplane, on both the "positive" and the "negative" side. The number of support vectors (l) is usually significantly smaller than the total number of training instances: l << m. In the predicting phase, only support vectors are used for the classification of a new input instance:
$$V(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$$
where $x_i$ is a support vector and x is the input instance to be classified. If the result of V(x) is negative, the input instance is classified as class −1; otherwise, it is classified as +1. A generalization of the presented mathematical models and methods, which allows for multi-class prediction, is proposed in [63].
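The predicting phase can be illustrated with the following Python sketch. It assumes a linear kernel and a decision function of the form "sum over support vectors of alpha_i * y_i * K(x_i, x), plus b"; each support vector is stored as a dictionary from attribute index to non-zero value, which is a representation chosen for this illustration rather than the accelerator's format. With this storage, the number of multiplications per support vector equals its number of non-zero attributes.

```python
from typing import Dict, List, Sequence

SparseVector = Dict[int, float]  # attribute index -> non-zero value


def sparse_dot(sv: SparseVector, x: Sequence[float]) -> float:
    """Linear kernel between a sparse support vector and a dense instance."""
    return sum(value * x[idx] for idx, value in sv.items())


def svm_decision(
    support_vectors: List[SparseVector],
    alphas: Sequence[float],
    labels: Sequence[int],
    bias: float,
    x: Sequence[float],
) -> float:
    """Compute V(x) as the weighted sum of kernel values plus the bias."""
    total = bias
    for sv, alpha, y in zip(support_vectors, alphas, labels):
        total += alpha * y * sparse_dot(sv, x)
    return total


def svm_classify(support_vectors, alphas, labels, bias, x) -> int:
    """Return -1 when V(x) is negative and +1 otherwise."""
    return -1 if svm_decision(support_vectors, alphas, labels, bias, x) < 0 else 1


if __name__ == "__main__":
    svs = [{0: 1.0, 2: 2.0}, {1: 1.0}]
    print(svm_classify(svs, [0.5, 1.0], [1, -1], 0.25, [2.0, 3.0, 1.0, 9.0]))
```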
In order to decrease the number of multiplications during the predicting phase, Algorithm 6, presented below, will eliminate some of the input instance attributes from the input instance set. This will be done by setting "the least significant" attributes to zero, resulting in the reduction in the multiplication number during the input instance classification.
Algorithm 6 first trains an SVM using the original input dataset and then evaluates the model (lines 1-2). At line 3, the initial value of the target sparsification percentage is set using the MIN_SPARS_PERC parameter. The main loop (lines 6-22) is executed until the accuracy of the sparsified model drops below the tolerated lower threshold, calculated from the accuracy of the non-sparsified model. Within the loop, at line 8, the algorithm sorts the input dataset in descending order with respect to the number of non-zero attributes within each input instance. As a result, the first input instance in the dataset is the one with the largest number of non-zero attributes. The algorithm then calculates current_sparse_perc at lines 9-11, prior to entering the inner loop (lines 12-18). In this loop, the algorithm starts by selecting the first input instance from the dataset. Recall that this is the input instance with the largest number of non-zero attributes. Then, "the least significant" attribute of that input instance, selected as the one with the smallest absolute value, is set to zero at lines 14-15. After eliminating the least significant attribute, current_sparse_perc is updated (line 16) and input_instances is re-sorted, since the input instances within the dataset have been modified. This inner loop is repeated until the desired target_sparse_perc is reached. When the loop completes, standard SVM training is invoked on the sparsified input dataset. As a result, the trained SVM consists of sparse support vectors, eventually leading to improved performance of the hardware accelerator that implements such an SVM. Please note that, similar to the pruning/sparsification of ANNs and DTs, sparse_svm is updated based on temp_sparse_svm only when the current sparsified predictive SVM model has acceptable accuracy (beginning of the main loop, line 7).
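The inner attribute elimination loop can be sketched in Python as follows. This is an illustration under assumptions rather than a transcription of Algorithm 6: instead of physically re-sorting the dataset, the sketch picks the first instance with the largest number of non-zero attributes, the least significant attribute is taken to be the non-zero attribute with the smallest absolute value, and the function name is hypothetical. SVM training and accuracy checking, which surround this loop in the text, are left out.

```python
from typing import List


def sparsify_attributes(
    instances: List[List[float]], target_sparse_perc: float
) -> List[List[float]]:
    """Zero attributes until target_sparse_perc percent of all values are zero."""
    data = [list(row) for row in instances]  # keep the caller's data intact
    total = sum(len(row) for row in data)
    zeros = sum(1 for row in data for v in row if v == 0.0)
    while total and zeros * 100 < target_sparse_perc * total:
        # Instance with the most non-zero attributes (first one on ties).
        row = max(data, key=lambda r: sum(1 for v in r if v != 0.0))
        nonzero = [i for i, v in enumerate(row) if v != 0.0]
        if not nonzero:
            break  # everything is already zero
        # Least significant attribute: smallest absolute non-zero value.
        idx = min(nonzero, key=lambda i: abs(row[i]))
        row[idx] = 0.0
        zeros += 1
    return data


if __name__ == "__main__":
    dataset = [[1.0, 0.5, 3.0], [0.0, 2.0, 0.1], [4.0, 0.0, 0.0]]
    print(sparsify_attributes(dataset, 60))
```

Always thinning the densest instance first tends to even out the number of non-zero attributes across instances, and therefore across the support vectors selected from them.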

Sparse Reconfigurable Machine Learning Classifier
In order to benefit from sparse classifier models, a dedicated hardware accelerator, called the SRMLC (sparse reconfigurable machine learning classifier), is developed. A proposed architecture was inspired by a previously published accelerator [33], which can be reconfigured to support different classifier types, but cannot take advantage of sparse classifier models. The SRMLC, proposed in this study, is a novel architecture, which is designed and optimized to work with such sparse classifiers.
The proposed SRMLC architecture has a reduced instance processing latency and enhanced throughput when compared to the RMLC architecture published in [33]. The instance processing latency is reduced as a result of parallel and simultaneous DOT product calculation. Additionally, as a consequence of sparse data processing, outputs are calculated faster, since a significant number of DOT product multiplications are skipped, which results in a higher throughput.
The presented SRMLC architecture contains three main hardware modules, as shown in Figure 2. The first module (TOP-CTRL) controls the operation of the complete accelerator design. After reading configurations from the memory through the AXI-Full interface, TOP-CTRL configures other blocks via dc-config-bus and rc-cfg-bus interface busses, depending on the specific configuration.
The main CPU in the system can additionally configure the accelerator core through the AXI-lite interface. The configuration registers related to the control of the entire accelerator are located in the CONFIG-REGS module. One of the RMLC core features, especially important for understanding the section with the experimental results, is to fetch multiple input instances as a single memory block in a so-called batch mode. With this approach, the impact of DRAM memory latency is diminished and higher throughput is achieved.
For all types of supported predictive models, the second module (CALC) calculates DOT products, which are the core operations during the input instance classification. The CALC module receives the input instance from TOP-CTRL block. The output of this module is an array of calculated DOT products, which will be used by the RECONF module.
The third module (RECONF) is a reconfigurable block, which, according to the configuration, determines the type of currently active classifier. The RECONF module receives, as an input, an array of calculated DOT products from the CALC module and uses them to compute the classification results at the output.
The architecture of the CALC module is shown in Figure 3. The core of the CALC module is the array of DOT module instances, each of which calculates a single DOT product between the input data vector and the coefficient/weight vector defined by the predictive model in use (DT, SVM or ANN). As the SRMLC accelerator core can be parametrized, $L_d$, the number of DOT module instances within the array, is one of the most important system parameters. Each DOT module within the array can be configured through the configuration bus (dc-config-bus). The input data vector is transferred to the DOT modules via the CALC module controller (CALC-CTRL). It is actually transferred in segments, as explained later in the text.
The architecture of the DOT module is shown in Figure 4. Please note that all submodules in the figure are shown conceptually, since the actual implementation contains more than one level of pipeline processing. The DOT module is the most important module for the system performance and, in most configurations, it also utilizes the most hardware resources. In order to support sparse data processing, the configuration of each DOT module consists of classifier weights and increment values, stored within the internal DOT module memory (CFG-MEM). Classifier weights are the coefficients that are used in DOT products for a given classifier type: weights in MLP ANNs, sparse hyperplane coefficients in DTs and sparse support vector coefficients in SVMs. Due to the compression of classifier weights, increment values are also stored in CFG-MEM to enable the reconstruction of the actual positions within the original dense weight/coefficient vector. For example, assume that 10 pruned/sparsified classifier weights are 10, 0, 0, 25, 0, 0, 0, 0, 129, 38. For such a small segment of classifier weights, CFG-MEM would store only the four non-zero weights (10, 25, 129 and 38), each together with its increment value. Please note that due to the narrower dynamic range of increment values, they can be represented with fewer bits.
Specifically, in the SRMLC architecture, weights are stored as 16-bit wide words and increments are stored as 4-bit wide nibbles, which results in the reduction in memory required for storing predictive model weights.
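The storage scheme can be illustrated with a small Python sketch based on the 10-weight example from the text. The exact increment convention is not specified here, so the sketch assumes one plausible choice: each increment is the distance from the position of the previous non-zero weight, with the position counter starting at zero. How gaps larger than a 4-bit increment can express are handled is also unknown; the sketch simply raises an error in that case. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

WEIGHT_BITS = 16                    # weight word width stated in the text
INCR_BITS = 4                       # increment nibble width stated in the text
MAX_INCR = (1 << INCR_BITS) - 1     # largest increment a nibble can hold


def encode(dense: Sequence[int]) -> List[Tuple[int, int]]:
    """Encode a dense weight vector as (weight, increment) pairs (assumed format)."""
    entries = []
    prev = 0
    for idx, w in enumerate(dense):
        if w == 0:
            continue
        incr = idx - prev
        if incr > MAX_INCR:
            raise ValueError("gap too large for a 4-bit increment in this sketch")
        entries.append((w, incr))
        prev = idx
    return entries


def sparse_dot(entries: Sequence[Tuple[int, int]], x: Sequence[int]) -> int:
    """DOT product that multiplies only by the stored non-zero weights."""
    acc, pos = 0, 0
    for w, incr in entries:
        pos += incr          # reconstruct the position in the dense vector
        acc += w * x[pos]    # one MAC operation per stored weight
    return acc


def storage_bits(entries: Sequence[Tuple[int, int]]) -> int:
    """Bits needed for the compressed entries."""
    return len(entries) * (WEIGHT_BITS + INCR_BITS)


if __name__ == "__main__":
    dense = [10, 0, 0, 25, 0, 0, 0, 0, 129, 38]
    entries = encode(dense)
    print(entries, storage_bits(entries), len(dense) * WEIGHT_BITS)
```

Under these assumptions, the example segment needs 4 × 20 = 80 bits instead of 10 × 16 = 160 bits, and the DOT product takes 4 multiplications instead of 10. Since each stored weight costs 20 bits rather than 16, the compressed form is smaller only when more than one fifth of the weights are zero.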
CFG-MEM memory is a main consumer of the memory resources within the DOT module. It is modeled so that the highest possible speed can be achieved with FPGA implementation. Additionally, the model is such that it can be easily mapped to the memory resources of FPGA devices, which is why this module is the main consumer of memory resources. The module also contains the control logic necessary to control multiple phases of a pipeline processing. Figure 5 shows how the compressed classifier weights are stored in CFG-MEM memory. Within each memory location, a single non-zero classifier weight (denoted by V), and its corresponding increment value (denoted by I), is stored. CFG-MEM memory is divided into a number of sections. In the case of processing SVMs, each section holds the non-zero coefficients of a single support vector. In the case of processing MLP ANNs, each section holds the non-zero weights of one neuron. Finally, in the case of DT processing, one section holds the non-zero hyperplane coefficients of a single DT node. The section number is shown as the subscript of V and I in Figure 5, and the shaded locations indicate memory locations belonging to one section. The numbers shown below the V i and I i symbols indicate the index of a non-zero classifier weight and its corresponding increment value, within the current section. Please note that the width of the given section depends on the number of contained non-zero weights. Hence, it is not necessary that all sections within the CFG-MEM have the same width. The main component of the IFM-SEL module is the multiplexer, which selects the correct input vector value as an operand for multiplication. A multiplexer requires LUTs for implementation, so this module is the main consumer of LUTs in the FPGA implementation.
The outputs of the IFM-SEL module are used by the MAC-CALC module, which performs multiplication operations and accumulates the results in order to calculate the DOT product. The main goal when designing this module was to implement it within a single DSP block of the FPGA device, with the smallest possible utilized logic overhead. Therefore, this module is the main consumer of DSP blocks during FPGA implementation of the accelerator core.
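The interplay of CFG-MEM, IFM-SEL and MAC-CALC can be sketched in software as follows. This is an illustrative model only, under two assumptions that the text does not state: each increment is taken to be the index step from the previously selected input attribute, and gaps larger than a 4-bit increment can express are bridged with zero-valued padding entries. The names `compress` and `sparse_dot` are hypothetical.

```python
from typing import List, Sequence, Tuple

MAX_INCREMENT = 15  # largest value that fits into a 4-bit nibble


def compress(weights: Sequence[int]) -> List[Tuple[int, int]]:
    """Encode one CFG-MEM section as (V, I) pairs."""
    section = []
    position = 0  # input index selected by the previous non-zero weight
    for index, value in enumerate(weights):
        if value == 0:
            continue
        gap = index - position
        while gap > MAX_INCREMENT:
            # Assumed escape: a zero-valued entry that only advances the index.
            section.append((0, MAX_INCREMENT))
            gap -= MAX_INCREMENT
        section.append((value, gap))
        position = index
    return section


def sparse_dot(section: Sequence[Tuple[int, int]], inputs: Sequence[int]) -> int:
    """Compute the DOT product using only the stored non-zero weights."""
    position = 0
    accumulator = 0
    for value, increment in section:
        position += increment                    # IFM-SEL: pick the input operand
        accumulator += value * inputs[position]  # MAC-CALC: multiply-accumulate
    return accumulator
```

In this model the number of multiply-accumulate steps equals the number of stored entries, which is why the processing time of a section shrinks together with the number of non-zero weights.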
Each of the submodules within the DOT module is designed to consume the critical FPGA resources (DSP blocks, memory and LUT-based random logic) in a balanced way. With this approach, the entire architecture scales well as the available hardware resources of the FPGA device increase, since all resources are consumed evenly. This is in stark contrast to the RMLC architecture, which failed to balance the utilization of memory resources and, therefore, scaled poorly.
Figure 6 shows the architecture of the RECONF module. The SHIFT-CUT block receives the array of calculated DOT product values and stores them internally, allowing the DOT modules to process new input vector values, which results in a specific coarse-grained parallelism within the SRMLC architecture. Buffered values are stored within a shift register large enough to hold $L_d$ DOT product values. The lowest $L_r$ of those $L_d$ values are passed to the computing lanes (RC-LANE), where $L_r$ is the number of computing lanes used within the design and is an additional SRMLC configuration parameter. After distributing $L_r$ values to the appropriate computing lanes, the SHIFT-CUT block shifts the contained DOT products in order to prepare the next batch of $L_r$ values for processing. Each RC-LANE can perform multiple functions, depending on the configuration of the SRMLC accelerator. Figure 7 shows that the RC-LANE module contains several pipeline stages, each of which is configurable via the configuration bus. Depending on the current configuration, the RC-LANE can perform non-linear function evaluation, multiplication or a shift of the contained value in order to adjust its fixed-point number format. Each of these stages can be skipped if it is not needed for the calculation. The reduction submodule of the RECONF module (RC-REDUCE) receives all outputs of the RC-LANEs and performs different post-processing steps, depending on the current configuration. When the final stages of SVMs are processed, it accumulates the received values. 
In the case of MLP classifier processing, the RC-REDUCE module will pass calculated values from neurons or accumulate values depending on the configuration. At the end, the RC-REDUCE module is in charge of the selection process within decision trees, selecting the right or the left subtree, based on a previously calculated DOT product.
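The behaviour of the SHIFT-CUT block can be mimicked with a few lines of code. This is a simplified software sketch, assuming that the register is drained in batches of $L_r$ values and that a final partial batch is allowed; the function name `shift_cut` is hypothetical.

```python
from typing import Iterator, List, Sequence


def shift_cut(dot_products: Sequence[int], num_lanes: int) -> Iterator[List[int]]:
    """Yield the lowest `num_lanes` buffered values, then shift the register."""
    register = list(dot_products)        # snapshot frees the DOT modules for new input
    while register:
        yield register[:num_lanes]       # values handed to the RC-LANEs
        register = register[num_lanes:]  # shift out the batch that was just processed
```

For example, with eight buffered DOT products and four lanes, the block hands over two batches of four values each, so the RECONF side needs two steps for every step of the CALC side.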
Next, we show how the RC-LANE modules are configured in order to process SVM, DT or MLP ANN classifier types. Please note that, for simplicity, Figures 8-10 show the required configuration of a single RECONF block. In the examples below, active blocks within the RECONF module are shaded.
Figure 8 presents the configuration of one RECONF module in the case of DT processing. The figure also shows an example DT, as well as the way the DT is mapped to a corresponding configuration stored inside the RC-REDUCE module. In this configuration, each DOT module computes the DOT product of the input instance and the assigned DT node hyperplane coefficients and outputs the result to the RECONF module. The hyperplane coefficients of the selected DT nodes are stored inside the CFG-MEM module and are labeled nd0, nd1, ... in Figure 8. In this configuration, RC-LANE modules only adjust the number formats of the computed DOT products. The RC-REDUCE module takes the computed DOT products and iteratively compares them with the appropriate threshold values, $t_i$, while traversing the DT from the root node until a leaf node is reached.
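The traversal performed by RC-REDUCE in DT mode can be sketched as follows. This is a minimal illustration, assuming that all node DOT products are already available and that the outcome of the comparison $d_i > t_i$ selects one of two stored successors; the `Node` layout, the branch direction and the example tree are hypothetical and are not taken from Figure 8.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Node:
    threshold: float
    if_greater: Union[int, str]  # next node index, or a class label at a leaf
    otherwise: Union[int, str]


def classify(nodes: Sequence[Node], dots: Sequence[float]) -> str:
    """Walk from the root, comparing each visited node's DOT product to its threshold."""
    current = 0
    while True:
        node = nodes[current]
        successor = node.if_greater if dots[current] > node.threshold else node.otherwise
        if isinstance(successor, str):
            return successor  # leaf reached: value driven on val_out
        current = successor


# Hypothetical three-node tree with four leaves.
EXAMPLE_TREE = [
    Node(0.5, 1, 2),
    Node(-1.0, "c0", "c1"),
    Node(2.0, "c2", "c3"),
]
```

Only the DOT products of the visited nodes influence the result, but computing all of them in parallel keeps the traversal itself down to one comparison per tree level.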
In the example DT from Figure 8, the classification of an input instance proceeds as follows. In the first step, the DOT product d0 is compared to the threshold value t0. Depending on the outcome, the RC-REDUCE module selects which DOT product to use for the following comparison. For example, let us assume that the result of the d0 > t0 comparison is such that the DT node n1 should be visited next. This means that the RC-REDUCE module will next compare the DOT product d1 with the threshold t1. This procedure is repeated until RC-REDUCE reaches a leaf node. This completes the classification of the current input instance, and the computed class membership value is transmitted through the val_out output port. For example, if the result of the d1 > t1 test leads to the c0 leaf node, the classification of the input instance is completed and the RC-REDUCE module outputs the c0 value on the val_out output port.
Figure 9 shows one non-linear SVM configuration. In this configuration, each DOT module calculates the DOT product of the input instance and each support vector from the subset associated with that DOT module. The subset of support vectors is stored within the CFG-MEM module and is labeled sv0, sv1, ... in Figure 9. The calculated DOT values are passed from the CALC module to the RECONF module. In Figure 9, these values are designated as $d_i$, $i = 0, 1, \ldots, L_r - 1$. When operating in SVM mode, RC-LANE modules use the non-linear memory to calculate the kernel function specified by the user. The values obtained after calculating the non-linear function are labeled $a_i$, $i = 0, 1, \ldots, L_r - 1$. The MUL submodule is used to multiply the non-linear values by the Lagrange multipliers, $\alpha_i$, $i = 0, 1, \ldots, L_r - 1$. SHIFT modules are used to adjust number formats in order to minimize the quantization loss during the calculation. The values $k_i$, $i = 0, 1, \ldots, L_r - 1$, obtained after multiplication by the Lagrange multipliers, are sent to the RC-REDUCE module, which accumulates them in this operating mode. In this way, the accumulated sum of the values $k_i$ is obtained. Additionally, this module adds the offset value to the accumulated sum and compares the result with zero to obtain the final SVM classification result for the current input instance. Lagrange multipliers, kernel function samples and the offset are an integral part of the SVM architecture configuration and are set by the TOP-CTRL module.
Figure 10 shows the SRMLC configuration for processing a sparse MLP classifier with a single hidden layer. As shown in Figure 10, when the architecture is configured to work as an MLP, each DOT module calculates the DOT products of the input instance and the neuron weights for each neuron from the subset associated with that DOT module. The subset of neuron weights is stored inside the CFG-MEM module and is labeled n0, n1, ... in Figure 10. The calculated values are forwarded to the RECONF module and are marked with $d_i$, $i = 0, 1, \ldots, L_r - 1$ in Figure 10. RC-LANE modules receive these values and pass them to the non-linear memory, where the samples of the specified activation function are stored. In this way, the output values $a_i$, $i = 0, 1, \ldots, L_r - 1$ are obtained after applying the activation function, as shown in Figure 10. These values are sent to the RC-REDUCE module, which only forwards them to the output without additional processing.
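The SVM and MLP data paths described above differ only in which RC-LANE stages are active and in what RC-REDUCE does with the lane outputs. The following sketch models both in floating point, assuming that the non-linear memory can be represented by an ordinary function call and that the class label sign is folded into each multiplier; the SHIFT stages are omitted, and the function names and the polynomial kernel in the usage note are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def svm_classify(
    dots: Sequence[float],
    multipliers: Sequence[float],
    offset: float,
    kernel: Callable[[float], float],
) -> int:
    """SVM mode: non-linear stage, MUL stage, then accumulation in RC-REDUCE."""
    k = [kernel(d) * alpha for d, alpha in zip(dots, multipliers)]  # RC-LANE outputs k_i
    decision = sum(k) + offset                                      # RC-REDUCE accumulates
    return 1 if decision > 0 else -1                                # comparison with zero


def mlp_layer(dots: Sequence[float], activation: Callable[[float], float]) -> List[float]:
    """MLP mode: non-linear stage only; RC-REDUCE forwards the values unchanged."""
    return [activation(d) for d in dots]
```

For instance, `svm_classify([1.0, 2.0], [0.5, -0.25], 0.1, lambda d: (d + 1.0) ** 2)` evaluates an assumed second-degree polynomial kernel on two support vectors, while `mlp_layer(dots, math.tanh)` applies an assumed tanh activation to one layer of neuron sums.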

Analytical Model of Instance Processing Throughput of SRMLC Architecture
In this section, we will provide an analytical model of the SRMLC architecture that shows the number of cycles required to classify a single input instance. The analysis will focus on the instance processing throughput of the architecture, so the latency required to deliver the data to the computing module will be neglected. The following description of the analytical model corresponds to the operation of the SRMLC architecture in a batch mode.
The architecture contains two main computing modules: the CALC module and the RECONF module. These two modules work in parallel, organized as a two-stage pipeline. Please note that both the CALC and RECONF modules internally also use pipelining. For different configurations of the SRMLC architecture, one of these modules will be a processing bottleneck during the instance classification.
Let $N_{proc}$ be the number of tree nodes, the number of support vectors or the number of neurons, depending on the architecture configuration (DT, SVM or MLP). Let $L_d$ be the number of DOT modules available in the current configuration of the SRMLC architecture. When $N_{proc} \le L_d$, $N_{proc}$ DOT modules are active in parallel and the current input instance is processed in a single run. However, if $N_{proc} > L_d$, the current input instance is processed in $\lfloor N_{proc}/L_d \rfloor$ iterations during which all $L_d$ DOT modules are engaged, followed by a last iteration in which the number of active DOT modules, $D$, is given by Equation (8).
Let $L_r$ be the number of RC-LANE modules within the RECONF module, and let $N_r$ be the number of cycles the RECONF module needs to perform a single instance classification. The architecture should always be configured so that the relation $L_d > L_r$ holds; $N_r$ is then given by Equation (9).
Let $N_{nzv}$ be the number of non-zero weights and let $N_c$ be the number of cycles required by the DOT module to compute all DOT products for the current instance. Equation (10) expresses $N_c$ and includes the three-cycle delay of the pipelined processing. If $N_a$ is the number of input instance attributes and $P$ is the factor with which the classification models are pruned (the percentage of weights that are set to zero), Equation (11) follows.
Since the CALC and RECONF modules are connected in a pipeline, the number of cycles, $N$, required for the SRMLC architecture to classify a single input instance is determined by the slower of the two modules (Equation (12)). The throughput with which the SRMLC architecture can process input instances is calculated from $N$ and $f$ in Equation (13), where $f$ is the operating frequency of the SRMLC accelerator. From the above analysis, it can be concluded that, in order to obtain a higher throughput by increasing the pruning factor, the architecture needs to be configured so that $N_c$ becomes the dominant factor in Equation (12).
For fixed architecture parameters ($L_d$ and $L_r$), the values of $N_r$ and $N_c$ change as staircase functions of $N_{proc}$: $N_r$ increases by 1 whenever $N_{proc}$ increases by $L_r$, while $N_c$ increases by $(1 - P)N_a$ whenever $N_{proc}$ increases by $L_d$. As a result, the value of $N_r$ increases more frequently, but in smaller steps. If the relation $N_c > N_r$ is to hold, the worst-case scenario occurs when $N_{proc}$ reaches the value that causes the update of $N_c$. That is the case when $D = L_d$, for which Equation (9) takes the form of Equation (15). From Equations (11) and (15), we can derive the condition under which the CALC module is the real processing bottleneck, meaning that the sparsification/pruning factor has an impact on the processing throughput of the accelerator:
$$N_c > N_r \qquad (16)$$
If the resulting condition, Equation (19), holds, the sparsification/pruning factor $P$ increases the throughput for the given architecture parameters ($L_d$ and $L_r$) and the problem instance ($N_a$). In the experimental results (Section 4), the SRMLC architecture is configured so that Equation (19) holds for almost all cases.
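The qualitative behaviour of this model can be explored with a small script. This is a sketch under explicit assumptions rather than a transcription of the numbered equations: it takes $N_{nzv} = (1 - P)N_a$, $N_c = \lceil N_{proc}/L_d \rceil N_{nzv} + 3$, $N_r = \lceil N_{proc}/L_r \rceil$, $N = \max(N_c, N_r)$ and throughput $f/N$, which reproduces the staircase behaviour described above. The function names and the 100 MHz clock used in the usage note are hypothetical.

```python
import math
from typing import Tuple

PIPELINE_DELAY = 3  # three-cycle pipeline delay mentioned in the text


def cycles_per_instance(
    n_proc: int, n_attr: int, pruning: float, l_d: int, l_r: int
) -> Tuple[int, int, int]:
    """Return (N_c, N_r, N) under the assumed staircase model."""
    n_nzv = round(n_attr * (1.0 - pruning))                  # non-zero weights per section
    n_c = math.ceil(n_proc / l_d) * n_nzv + PIPELINE_DELAY   # CALC stage cycles
    n_r = math.ceil(n_proc / l_r)                            # RECONF stage cycles
    return n_c, n_r, max(n_c, n_r)                           # slower stage sets the pace


def throughput(
    freq_hz: float, n_proc: int, n_attr: int, pruning: float, l_d: int, l_r: int
) -> float:
    """Instances per second, assuming one instance leaves the pipeline every N cycles."""
    _, _, n = cycles_per_instance(n_proc, n_attr, pruning, l_d, l_r)
    return freq_hz / n
```

With 64 DOT modules and 8 lanes, a model with 64 units and 32 attributes pruned by 75% gives $N_c = 11$ and $N_r = 8$ in this sketch, so the CALC module is the bottleneck and further pruning raises throughput. A model with 200 units and only 4 attributes gives $N_c = 11$ and $N_r = 25$ at 50% pruning, so the RECONF module dominates and increasing the pruning factor to 75% leaves $N$ unchanged.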

Experimental Results
In order to benchmark our approach, we conducted several experiments, and the results are presented in this section. In the first subsection, we show the results of ANN pruning using Algorithm 1 and the results of DT and SVM sparsification using Algorithms 2 and 6. In the second subsection, we compare our work with previously published work [33], while in the third subsection we present the comparison with embedded software implementations of ML predictive models.

Experiments for Pruning ANNs and Sparsification of DTs and SVMs
In order to benchmark the performance of Algorithms 1, 2 and 6, the UCI Machine Learning Repository datasets from Table 1 were used. The short names shown in Table 1 correspond to the names used in Figures 11-22. The TensorFlow framework [64] was used for evaluating Algorithm 1. Instances with missing values were removed from the datasets, and all results reported below are the averages of five ten-fold cross-validation experiments. This means that the original dataset $D$ is divided into 10 non-overlapping subsets, $D_1, D_2, \ldots, D_{10}$, which consist of uniformly selected instances from $D$. During each cross-validation iteration, an ANN is built using the instances from the $D \setminus D_i$ set and tested on the $D_i$ set ($i = 1, \ldots, 10$). By repeating this procedure five times, 50 ANNs are constructed in total for each dataset. Then, the average classification accuracy is calculated as the percentage of input instances that are correctly classified. In order to obtain the ANN pruning curve, which shows how the accuracy of a pruned ANN drops as the pruning factor increases, we slightly modified Algorithm 1. Algorithm 1, as presented in Section 2, exits the main loop as soon as the accuracy of the pruned ANN drops more than 1% below the absolute accuracy of the non-pruned ANN. Instead of applying this limit, for experimental purposes, we swept current_pruning_factor from MIN_PRUN_FACT to MAX_PRUN_FACT. In our experiments, the MIN_PRUN_FACT parameter was set to 5% and the MAX_PRUN_FACT parameter was set to 99%. Figures 11-14 show the results of training and pruning an ANN with a single hidden layer of 64 neurons on 15 datasets from the UCI repository. The presented charts show the impact of the pruning factor (X-axis) on the classification accuracy (Y-axis), where the name of the dataset is given above each chart. 
The blue line shows the accuracy of the pruned ANN, which depends on the pruning factor and eventually drops, while the red dashed line shows the lower limit of the pruning tolerance, drawn 1% below the absolute accuracy of the non-pruned ANN. Figures 11-14 show that a max_pruning_factor above 80% can be used with the accepted accuracy drop on 13 out of 15 datasets, which means that, most of the time, more than 80% of the ANN weights can be set to zero while staying within the tolerated 1% accuracy drop. During the training and pruning, several ANN architectures were used, containing multiple hidden layers and a varying number of neurons per hidden layer. Neither increasing the number of hidden layers nor increasing the number of neurons within the hidden layer above 64 resulted in better accuracy on the chosen datasets.
Figures 15-18 show the results of the sparse oblique DT induction, performed on the same subset of datasets from the UCI repository that was used for the training and pruning of ANNs. Once again, we slightly modified Algorithm 2 in order to obtain the sparsification curve. Hence, instead of exiting the main loop once the accuracy of the sparse DT drops below the tolerated lower accuracy limit, we swept current_spars_factor between MIN_SPARS_FACT and MAX_SPARS_FACT, which were set to 10% and 90%, respectively, in our experiments. In Figures 15-18, the blue line shows the accuracy of a sparse DT, which decreases as the sparsification factor increases (similar to the pruning of ANNs). The red dashed line shows the lower limit for the accepted accuracy, calculated as an absolute 1% drop from the accuracy of the non-sparse DT model. Although the sparsification factors are lower than for ANNs, the results show that a significant percentage of DT coefficients can be set to zero during induction with an acceptable accuracy drop. 
However, in contrast to ANN pruning, high sparsification factors could not be obtained during DT sparsification for most of the datasets, since the iterative evolutionary algorithm could not converge in those scenarios. This is the reason why, for some datasets, the MAX_SPARS_FACT of 90% could not be reached in Figures 15-18. Similar to the experimental results for MLP ANNs, all reported results are calculated as averages of five ten-fold cross-validation experiments.
Figures 19-22 show the results of the attribute-sparse SVM training, performed on the same 15 UCI datasets that were used for the training and pruning of the ANNs and the sparse induction of DTs. As with the other two algorithms, we had to modify Algorithm 6 in order to be able to draw the sparsification curve for SVM predictive models. Hence, instead of using the main loop exit condition at line 6, we swept target_spars_perc between MIN_SPARS_PERC, set to 5%, and MAX_SPARS_PERC, set to 95%. As in the previous charts, the blue line presents the accuracy of an attribute-sparse SVM for a given dataset, which decreases as the sparsification factor increases. The dashed red line shows the lower limit for the tolerated accuracy, calculated by subtracting the tolerated 1% absolute accuracy drop from the accuracy of the non-sparse SVM model. As the results show, a sparsification factor of at least 60% can be achieved for the majority of the used datasets. Similar to the previously presented results for ANN pruning and DT sparsification, the results shown here are calculated as averages of five ten-fold cross-validation experiments.
The results are summarized in Tables 2-4. For the dataset given in the first column, the achieved sparsification/pruning factor is shown in column 2. 
The accuracy of non-sparsified and sparsified models is presented in columns 3 and 4, respectively, while the last two columns show corresponding size, expressed as the number of model parameters (weights/coefficients).
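The way an achieved sparsification/pruning factor follows from a sweep curve and the 1% tolerance can be sketched as follows. This is an illustration only, assuming that the curve is scanned in order of increasing factor and that the scan stops at the first point below the limit, mirroring the loop exit condition described for the original algorithms; the function name and the sample curve in the usage note are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def max_tolerated_factor(
    curve: Sequence[Tuple[float, float]],
    baseline_accuracy: float,
    tolerance: float = 1.0,
) -> Optional[float]:
    """Return the largest factor reached before accuracy falls below the limit.

    `curve` holds (factor_percent, accuracy_percent) pairs sorted by factor.
    """
    limit = baseline_accuracy - tolerance  # the red dashed line in the charts
    best = None
    for factor, accuracy in curve:
        if accuracy < limit:
            break  # first violation ends the scan, as in the original loop exit
        best = factor
    return best
```

For a made-up curve `[(5, 90.0), (50, 89.5), (80, 89.1), (90, 88.0), (95, 89.2)]` and a baseline of 90.0%, the function returns 80, because the point at 90% falls below the 89.0% limit even though a later point recovers.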
From Tables 2-4, one can conclude that the ANN predictive model performs better than the other two with the average accuracy of 84.58% and the average maximum achieved pruning factor of 81.33%. For the SVM predictive model, an average accuracy on the given datasets is 79.64%, with an average maximum achieved sparsification factor of 57.67%. Finally, the DT classifier can be sparsified 61.71% on average, with a slightly worse average accuracy of 77.06%. Columns 3 and 4 from Tables 2, 3 and 4 also show that, for all three classifier types, even heavily sparsified predictive models can score a better predicting accuracy, when compared to a non-sparsified model, for some datasets.
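The evaluation protocol behind these averages, five repetitions of ten-fold cross-validation yielding 50 models per dataset, can be sketched with the standard library alone. This is a minimal sketch, assuming a fresh random shuffle for each repetition and folds formed by striding through the shuffled order; the function name and the seed are hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold(
    num_instances: int, folds: int = 10, repeats: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(num_instances))
        rng.shuffle(order)                 # uniformly selected, non-overlapping subsets
        for i in range(folds):
            test = order[i::folds]         # the held-out subset D_i
            held_out = set(test)
            train = [j for j in order if j not in held_out]  # D without D_i
            yield train, test
```

Averaging the test accuracy over all 50 yielded splits gives one number per dataset and per sparsification factor, which is the quantity plotted in the charts and reported in the tables.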

Comparison with RMLC Architecture
In order to compare the resource utilization and scalability of the RMLC and SRMLC architectures, both were implemented using FPGA technology. Xilinx Vivado Design Suite [65] was used for the implementation of both architectures, targeting the ZU9 FPGA device, with default settings for synthesis and implementation. In order to measure the achievable instance processing performance, the Zynq Ultrascale+ MPSoC ZCU102 Evaluation Board [66] was used as the test platform. Please note that the ZCU102 evaluation board was chosen due to its availability. However, as the results presented in Table 5 illustrate, smaller instances of the SRMLC architecture can easily fit even entry-level FPGA devices.
Table 5 presents the dependence of the power consumption on the number of multiply-accumulate blocks used within the design. The first column (DOT#) shows the number of multiply-accumulate blocks used, the second one shows the operating frequency after synthesis and the following three columns show the usage of LUTs, BRAMs and DSPs, respectively. The last column (power) shows the estimated power consumption of the implementation. From Table 5, it can be seen that the SRMLC architecture is highly scalable. The smallest SRMLC instances (with 32 and 64 DOT modules) can be fitted into cost-effective FPGA devices from the Spartan-7, Kintex-7 and Zynq-7000 families. If more performance is required, larger SRMLC instances can be used.
For comparison purposes, Table 6 shows the resource utilization after the FPGA implementation of the RMLC [33]. The first column (RB#) shows the number of reconfigurable blocks used in the RMLC architecture, which have a similar function to the DOT modules in the SRMLC architecture. As can be seen from Tables 5 and 6, the SRMLC architecture presented in this study provides significantly better scalability than the RMLC architecture. In the SRMLC, a single DOT unit requires only 1 DSP block, 1 BRAM and 300 LUTs, which ensures a balanced utilization of the available FPGA resources. This is not the case for the architecture proposed in [33], where the excessive utilization of BRAM blocks is the limiting factor for scalability. This is clearly seen from Table 6, where the largest instance of the RMLC architecture that can fit inside the ZU9 FPGA device has only 96 RB units. This is significantly less than the largest SRMLC instance that fits inside the ZU9 FPGA, which contains 448 DOT units, as can be seen from Table 5. This improved scalability on FPGA platforms was one of the main architecture design goals during the development of the SRMLC.
The ASIC implementation layout is shown in Figure 23, while the summary is provided in Table 7 for comparison purposes. Our block-level implementation was developed in the Cadence Genus [67] and Innovus [68] tools, using 40 nm TSMC process standard libraries.
A scalability comparison of the two architectures is shown in Figure 24. All architecture configurations are implemented for the ZCU102 development board. We set the smallest configurations to have 32 RB/DOT modules and then increased the number of RB/DOT modules in steps of 32. For the RMLC configurations we stopped at 96, as this is the largest configuration that can fit on a ZCU102 development board. The graphs also show the limits for different FPGA devices from the Xilinx Ultrascale+ family. 
When the graph intersects one of these lines, it represents the largest possible number of RB / DOT modules that can be implemented on the corresponding FPGA device.
The weak point of the RMLC architecture is the consumption of BRAM resources. Therefore, we quickly approach the point on the graph (Figure 24) where the limit for the XCZU9EG device is crossed (XCZU9EG FPGA is used on the ZCU102 development board). As seen from Figure 24, all configurations for the SRMLC architecture can be implemented on all FPGA devices. The SRMLC architecture allows a significantly larger number of computing blocks to be accommodated because the resources required for implementation are balanced significantly better.

Comparison of the Achieved Throughput
From Table 8, we can see that the throughput of the proposed architecture is significantly improved, compared to RMLC architecture. The SRMLC architecture was parameterized so that the CALC module had 64 DOT modules, while the RECONF module had 8 RC-LANE modules. The RMLC architecture was parameterized to have 64 reconfigurable blocks.
For DTs, throughput is from 1.167 to 2 times higher (1.58 times on average), for SVMs it is from 0.467 to 2.3 times higher (1.65 times on average) and for MLP ANNs it is from 8.381 to 38 times higher (15.2 times on average), all compared to the corresponding RMLC implementations [33]. The results in Table 8 show, assuming that architectures contain the same number of compute blocks, so called RBs in RMLC and DOTs in SRMLC, that better throughput for DTs and SVMs in the SRMLC is achieved through the classifier sparsification.
When considering SVM models, for some datasets there was a decrease in throughput, as can be seen from Table 8. In these test cases, the bottleneck of the system is the RECONF module, which is not configured to have sufficient width. This can be seen from Equation (12) as the case where the value of N r is higher than the value of N c ; hence, Equation (19) does not hold. It can be seen that this happens rarely and only for small datasets, showing that it is not a good compromise to spend additional resources to expand this module for scenarios that seldom happen. Additionally, the results show that for MLP models the throughput is more improved. The reason is that the blocks in the RMLC architecture processed whole layers, while in the SRMLC architecture they processed individual neurons. This allows more than one neuron to be processed simultaneously when using the SRMLC architecture, which significantly contributes to increasing the architecture throughput.
The SRMLC architecture has another significant advantage, which these results do not show. When we implement both architectures (RMLC and SRMLC) on the same FPGA device, due to better SRMLC architecture scalability, it is possible to instantiate more DOTs, compared to RBs from RMLC architecture.

Comparison of the Achieved Processing Latency
As can be seen from Table 9, the processing latency, one of the most important performance metrics for most applications, is drastically improved in the SRMLC when compared to the previously published RMLC architecture [33]. For comparison purposes, the SRMLC architecture was parameterized so that the CALC module had 64 DOT modules, while the RECONF module had 8 RC-LANE modules. The RMLC architecture was parameterized to have 64 reconfigurable blocks. From Table 9, we can see that the latency during the processing of DTs in the SRMLC is reduced from 2.66 to 4.43 times (3.33 on average), for MLP ANNs the latency is reduced from 3.1 to 22.2 times (7.84 on average) and for SVMs the processing latency is from 14.9 to 84.1 times shorter (40.08 on average). This significant latency reduction is obtained through parallelization and sparsification. Most of the latency reduction comes from the implementation in which a large number of DOTs process a given input instance in parallel. This approach is in complete contrast to the RMLC architecture [33] and leads to significantly improved latency for all classifier types. Sparsification has less effect on improving latency than parallelization. The impact that sparsification has on improving throughput is proportional to the impact it has on reducing latency. For example, if the throughput is improved two times and the latency is reduced eight times, the sparsification improved the latency two times, while the additional improvement of four times can be attributed to parallelization. Table 10 shows the reduction in memory requirements for storing the classifier coefficients needed by the SRMLC, when compared to the RMLC architecture proposed in [33]. 
Negative values in Table 10 indicate that the SRMLC requires fewer memory resources compared to RMLC [33], while positive values indicate that, for the corresponding datasets, the SRMLC requires additional memory resources. As it can be seen from Table 10, a memory reduction for DTs can be positive, at least for some datasets used in the experiments, which can be explained by the fact that sparsification of hyperplane coefficients leads to larger and "deeper" DTs, with a large number of nodes, where each node must store one hyperplane coefficients set. Table 10 also shows that for the majority of datasets, especially for SVMs and MLP ANNs, the required memory for storing model parameters is reduced and for several datasets it is severely reduced.

Comparison of the Energy Consumption
In the case of SVM classifiers, memory usage is generally improved, sometimes drastically, except in some cases. In these cases, the level of sparsification is small, so the extra space needed to accommodate the increments actually increases memory usage.
As a consequence of the model parameters sparsification, the energy consumption is significantly reduced as well. It is a well known fact from the available literature that data transfers from external DRAM are the most expensive in terms of the energy consumption [39]. As a result, reduced storage requirements in the SRMLC lead to significant energy saving, compared to solutions where dense data representation is used.
Due to the reduced number of DRAM accesses, as a result of sparsification, the energy consumption for DRAM data transfers is reduced from −49.88% up to 50.16% in the case of DT processing (5.16% on average), from −9.83% up to 93.65% in the case of SVM processing (48.22% on average) and from 25.07% up to 93.75% in the case of ANN processing (76.69% on average), compared to the scenario when non-sparsified models are being processed.

Comparison with Embedded Software Implementation
Tables 11-13 present the processing latency of different classifiers when they are executed on the SRMLC hardware accelerator, compared to their software implementations executed on the embedded processor. All three embedded software applications, providing the results in Tables 11-13, were developed as GCC embedded Linux applications. The hardware platform used for testing was the ZCU102 evaluation board [66], so all embedded applications were executed on a quad-core Arm® Cortex®-A53 processor. The DT benchmarking application was implemented as a plain GCC application, without the usage of any specific libraries, where all underlying data structures were developed from scratch. The SVM benchmarking GCC application was based on the LIBSVM library for Support Vector Machines [69], while the GCC embedded application for benchmarking ANNs used the Tensorflow-lite framework [70] to model MLP ANNs. Column Dtst shows the dataset that was used. Tests were conducted on different numbers of input classification instances, which are shown in column inst#. SWmin, SWmax and SWavg stand for the minimum, maximum and average processing latency of classifying one instance when executed in software, respectively. HWb shows the instance processing latency when the classifier is running on the SRMLC in the batch mode of processing, in which multiple input instances are read from external DRAM together, which significantly reduces the classification time. For the chosen datasets, the size of the batch was set to the maximum for the corresponding experiments, which means that it was equal to the value of inst#. On the other hand, the HW column presents the processing latencies of the classifier running on the SRMLC in normal mode, where a single input instance is fetched each time before starting the classification. All processing latencies presented in Tables 11-13 are expressed in nanoseconds. 
The last two columns show the gain of running the classifier on the SRMLC, compared to the corresponding embedded software implementations, where the Gb column shows the gain in batch-mode processing latency and the G column presents the normal-mode gain. Table 11 shows that the processing latency gain of the SRMLC-accelerated DT classifier in normal mode ranges from 3.98 up to 14.99 times (7.11 on average). Similarly, from Table 12 it can be seen that, when run on the SRMLC in normal mode, the processing latency gain of the SVM classifier ranges from 11.34 up to 25.46 times (17.05 on average). Finally, Table 13 shows that the MLP ANN classifier, run in normal mode on the SRMLC accelerator, outperforms the one running in software by 10.99 up to 31.88 times (16.74 on average), when the processing latency gain is used as the comparison metric. As can be seen from the Gb columns in Tables 11-13, when the classifiers are run on the SRMLC accelerator in batch mode, the gain is even higher: from 20.61 up to 77.35 times in the case of DTs.
As can be seen from the results presented in Tables 11-13, the SRMLC architecture offers high performance when compared with traditional software implementations of ANN, DT and SVM classifiers. There are several reasons why this was achieved. The SRMLC architecture uses pipelining at several levels to increase the instance processing throughput. Similarly, the parallelization of the main operations, implemented within the array of DOT modules, yields an additional increase in the processing throughput and, even more importantly, a decrease in the instance processing latency. Finally, since the SRMLC is able to directly process sparsified ML models, an additional performance improvement was attained.
Apart from the obvious performance improvement due to the significant processing latency reduction, an additional benefit of using the SRMLC hardware accelerator is that the classification processing latency is deterministic, always taking exactly the same amount of time, which can be crucial for real-time applications in which the response latency is critical.

Conclusions
In this study, a universal reconfigurable hardware accelerator for sparse machine learning classifiers is proposed, which supports three classifier types: decision trees, artificial neural networks and support vector machines. While hardware accelerators for these classifiers can be found in the available literature, to the author's best knowledge, the accelerator presented in this study is the first that is optimized to work with sparse classifier models. Compared with hardware implementations of traditional predictive models, this results in higher throughput, reduced processing latency, a smaller memory footprint and decreased energy consumption due to reduced data movement. In order to benchmark the proposed hardware acceleration platform, we have also presented algorithms for the induction of sparse decision trees, the sparsification of support vector machines and the pruning of artificial neural networks. Experiments carried out on standard benchmark datasets from the UCI Machine Learning Repository database show that the proposed sparsification algorithms allow significant compression of the predictive models (on average 61.7% for decision trees, 39.12% for support vector machines and 81.3% for artificial neural networks) with a negligible drop in prediction accuracy. Experimental results also show that using such compressed classifier models increases throughput up to 38 times, decreases processing latency up to 84 times and reduces energy consumption for DRAM data transfers by up to 76.69%. As shown in the experimental results section, our hardware accelerator significantly outperforms machine learning classifiers implemented in embedded software. DT classification accelerated by the SRMLC is up to 77 times faster than the corresponding DT classification implemented in software. Similarly, when run on our accelerator, SVM classification is up to 85 times faster and MLP ANN classification is up to 62 times faster than the corresponding embedded software implementations.