Abstract
Embedded systems technology is undergoing a phase of transformation owing to novel advancements in computer architecture and breakthroughs in machine learning applications. The areas of application of embedded machine learning (EML) include accurate computer vision schemes, reliable speech recognition, innovative healthcare, robotics, and more. However, there exists a critical drawback in the efficient implementation of ML algorithms targeting embedded applications. Machine learning algorithms are generally computationally and memory intensive, making them unsuitable for resource-constrained environments such as embedded and mobile devices. In order to efficiently implement these compute and memory-intensive algorithms within the embedded and mobile computing space, innovative optimization techniques are required at the algorithm and hardware levels. To this end, this survey explores current research trends within this space. First, we present a brief overview of compute-intensive machine learning algorithms such as hidden Markov models (HMMs), k-nearest neighbors (k-NNs), support vector machines (SVMs), Gaussian mixture models (GMMs), and deep neural networks (DNNs). Furthermore, we consider different optimization techniques currently adopted to squeeze these computational and memory-intensive algorithms within resource-limited embedded and mobile environments. Additionally, we discuss the implementation of these algorithms in microcontroller units, mobile devices, and hardware accelerators. Finally, we give a comprehensive overview of key application areas of EML technology, point out key research directions, and highlight key take-away lessons for future research exploration in the embedded machine learning domain.
1. Introduction
Embedded computing systems are rapidly proliferating into every aspect of human endeavor today, finding useful application in areas such as wearable systems for health monitoring, wireless systems for military surveillance, networked systems as found in the internet of things (IoT), smart appliances for home automation, and antilock braking systems in automobiles, amongst others [1]. Recent research trends in computing technology have seen a merger of machine learning methods and embedded computing for diverse applications. For example, to cope with the hostility and dynamism of mobile ad hoc networks (MANETs), Haigh et al. [2] explored enhancing the self-configuration of a MANET using machine learning techniques. Besides, the recent breakthroughs of deep learning models in application areas such as computer vision [3,4,5,6,7], speech recognition [8,9], language translation and processing [10,11], robotics, and healthcare [12] make this overlap a key research direction for the development of next-generation embedded devices. Thus, this has opened a research thrust between embedded devices and machine learning models termed “Embedded Machine Learning”, where machine learning models are executed within resource-constrained environments [13]. This research surveys key issues within this convergence of embedded systems and machine learning.
Machine learning methods such as SVMs for feature classification [14], CNNs for intrusion detection [15], and other deep learning techniques require high computational and memory resources for effective training and inferencing [16,17,18,19]. General-purpose CPUs, even with their architectural modifications over the years, including pipelining, deep cache memory hierarchies, multicore enhancements, etc., cannot meet the high computational demand of deep learning models. However, graphics processing units (GPUs), due to their high floating-point performance and thread-level parallelism, are more suitable for training deep learning models [13]. Extensive research is actively being carried out to develop suitable hardware acceleration units using FPGAs [20,21,22,23,24,25,26], GPUs, ASICs, and TPUs to create heterogeneous and sometimes distributed systems that meet the high computational demand of deep learning models. At both the algorithm and hardware levels, optimization techniques for classical machine learning and deep learning algorithms, such as pruning, quantization, reduced precision, and hardware acceleration, are being investigated to enable the efficient execution of machine learning models in mobile devices and other embedded systems [27,28,29].
The convergence of machine learning methods and embedded systems, in which computationally intensive machine learning models target the resource-constrained embedded environment, has opened a plethora of opportunities for research in computing technology. Although EML is still in its infancy, quite some work has been done to: (1) optimize different machine learning models to fit into resource-limited environments, (2) develop efficient hardware architectures (acceleration units) using custom chipsets to accelerate the implementation of these algorithms, and (3) create novel and innovative specialized hardware architectures to meet the high-performance requirements of these models. Thus, there is a need to bring these perspectives together to provide the interested researcher with the fundamental concepts of EML and further provide the computer architect with insights and possibilities within this space.
Interestingly, several surveys have been carried out to achieve this. For example, references [30,31] survey deep learning concepts, models, and optimizations. In these surveys, little consideration is given to hardware architectural design, which is a key concern in developing efficient machine learning systems. Pooja [32] surveys recent trends in hardware architectural design for machine learning applications, using the tensor processing unit as a case study. However, that work did not explore the different DNN architectures and only skimmed through some deep learning optimization techniques. Jiasi and Xukan [33], in their review, explored deep learning concepts, narrowing down on inference at end devices, but did not compare different embedded chipset architectures to inform which architecture or optimization is appropriate for the different DNN models. They also present applications of deep learning in end devices. They, however, only explored one type of deep learning model (DNNs) and did not discuss other deep learning models (CNNs, RNNs), which have gained attention in recent times. Sergio et al. [24] carried out a comprehensive survey on ML in embedded and mobile devices, presenting ML concepts and techniques for optimization, and also investigated different application areas. They, however, also do not explore other models of DNNs or make appropriate trade-offs. To address these drawbacks, this survey presents key compute and memory-intensive machine learning algorithms, namely the HMM, k-NN, SVM, GMM, and the different shades of DNNs (CNNs and RNNs), and presents hardware-based and algorithm-based optimization techniques required to compress these algorithms within resource-constrained environments. Finally, we consider diverse application areas where machine learning has been utilized in proffering solutions to stringent problems in this big data era. A comprehensive layout of this survey is presented in Figure 1.
Figure 1.
The layout of the survey: embedded machine learning computing architectures, machine learning models, and optimization techniques.
The key contributions of this survey are as follows:
- We present a survey of machine learning models commonly used in embedded systems applications.
- We describe an overview of compute-intensive machine learning models such as HMMs, k-NNs, SVMs, GMMs, and DNNs.
- We provide an overview of different optimization schemes adopted for these algorithms.
- We present an overview of the implementation of these algorithms within resource-limited environments such as MCUs, mobile devices, hardware accelerators, and TinyML.
- We survey the challenges faced in embedded machine learning and review different optimization techniques to enhance the execution of deep learning models within resource-constrained environments.
- We present diverse application areas of embedded machine learning, identify open issues and highlight key lessons learned for future research exploration.
The remainder of this paper is organized as follows. Section 2 presents embedded machine learning algorithms and specific optimization techniques, while Section 3 describes machine learning in resource-constrained environments (MCUs, mobile devices, acceleration units, and TinyML). Section 4 presents challenges and possible optimization opportunities in embedded machine learning. Section 5 provides diverse areas of applications of embedded machine learning technology, while Section 6 presents plausible research directions, open issues, and lessons learned. In Section 7, a concise conclusion is presented.
2. Embedded Machine Learning Techniques
Machine learning is a branch of artificial intelligence that describes techniques through which systems learn and make intelligent decisions from available data. Machine learning techniques can be classified under three major groups: supervised learning, unsupervised learning, and reinforcement learning, as described in Table 1. In supervised learning, a model learns from labeled data; in unsupervised learning, hidden patterns are discovered from unlabeled data; and in reinforcement learning, a system learns from its immediate environment through trial and error [34,35,36]. The process of learning is referred to as the training phase of the model and is often carried out using computer architectures with high computational resources, such as multiple GPUs. After learning, the trained model is then used to make intelligent decisions on new data. This process is referred to as the inference phase of the implementation. The inference is often intended to be carried out within user devices with low computational resources, such as IoT and mobile devices.
Table 1.
Machine learning techniques.
2.1. Scope of ML Techniques Overview
In recent times, machine learning techniques have been finding useful applications in various research areas and particularly in embedded computing systems. In this research, we surveyed recent literature concerning machine learning techniques implemented within resource-scarce environments, such as mobile devices and other IoT devices, between 2014 and 2020. We present the results of this survey in tabular form in Table 2. Our survey revealed that of all available machine learning techniques, SVMs, GMMs, DNNs, k-NNs, HMMs, decision trees, logistic regression, k-means, and naïve Bayes are the common techniques adopted for embedded and mobile applications. Naïve Bayes and decision trees have low complexity in terms of computation and memory costs and thus do not require innovative optimizations, as pointed out by Sayali and Channe [37]. Logistic regression algorithms are computationally cheaper than naïve Bayes and decision trees, meaning they have even lower complexity [38]. HMMs, k-NNs, SVMs, GMMs, and DNNs are, however, computationally and memory intensive and hence require novel optimization techniques before they can be efficiently squeezed within resource-limited environments. We have thus limited our focus to these compute-intensive ML models and discuss state-of-the-art optimization techniques through which these algorithms may be efficiently implemented within resource-constrained environments.
Table 2.
Machine Learning Techniques in Resource-Constrained Environments.
2.2. Hidden Markov Models
The hidden Markov model is an unsupervised machine learning technique based on augmenting the Markov chain [85]. The Markov chain is a technique that describes the probability of a sequence of events from a set of random variables. HMMs have been successfully adopted for speech recognition, activity recognition, and gesture tracking applications [86]. In [56], Patil and Thorat adopt an HMM within an embedded device for detecting diseases in grapes. Charissa and Song-bae in [61] implemented an HMM using a smartphone for recognizing human activities. HMMs are, however, compute and memory intensive and thus require some optimization techniques for effective execution in resource-limited environments [87].
2.2.1. The HMM Algorithm
The HMM is an algorithm that extracts meaningful information from available data through observing a sequence of “hidden states” or “hidden classes” in the data and can subsequently make accurate predictions of future states based on the current state [85]. Five important components that make up the hidden Markov model are the number of states, number of distinct observations, the state transition model, observation model, and initial state distribution [85]. To determine the probability of observations, a forward algorithm is adopted, while to predict the sequence of hidden states in the available data, the Viterbi algorithm is used. The learning phase of the HMM is carried out using the Baum-Welch Algorithm or the forward-backward algorithm [86]. An overview of these problems and algorithms is given by Equations (1)–(3), as defined in Table 3.
Table 3.
Problems and Algorithms of HMM.
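To make the forward algorithm referred to above concrete, the minimal NumPy sketch below evaluates the likelihood of an observation sequence for a toy HMM. The transition matrix A, observation model B, and initial distribution pi are illustrative placeholders, not values taken from any cited work.

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """Forward algorithm: probability of an observation sequence under an HMM.

    A   : (N, N) state transition model, A[i, j] = P(state j | state i)
    B   : (N, M) observation model, B[i, k] = P(observation k | state i)
    pi  : (N,)   initial state distribution
    obs : sequence of observation indices (length T)
    """
    alpha = pi * B[:, obs[0]]              # initialization, t = 0
    for o in obs[1:]:                      # induction, t = 1 .. T-1
        alpha = (alpha @ A) * B[:, o]      # sum over previous hidden states, then emit
    return alpha.sum()                     # termination: P(obs | model)

# Toy 2-state, 3-symbol model (illustrative values only).
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward_likelihood(A, B, pi, [0, 1, 2]))
```

The recursion touches every pair of hidden states at every time step, which is where the compute and memory cost discussed in the next subsection comes from.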
2.2.2. Some HMM Optimization Schemes
Although HMMs are suitable for different applications, they require a large amount of computational and memory resources for efficient implementation. Embedded and mobile devices are, however, resource-scarce environments and thus require novel optimization schemes for the efficient execution of HMMs. In [87], Toth and Nemeth presented an optimized HMM to target smartphone environments for speech synthesis. They optimized the HMM by selecting optimal parameters and implemented the model using fixed-point arithmetic instead of computation-intensive floating-point arithmetic. Fu et al. [88] proposed a series of optimization techniques including parameter reduction using decision tree-based clustering, model compression, feature size reduction, and fixed-point arithmetic implementation. Their optimized HMM targets resource-scarce embedded platforms for speech synthesis. A list of optimizations is presented in Table 4.
Table 4.
Optimization Schemes for HMM.
2.3. k-Nearest Neighbours
k-NN is a non-parametric and lazy supervised machine learning technique often adopted for classification problems, e.g., text categorization [89]. k-NN algorithms have been adopted in several embedded applications. For example, Hristo et al. [41] develop a mobile device fingerprinting system based on the k-NN algorithm. Additionally, Sudip et al. [76] develop a smartphone-based health monitoring system using the k-NN model. A smartphone-based data mining system is presented in [60] for fall detection using a k-NN algorithm. k-NN algorithms are also memory and compute intensive and require appropriate optimizations for resource-scarce environments.
2.3.1. The k-NN Algorithm
The k-NN algorithm, unlike other ML approaches, is a lazy learner because it does not use specialized training data to generalize; rather, it uses all available data to classify. It is also non-parametric because it does not make assumptions about the available data [40]. The algorithm computes the distance between the input sample and the other data points and, using a predefined “k” value, classifies the sample based on proximity to its nearest neighbors. The important Equations (4) and (5) that describe the k-NN model are defined in Table 5.
Table 5.
Important equations involved in k-NNs.
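A minimal k-NN classifier in NumPy is sketched below to illustrate why the method is memory and compute intensive: every stored training sample must be kept in memory and scanned for every prediction. The dataset and the k value are purely illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training samples."""
    # Euclidean distance to every stored sample: the whole training set
    # participates in each query, which is the memory/compute bottleneck.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative data: two 2-D clusters labeled 0 and 1.
X_train = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.1, 0.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))
```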
2.3.2. Some k-NN Optimization Schemes
k-NN models require the entire input dataset to be stored in memory for prediction to be done; hence they are memory and compute intensive [70]. To improve the efficiency of the k-NN model, Li et al. [89] proposed an improved k-NN that uses different k values for different categories instead of a predefined fixed k value. Norouzi et al. [90] investigated the optimization of k-NN algorithms by mapping input features to binary codes, which are very memory efficient. There have also been some hardware-oriented optimization schemes to accelerate k-NN models: Saikia et al. [91] and Mohsin and Perera [55] developed acceleration units to increase the execution speed of k-NN models in resource-scarce environments. Also, Gupta et al. [70] proposed a modified k-NN model termed ProtoNN, which is a highly compressed k-NN model suitable for resource-limited IoT devices. Table 6 describes some k-NN optimization schemes.
Table 6.
Some k-NN Optimization Schemes.
2.4. Support Vector Machines
Support vector machine is a supervised machine learning technique based on Vapnik’s statistical learning theory often adopted for object classification problems. SVMs have been successfully applied to regression, ranking, and clustering problems [92]. Also, SVMs have been useful in the prediction of the power and performance, auto-tuning, and runtime scheduling of high-performance applications [93]. In [2] for example, SVM is adopted in maintaining a near-optimal configuration of a MANET. Also, SVMs are used in the design and development of a low-cost and energy-efficient intelligent sensor [94]. SVMs are, however, computationally and memory intensive and thus require hardware acceleration units to be effectively executed in resource-limited situations. In [95], the FPGA hardware implementation of an SVM is surveyed with optimization techniques.
2.4.1. The SVM Algorithm
An SVM is a linear or non-linear classifier that can identify two distinct objects by separating them into two unique classes with high accuracy. The SVM is first trained, during which a hyperplane is developed that separates the data belonging to each unique class. The training samples that lie closest to and define this hyperplane are referred to as “support vectors”, and they are subsequently used to classify new data. The problem equation for training an SVM is given in Equation (6):

W(α) = ∑_i α_i − (1/2) ∑_i ∑_j α_i α_j y_i y_j K(x_i, x_j), subject to 0 ≤ α_i ≤ C and ∑_i α_i y_i = 0,  (6)

where α_i are the Lagrange multipliers, K(x_i, x_j) are the kernel functions, x_i and y_i are the training data points and their class labels, and W is the quadratic objective function to be maximized.
The algorithm flow is given in the pseudo-code, as shown in Algorithm 1.
Algorithm 1. Pseudocode for training a support vector machine

Require: X and y loaded with training labeled data, α ⇐ 0 or α ⇐ partially trained SVM
1: C ⇐ some value (10 for example)
2: repeat
3:   for all {x_i, y_i}, {x_j, y_j} do
4:     Optimize α_i and α_j
5:   end for
6: until no changes in α or other resource constraint criteria met
Ensure: Retain only the support vectors (α_i > 0)
The implementation of the training, testing, and prediction phases of an SVM involves kernel functions, such as dot products and Gaussian kernel functions. These computations make SVMs computationally intensive. Additionally, the large amount of training data required for accurate prediction makes SVM models memory intensive. Thus, efficient optimizations are required both at the hardware architecture and algorithm levels. To efficiently compute kernel functions, hardware acceleration units have been developed using FPGAs so that these computationally intensive operations can be moved to the hardware [92]. Also, at the algorithm level, a sequential minimal optimization method may be used to reduce the memory usage [96].
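To make the cost of kernel evaluation concrete, the sketch below evaluates the decision function of an already trained SVM with a Gaussian (RBF) kernel: one kernel evaluation per support vector per query, which is exactly the part usually offloaded to hardware. The support vectors, dual coefficients, and bias here are illustrative placeholders, not the result of a real training run.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Decision value f(x) = sum_i alpha_i * y_i * K(sv_i, x) + b.

    The cost grows linearly with the number of support vectors, which is
    why kernel evaluation is the usual target for hardware acceleration.
    """
    k = np.array([rbf_kernel(sv, x, gamma) for sv in support_vectors])
    return float(dual_coefs @ k + bias)

# Illustrative trained model (placeholder values).
support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
dual_coefs = np.array([0.8, -0.5, 0.7])   # alpha_i * y_i
bias = -0.1
print(np.sign(svm_decision(np.array([0.9, 0.8]), support_vectors, dual_coefs, bias)))
```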
2.4.2. Some SVM Optimization Schemes
SVM techniques are compute and memory intensive and thus require appropriate optimization methods to be successfully executed in resource-constrained environments. Several works in the literature have investigated the optimization of SVM models. The training of an SVM involves solving a quadratic programming (QP) problem, and thus, to solve this problem optimally, optimization techniques involving chunking, decomposition, or sequential minimal optimization may be carried out to reduce the memory footprint required for training the model. For inference, bit-precision techniques, logarithmic number representations, quantization, etc., are some optimization techniques that may be applied to fit SVM models within resource-constrained environments. In [94], Boni et al. develop a model selection technique targeted at reducing SVMs for resource-constrained environments. Table 7 presents a comprehensive list of SVM optimization schemes for both the training and classification phases. Interestingly, the kernel selection also informs the computational requirement of SVM models. Some kernel types are the Laplacian kernel, the Gaussian kernel, the sigmoid kernel, the linear kernel, etc. Of these kernel types, the most suitable for resource-constrained environments is the Laplacian kernel because it can be implemented using shifters [97].
Table 7.
SVM Optimization Schemes.
2.5. Gaussian Mixture Model
GMMs are density models capable of representing a large class of sample distributions. They are used in finding the traffic density patterns in a large set of data. This characteristic makes them suitable for analyzing large sensor data in IoT devices and biometric systems, particularly for speaker recognition systems [105]. In [50], GMM is adopted within an embedded board for analyzing the volume of sensor data at run time to monitor certain conditions of a system. Although GMMs are efficient, deep learning models pose a more effective method of analyzing raw sensor data.
2.5.1. The GMM Algorithm
A GMM is a weighted sum of M component Gaussian densities as shown in Equations (7a) and (7b). The parameters of a GMM are retrieved during the training phase using an expectation-maximization (EM) algorithm or maximum a posteriori (MAP) estimation technique. The accuracy of a GMM hugely depends on the amount of computational power and memory bandwidth required to implement the model:
p(x | λ) = ∑_{i=1}^{M} w_i g(x | μ_i, Σ_i),  (7a)

g(x | μ_i, Σ_i) = (1 / ((2π)^{D/2} |Σ_i|^{1/2})) exp(−(1/2) (x − μ_i)ᵀ Σ_i^{−1} (x − μ_i)),  (7b)

where x is a D-dimensional continuous-valued data vector (features), w_i (i = 1, …, M) are the mixture weights, p(x | λ) is the probability density function, g(x | μ_i, Σ_i) are the component Gaussian densities, μ_i is the mean vector, and Σ_i is the covariance matrix.
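As a concrete illustration of Equations (7a) and (7b), the sketch below evaluates the mixture density for one sample; the weights, means, and covariances are illustrative placeholders rather than parameters estimated by EM or MAP.

```python
import numpy as np

def gmm_density(x, weights, means, covs):
    """Evaluate p(x | lambda) = sum_i w_i * N(x | mu_i, Sigma_i) for one sample.

    weights : (M,)      mixture weights, summing to 1
    means   : (M, D)    component mean vectors
    covs    : (M, D, D) component covariance matrices
    """
    D = x.shape[0]
    p = 0.0
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        norm = 1.0 / np.sqrt(((2 * np.pi) ** D) * np.linalg.det(cov))
        p += w * norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
    return p

# Illustrative 2-component, 2-D mixture (placeholder parameters).
weights = np.array([0.4, 0.6])
means = np.array([[0.0, 0.0], [2.0, 2.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2)])
print(gmm_density(np.array([1.0, 1.0]), weights, means, covs))
```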
2.5.2. Some GMM Optimization Schemes
GMMs are used in representing and modeling large volumes of data, as in sensor data systems and background modeling tasks [105]. This characteristic makes GMMs highly computationally and memory intensive and unsuitable for real-time-oriented applications. Some literature has explored optimization techniques for reducing the computational requirement and memory footprint of GMMs. Pushkar and Bharadwaj [106] propose an enhanced GMM algorithm that minimizes the floating-point computations, modifies the switching schedule, and automatically selects the number of modes to target resource-constrained embedded environments. Additionally, Shen et al. in [107] propose an optimized GMM model based on compressive sensing to reduce the dimensionality of the data while still retaining the relevant information. This technique is computationally and memory efficient. In another publication, Salvadori et al. [39] proposed a GMM optimization technique based on integers to target processors with no floating-point unit (FPU). This work showed low computation cost and a greatly reduced memory footprint. A list of these optimization schemes is presented in Table 8.
Table 8.
Some GMM Optimization Schemes.
2.6. Deep Learning Models
Deep learning models are machine learning techniques that model the human brain [30]. They use a hierarchical array of layers to learn from available data and make new predictions based on the information they extract from the set of raw data [13,30,31]. The primary layer types that make up a deep learning model are pooling layers, convolutional layers, classifier layers, and local response normalization layers. The high accuracy of deep learning algorithms in various areas of application has made them very attractive for use in recent times. However, the computational power of hardware architectures is being pushed hard to meet the computational demand of these models and enable their optimal implementation. In this survey, we explore the three main classes of deep learning models, which are DNNs, CNNs, and RNNs. Table 9 describes popular DNN models with their parameters.
Table 9.
Description of some DNN models and their parameters.
Furthermore, like other machine learning models, deep learning models go through three phases: training, testing, and prediction. The training phase of deep learning models is carried out using a feedforward technique that entails sequentially passing data through the entire network for a prediction to be made and then back-propagating the error through the network. The technique for backpropagation is called stochastic gradient descent (SGD), which adjusts the weights or synapses of each layer in the network using a non-linear activation function (tanh, sigmoid, rectified linear unit (ReLU)) [108,109,110]. Lin and Juan [111] explore the possibility of developing an efficient hardware architecture to accelerate the activation function of the network. The training process is often carried out many times for the model to learn efficiently, and then, using the trained model, predictions are made on new data. The training is very computationally and memory intensive and is often carried out offline using very high-performance computing resources mostly found in large data centers, while the inference targets low-cost and resource-constrained environments.
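A minimal sketch of the feedforward and backpropagation steps just described is given below for a single ReLU layer trained with a mean-squared-error loss and SGD; the layer sizes, learning rate, and data are illustrative and do not correspond to any cited experiment.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sgd_step(W, b, x, target, lr=0.1):
    """One stochastic gradient descent step for a single ReLU layer
    with a mean-squared-error loss (illustrative, not a full DNN)."""
    z = W @ x + b                   # feedforward: weighted sum of inputs
    y = relu(z)                     # non-linear activation
    err = y - target                # output error
    grad_z = err * (z > 0)          # backpropagate the error through ReLU
    W -= lr * np.outer(grad_z, x)   # adjust synaptic weights
    b -= lr * grad_z
    return W, b, 0.5 * float(err @ err)

# Illustrative data: one 4-D input sample and a 3-D target.
x, target = np.array([0.5, -0.2, 0.1, 0.3]), np.array([1.0, 0.0, 0.5])
rng = np.random.default_rng(0)
W, b = 0.1 * rng.standard_normal((3, 4)), 0.5 * np.ones(3)
for step in range(200):
    W, b, loss = sgd_step(W, b, x, target)
print(f"loss after training: {loss:.6f}")
```

Repeating this update over many samples and layers is what makes training so compute and memory hungry compared with a single inference pass.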
2.6.1. Convolution Layers
The convolution layers are the input layers of most DNNs, and they are used to extract characteristic features from a given input data using a set of filters. These filters are vector products (kernels), and their coefficients form the synaptic layer weights [112]. Convolution layers thus perform the major number of multiply and accumulate operations in the entire network, which means they are computationally intensive and are the major drawback to the real-time performance of deep learning models. Hardware acceleration units can be developed to accelerate these layers to reduce the latency of implementing deep learning models. The output neuron of a convolution is described in Equation (8):
out(f_o, x, y) = ∑_{f_i} ∑_{k_x=0}^{K−1} ∑_{k_y=0}^{K−1} w(f_o, f_i, k_x, k_y) × in(f_i, x + k_x, y + k_y),  (8)

where out(f_o, x, y) represents the output neuron and in(f_i, x, y) represents the input neuron at positions x and y, w(f_o, f_i, k_x, k_y) represents the synaptic weights, (k_x, k_y) represents the kernel position within the K × K kernel, and f_i and f_o represent the input and output feature maps, respectively.
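The nested loops below implement Equation (8) directly for a small example, which makes the multiply-and-accumulate (MAC) count of a convolution layer explicit; the feature map and kernel sizes are illustrative.

```python
import numpy as np

def conv_layer(inp, weights):
    """Direct convolution following Equation (8).

    inp     : (Fi, H, W)         input feature maps
    weights : (Fo, Fi, K, K)     one K x K kernel per (output, input) map pair
    returns : (Fo, H-K+1, W-K+1) output feature maps (no padding, stride 1)
    """
    Fo, Fi, K, _ = weights.shape
    _, H, W = inp.shape
    out = np.zeros((Fo, H - K + 1, W - K + 1))
    for fo in range(Fo):
        for y in range(H - K + 1):
            for x in range(W - K + 1):
                acc = 0.0
                for fi in range(Fi):          # every input feature map
                    for ky in range(K):       # kernel rows
                        for kx in range(K):   # kernel columns: one MAC each
                            acc += weights[fo, fi, ky, kx] * inp[fi, y + ky, x + kx]
                out[fo, y, x] = acc
    return out

# Illustrative sizes: 3 input maps, 8 output maps, 5x5 kernels, 32x32 input.
out = conv_layer(np.ones((3, 32, 32)), np.ones((8, 3, 5, 5)))
print(out.shape)  # (8, 28, 28) -> 8*28*28*3*5*5 = 470,400 MACs for this tiny layer
```

Even this toy layer performs hundreds of thousands of MACs, which is why convolution layers dominate the compute budget and are the first candidates for hardware acceleration.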
2.6.2. Pooling Layers
The pooling layers are used to subsample the feature maps obtained from the convolutions by computing the maximum/average of neighboring neurons in the same feature map. In summary, this layer helps reduce the dimensionality of the input, thereby reducing the total number of inputs into subsequent layers as we traverse down the neural network [112]. The pooling layer has no synaptic weights attached. Some research prunes away this layer from the entire network to reduce computation time. The equation used to evaluate a pooling layer is given in Equation (9):
out(f, x, y) = max_{0 ≤ k_x, k_y < K} in(f, x + k_x, y + k_y),  (9)

where out(f, x, y) is the output neuron at position (x, y), K is the size of the pooling window in the x and y directions, (k_x, k_y) are the kernel positions in the x and y directions, respectively, and f indexes the output feature maps (the average-pooling variant replaces the maximum with the mean over the same window).
2.6.3. Normalization Layers
These layers introduce competition between neurons at the same location but in different feature maps. They perform a process very similar to lateral inhibition in biological neurons. Their values may be evaluated from Equation (10). Some research works skip this layer in DNN implementations to accelerate the entire network.
out(f, x, y) = in(f, x, y) × (c + α ∑_{g} in(g, x, y)²)^{−β},  (10)

where out(f, x, y) are the output neurons, in(g, x, y) are the input neurons (input features), f and g index the output and neighboring input feature maps, respectively, the summation runs over the k feature maps adjacent to f, and c, α, and β are constants.
2.6.4. Fully-Connected Layers
These are layers where all the output features are fully connected to all the input features using synaptic weights. Each output neuron is a weighted sum of all the input neurons. Equation (11) describes the value of each output neuron. These layers are often used for classification and output. Interestingly, because this layer is fed with the processed output features of the previous layers, i.e., the convolution, pooling, and normalization layers, the number of input features is always lower than for the other layers, i.e., they perform a reduced amount of matrix multiplication. However, the full connection using synaptic weights makes them very memory hungry, as they account for the largest share of all synaptic weights:
out(j) = t(∑_i w(i, j) × in(i)),  (11)

where “t” is the non-linear activation function, in(i) are the input features, w(i, j) are the synaptic weights, and i and j index the input and output feature maps, respectively.
2.6.5. Fully-Connected Deep Neural Networks
DNNs are a class of neural networks that are used for speech recognition applications, extracting high-level human behaviors, etc. The DNN architecture is such that all the layers in the network are fully connected, and the output neuron with the highest activation value gives the required prediction. This characteristic feature makes them suitable for learning from unstructured data. Owing to the full connectivity of all the layers in the network, DNNs are less computationally intensive than CNNs but highly memory intensive. More explicitly, because they have to perform routine multiply-accumulate computations, their computational logic is not complex, but they require a large enough memory to store the synapses and activations. Therefore, optimizations for FC DNNs concern more memory-centric thrusts like model compression, sparsity, pruning, and quantization. In [113], DNN models are developed to be implemented in mobile devices by distributing the computation across different processors within the mobile device. A fully connected DNN is also referred to as a multilayer perceptron (MLP).
2.6.6. Convolutional Neural Networks
CNNs are a class of neural networks that are suitable for computer vision applications, object recognition applications, etc. [114,115,116]. In ConvNets, key features in an image are extracted and then converted into complex representations using the pooling layers, and subsequently, the fully connected layers classify and identify the image appropriately. The CNN architecture is primarily made up of a large portion of convolution layers, followed by a few fully connected layers. Some CNNs also add pooling and normalization layers in-between the convolution and fully connected layers. The presence of convolution layers that perform kernel functions (vector-matrix multiplications) makes CNNs very computation intensive but less memory hungry, owing to the few fully connected layers. Therefore, optimizations for CNN models concern more compute-centric directions such as innovative hardware acceleration developments, processor technology, tiling and data reuse, reduced precision, quantization, etc. Popular CNNs are ResNets, LeNets, AlexNets, GoogLeNets, VGGNets, etc. The computation and memory requirements of some CNN models are given in Figure 2 and summarized in Table 10.
Figure 2.
Compute and memory demands of a spectrum of CNN models: (a) memory demand in terms of the number of weight parameters (millions); (b) computational demand in terms of the number of operations (GOPs).
Table 10.
Parameters for popular CNN models.
2.6.7. Recurrent Neural Networks
RNNs are a class of neural networks used for natural language translation applications, speech recognition, etc. RNNs, unlike CNNs, process input data sequentially and store the previous elements. They thus retain and use past information, which makes them suitable for text prediction applications and for suggesting words in a sentence. The architecture of an RNN model is primarily made up of fully connected layers and normalization layers. This organization makes RNNs memory-centric in operation, since they have to store weight synapses in available memory. A type of RNN referred to as long short-term memory (LSTM) has been gaining interest in recent times and is proving to be more effective than conventional RNNs [117,118,119,120]. Khan et al. [121] developed a data analytic framework with real-world applications using Spark machine learning and LSTM techniques. There are great hardware acceleration opportunities in implementing RNNs [122,123,124], and [23] explores the FPGA acceleration of an RNN. Table 11 describes some DNN optimization schemes.
Table 11.
Some DNN Optimization Schemes.
3. Machine Learning in Resource-Constrained Environments
Machine learning techniques are currently targeting resource-scarce environments such as mobile devices, embedded devices, and other internet of things devices. In this section, we present an overview of different resource-limited environments like microcontroller units (MCUs) and mobile devices. We also discuss the option of hardware acceleration units used to speed up the execution of these algorithms in resource-scarce environments.
3.1. Machine Learning Using Microcontrollers
Microcontrollers are at the forefront of the hardware provisioned to implement diverse embedded systems and other IoT applications [126]. An MCU consists of a microprocessor, memory, I/O ports, and other peripherals, all integrated into one chip. At the processor core of MCUs are general-purpose CPUs for adequate computation. Table 12 describes a list of popular MCUs in terms of their compute and memory resources. From this list, we can observe the resource limitation of most MCUs in terms of available power and on-chip memory (flash + SRAM). This resource limitation is a critical drawback in the implementation of machine learning models. For example, some typical CNN model sizes are AlexNet (240 MB) [27], VGG-16 (528 MB) [127], and VGG-19 (549 MB) [127]. Model size describes the number of bytes required to store all the parameters of the model. In this study, we survey techniques for compressing these models to fit into the available memory of MCUs so that computation can be done efficiently.
Table 12.
Microcontroller units comparison.
Table 12 presents a list of different microcontrollers categorized by their clock frequency, available flash memory, SRAM, and current consumption. As can be observed in this table, microcontrollers have very limited hardware resources. This scarcity of resources makes them unsuitable for high-end machine learning applications, unless the machine learning models are heavily optimized to fit within this space [24].
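A quick back-of-the-envelope sketch of this memory gap is shown below: it estimates the storage needed for a model's parameters at different bit widths and compares it against an MCU memory budget. The parameter counts are the widely reported approximate figures for AlexNet and VGG-16, and the 2 MB flash budget is an illustrative assumption, not a value from Table 12.

```python
def model_storage_mb(num_params, bits_per_param):
    """Bytes needed to store all parameters, expressed in megabytes."""
    return num_params * bits_per_param / 8 / 1e6

models = {"AlexNet": 61e6, "VGG-16": 138e6}   # approximate parameter counts
mcu_flash_mb = 2.0                             # illustrative MCU flash budget

for name, params in models.items():
    for bits in (32, 8, 1):
        size = model_storage_mb(params, bits)
        fits = "fits" if size <= mcu_flash_mb else "does not fit"
        print(f"{name} @ {bits}-bit: {size:.1f} MB ({fits} in {mcu_flash_mb} MB flash)")
```

Even aggressive quantization alone does not close the gap for large CNNs, which is why pruning, compression, and architecture changes are combined in practice.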
3.2. Machine Learning Using Hardware Accelerators
General-purpose CPU chipsets, although ubiquitous, do not possess enough computational capability to process compute-intensive ML algorithms. To address this drawback, hardware acceleration units may be developed using GPUs, FPGAs, and even ASICs. However, the most popular accelerators are FPGA-based, owing to the programmability of FPGAs. Hardware acceleration is developed such that compute-intensive segments of the ML algorithm, e.g., kernel operations, are offloaded to the specialized accelerator, thereby leaving the CPU to process much simpler operations and improving the overall computation speed of the system. Some machine learning accelerators are Arm Ethos NPUs, Intel's Movidius NCS, Nvidia's Jetson Nano, etc. Table 13 presents some of these accelerators. A critical drawback in adopting these accelerators is cost. Also, some works in the literature have explored the development of FPGA-based accelerators for machine learning algorithms. Wang et al. [22] proposed a scalable deep learning accelerator to accelerate the kernel computations of deep learning. The authors introduce optimization schemes such as tiling, pipelining, FIFO buffers, and data reuse to further improve their proposed architecture.
Table 13.
Machine learning accelerators.
3.3. Machine Learning in Mobile Devices
Machine learning techniques are gradually permeating mobile devices for applications such as speech recognition, computer vision, etc. Mobile devices may be categorized under resource-constrained systems owing to their limited computational and memory resources. Hence, for machine learning algorithms to be successfully implemented within these devices, appropriate optimizations must be carried out. Lane et al. [137] developed a software accelerator for accelerating the execution of deep learning models within mobile devices. We present a survey of some mobile machine learning applications in the literature as tabulated in Table 14.
Table 14.
Some literature on mobile machine learning.
3.4. TinyML
Machine learning inference at the edge, particularly within very low-power MCUs, is gaining increased interest in the ML community. This interest pivots on creating a suitable platform where ML models may be efficiently executed within IoT devices, and it has opened a growing research area in embedded machine learning termed TinyML. TinyML is a machine learning technique that integrates compressed and optimized machine learning models to suit very low-power MCUs [141]. TinyML primarily differs from cloud machine learning (where compute-intensive models are implemented using high-end computers in large data centers, such as Facebook's [142]) and mobile machine learning in terms of its very low power consumption (on average 0.1 W), as shown in Table 15. TinyML creates a platform whereby machine learning models are pushed to user devices to provide a good user experience for diverse applications, and it has advantages such as energy efficiency, reduced costs, data security, and low latency, which are major concerns in contemporary cloud computing technology [141]. Colby et al. [143] presented a survey where neural network architectures (MicroNets) target commodity microcontroller units. The authors efficiently ported MicroNets to MCUs using the TensorFlow Lite Micro platform. Different platforms have been developed to easily port ML algorithms to resource-constrained environments. Table 16 presents a list of available TinyML frameworks commonly adopted to push ML models into different compatible resource-limited devices.
Table 15.
Comparison of CloudML and TinyML Platforms.
Table 16.
TinyML Framework.
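As a concrete example of the workflow such frameworks enable, the sketch below converts a trained Keras model to a quantized TensorFlow Lite flat buffer, which is the usual starting point before deployment with TensorFlow Lite Micro on an MCU; the model itself is a placeholder standing in for a properly trained network.

```python
import tensorflow as tf

# Placeholder model; in practice this would be a trained network.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Convert to a TensorFlow Lite flat buffer with default optimizations
# (dynamic-range quantization of the weights to 8 bits).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# The resulting bytes are what gets embedded (e.g., as a C array) on the MCU.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print(f"model size: {len(tflite_model) / 1024:.1f} KiB")
```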
4. Challenges and Optimization Opportunities in Embedded Machine Learning
Embedded computing systems are generally limited in terms of available computational power and memory. Furthermore, they are required to consume very low power and to meet real-time constraints. Thus, for these computationally intensive machine learning models to be executed efficiently in the embedded systems space, appropriate optimizations are required both at the hardware architecture and algorithm levels [148,149]. In this section, we survey optimization methods to tackle bottlenecks in terms of power consumption, memory footprint, latency and throughput concerns, and accuracy loss.
4.1. Power Consumption
The total energy consumed by an embedded computing application is the sum of the energy required to fetch data from the available memory storage and the energy required to perform the necessary computation in the processor. Table 17 shows the energy required to perform different operations in an ASIC. It can be observed from Table 17 that the amount of energy required to fetch data from the on-chip SRAM is much less than that required to fetch data from the off-chip DRAM, and is minimal when the computation is done at the register files. From this insight, we can conclude that computation should be done as close to the processor as possible to save energy. However, this is a bottleneck because the standard size of available on-chip memory in embedded architectures is very low compared to the size of deep learning models [124]. Algorithm-based optimization techniques for model compression such as parameter pruning, sparsity, and quantization may be applied to address this challenge [150]. Also, hardware design-based optimizations such as tiling and data reuse may be utilized [25]. The next section expatiates on some of these optimization methods in further detail. Furthermore, most machine learning models, especially deep learning models, require huge numbers of multiply and accumulate (MAC) operations for effective training and inference. Figure 3 describes the power consumed by the MAC unit as a function of the bit precision adopted by the system. We may observe that the higher the number of bits, the higher the power consumed. Thus, to reduce the power consumed during computation, reduced bit-precision arithmetic and data quantization may be utilized [151].
Table 17.
Energy Consumption in (pJ) of performing operations.
Figure 3.
This graph describes the energy consumption and prediction accuracy of a DNN as a function of the Arithmetic Precision adopted for a single MAC unit in a 45 nm CMOS [124]. It may be deduced from the graph that lower number precisions consume less power than high precisions with no loss in prediction accuracy. However, we can observe that when precision is reduced below a particular threshold (16 bit fp), the accuracy of the model is greatly affected. Thus, quantization may be performed successfully to conserve energy but quantizing below 16-bit fp may require retraining and fine-tuning to restore the accuracy of the model.
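The relationship stated above (total energy = data movement energy + computation energy) can be captured in a very simple model, sketched below. The per-operation costs used here are rough, illustrative placeholders in the spirit of Table 17, not the table's actual values; the point is only that off-chip DRAM accesses dominate the budget, so keeping data on-chip pays off.

```python
def inference_energy_uj(n_macs, n_sram_reads, n_dram_reads,
                        e_mac_pj=3.0, e_sram_pj=10.0, e_dram_pj=640.0):
    """Rough energy model with illustrative per-operation costs in picojoules.

    The default numbers are placeholders of plausible orders of magnitude only;
    a DRAM access is typically far costlier than a single MAC or SRAM read.
    """
    total_pj = (n_macs * e_mac_pj
                + n_sram_reads * e_sram_pj
                + n_dram_reads * e_dram_pj)
    return total_pj / 1e6  # picojoules -> microjoules

# Same workload with parameters streamed from DRAM vs. mostly cached on-chip.
print(inference_energy_uj(n_macs=1e6, n_sram_reads=1e5, n_dram_reads=1e6))
print(inference_energy_uj(n_macs=1e6, n_sram_reads=1e6, n_dram_reads=1e4))
```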
4.2. Memory Footprint
The available on-chip and off-chip memory in embedded systems is very limited compared to the size of ML model parameters (synapses and activations) [27]. Thus, there is a bottleneck in storing model parameters and activations within this constrained memory. Network pruning (removing redundant parameters) [150] and data quantization (reducing the number of bits used to represent model parameters) [151] are the primary optimization techniques adopted to significantly compress the overall model size such that it can fit into the standard memory sizes of embedded computers.
4.3. Latency and Throughput Concerns
Embedded systems are required to meet real-time deadlines. Thus, latency and overall throughput can be a major concern, as an inability to meet these tight constraints could sometimes result in devastating consequences. The parameters of deep learning models are very large and are often stored off-chip or on external SD cards, which introduces latency concerns. Latency results from the time required to fetch model parameters from off-chip DRAM or external SD cards before appropriate computation can be performed on these parameters [150]. Thus, storing the parameters as close as possible to the computation unit, using tiling and data reuse together with hardware-oriented direct memory access (DMA) optimization techniques, would reduce the latency and thus yield high computation speed [152]. In addition, because ML models require a high level of parallelism for efficient performance, throughput is a major issue. Memory throughput can be optimized by introducing pipelining [20].
4.4. Prediction Accuracy
Although deep learning models are tolerant of low bit precision [153], reducing the bit precision below a certain threshold could significantly affect the prediction accuracy of these models and introduce non-trivial errors, which could be costly for the embedded application. To address the errors that model compression techniques such as reduced precision or quantization introduce, the compressed model can be retrained or fine-tuned to restore prediction accuracy [124,150,154,155].
4.5. Some Hardware-Oriented and Algorithm-Based Optimization Techniques
Hardware acceleration units may be designed using custom FPGAs or ASICs to achieve low latency and high throughput. These designs are such that they may optimize data access from external memory and/or introduce an efficient pipeline structure using buffers to increase the throughput of the architecture. In sum, some hardware-based and algorithm-based optimization techniques are presented in this section to guide computer architects in designing and developing highly efficient acceleration units for high performance.
4.5.1. Tiling and Data Reuse
Tiling is a technique that involves decomposing a large volume of data into small tiles that can be cached on-chip [25,156]. This technique targets the bottleneck of memory footprint in resource-constrained environments. It also introduces scalability to the entire architecture, as different volumes of data can easily be broken down into tiles that may be stored on-chip. Moreover, this technique reduces the latency of the system, as tiled inputs may easily be reused for computation without having to re-fetch parameters from off-chip. Furthermore, since tiled data can be stored on-chip, energy consumption is reduced. Hardware accelerators may be designed and developed to integrate a tiling unit to carry out the tiling process [22]. The pseudocode of the tiling process is given in Algorithm 2.
Algorithm 2. Pseudocode of the tiling process

Require:
  Ni: the number of the input neurons
  No: the number of the output neurons
  Tile_Size: the tile size of the input data
  batchsize: the batch size of the input data
for n = 0; n < batchsize; n++ do
  for k = 0; k < Ni; k += Tile_Size do
    for j = 0; j < No; j++ do
      y[n][j] = 0;
      for i = k; i < k + Tile_Size && i < Ni; i++ do
        y[n][j] += w[i][j] * x[n][i]
        if i == Ni − 1 then
          y[n][j] = f(y[n][j]);
        end if
      end for
    end for
  end for
end for
4.5.2. Direct Memory Access and On-Chip Buffers
Owing to the limited computation and memory resources of custom FPGAs, more recent FPGA architectures provide general-purpose processors and external memory so that computation can be offloaded from the FPGA processing logic to the CPU [157]. This architectural organization, if not properly utilized, can result in latency concerns. DMAs are units that transfer data between the external memory and the on-chip buffers in the processing logic of the FPGA [152]. Thus, optimizing this process leads to efficient performance in execution speed.
4.5.3. Layer Acceleration
A deep learning network is made up of different kinds of layers (pooling, normalization, fully connected, convolution, etc.). A technique for speeding up the execution of a network and saving memory storage is to design and develop specialized architectures to accelerate particular layers in the network [26]. In [22], an accelerator is designed using FPGA technology to accelerate certain parts of a deep neural network. Also, in [51], an accelerator is designed to accelerate the convolution layers in a CNN. Network layer acceleration is a hardware-oriented optimization scheme and could pose challenges such as hardware design complexity and long time-to-market, since specialized architectures are often required [148].
4.5.4. Network Pruning
Network pruning is concerned with the removal of certain parts of the network, ranging from weights to layers, that do not contribute to the overall efficiency of the network [158]. It entails rounding off certain unused weights and activations to zero to reduce the total memory and computation resources required. These weights and activations are such that their removal does not alter the accuracy of the model. Pruning can either be structured or unstructured [158]. In [124], the pruning technique is employed to compress a neural network within the resource-constrained environments of embedded and mobile devices. The pruning entailed removing, i.e., setting to zero, weights that are lower than a certain threshold value, and the pruned network is then retrained to restore accuracy.
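A minimal sketch of magnitude-based (unstructured) pruning as described above is given below: weights whose absolute value falls below a threshold are set to zero, and the surviving weights would then be retrained or fine-tuned with the pruning mask held fixed. The threshold and weight matrix are illustrative.

```python
import numpy as np

def magnitude_prune(weights, threshold):
    """Zero out weights below the threshold; return pruned weights and a mask.

    The mask can be reused during fine-tuning so that pruned weights stay zero.
    """
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))          # illustrative dense layer weights
W_pruned, mask = magnitude_prune(W, threshold=1.0)
sparsity = 1.0 - mask.mean()
print(f"sparsity after pruning: {sparsity:.1%}")  # fraction of weights removed
```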
4.5.5. Reduced Precision
Machine learning algorithms, particularly deep learning models, were originally implemented using high-precision floating-point number formats [159,160]. A floating-point number is made up of a sign, an exponent, and a mantissa [160], and a single floating-point value can be computed using the formula presented in Equation (12). The floating-point number system is, however, power-hungry and thus unsuitable for embedded machine learning applications owing to resource constraints [161,162]. Moreover, floating-point arithmetic faces drawbacks such as handling overflow, underflow, and exceptions [163]. This has made the fixed-point number system a better alternative owing to the reduced complexity and power consumption of its implementation, with the combined range given in Equation (13) [164,165]. Hwang and Sung [166] investigate the ternary quantization of a feedforward deep neural network using fixed-point arithmetic. Additionally, [167] considers training a deep network using 16-bit fixed-point arithmetic with stochastic rounding to minimize accuracy loss. Although fixed-point arithmetic is power efficient, it is not well suited to representing deep learning parameters, which are non-linearly distributed. Gustafson and Yonemoto [168] present a new number system called posit, given in Equation (14). Langroudi et al. [163] adopt the posit number system in training a deep convolutional neural network.
value = (−1)^sign × (1.mantissa) × 2^(exponent − bias),  (12)

where value is the floating-point value, sign is the sign bit, exponent is the biased exponent field, and mantissa is the mantissa (fraction) field.

x = x_int × 2^(−QF), with x ∈ [−2^(QI−1), 2^(QI−1) − 2^(−QF)] and resolution ε = 2^(−QF),  (13)

where x_int represents the input integer, QI = # of integer bits, QF = # of fractional bits, and ε is the resolution of the fixed-point number.

X = (−1)^s × useed^k × 2^e × (1 + f), with useed = 2^(2^es),  (14)

where X represents the posit value, k represents the number regime (the run length of the regime bits), useed represents the scale factor, e is the exponent, and f is the fraction.
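The fixed-point conversion behind Equation (13) can be sketched as follows: a real value is scaled by 2^QF, rounded to an integer, and clamped to the representable range. The Q-format chosen here (QI = 2, QF = 6, i.e., 8 bits total) is illustrative.

```python
import numpy as np

def to_fixed_point(x, qi=2, qf=6):
    """Quantize to a signed fixed-point format with QI integer and QF fractional bits.

    Representable range: [-2**(QI-1), 2**(QI-1) - 2**-QF], resolution 2**-QF.
    """
    scale = 2 ** qf
    lo, hi = -2 ** (qi - 1) * scale, 2 ** (qi - 1) * scale - 1
    q = np.clip(np.round(x * scale), lo, hi).astype(np.int32)
    return q, q / scale  # stored integer and the value it represents

vals = np.array([-1.37, 0.0, 0.4142, 1.999, 5.0])
ints, approx = to_fixed_point(vals)
print(ints)    # stored 8-bit integers (held in int32 here for clarity)
print(approx)  # quantized values; e.g. 5.0 saturates to 1.984375
```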
4.5.6. Quantization
SVM and DNN model parameters are often represented using 32-bit floating-point values [92,158], which is highly computationally and memory intensive. However, research shows that these models can be implemented efficiently using low-precision parameters (8-bit or 16-bit) with minimal accuracy loss [113,169]. Quantization describes techniques aimed at reducing the bit width of the weights and activations of a machine learning model to reduce the memory storage and communication overhead required for computation. This process thereby reduces the bandwidth required for communication, as well as the overall power consumption, area, and circuitry required to implement the design. Many research works have considered different quantization techniques for deep learning models. Courbariaux et al. [170] consider training a deep model using a binary representation (+1 and −1) of the model parameters, using the binarization function given in Equation (15). Also, [166] proposes a quantization scheme using ternary values (+1, 0, −1); the proposed equations are given in Equations (16) and (17). Other quantization techniques involving Bayesian quantization, weighted entropy-based quantization, vector quantization, and two-bit networks are adopted in [50,51,158], and [171], respectively. Although quantization techniques increase execution speed, the algorithm requires fine-tuning to avoid accuracy loss [172,173]:
x_b = sign(x) = +1 if x ≥ 0, and −1 otherwise,  (15)

where x_b is the binarized variable (weights and activations) and x is the real-valued variable.
where w̃ is the new ternarized weight, η is the learning rate, E is the output error, y is the output signal, and δ is the error signal.
where ỹ are the new ternarized activation functions and y is the output signal.
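To complement Equation (15), the sketch below shows deterministic binarization alongside a simple symmetric 8-bit linear quantization of a weight tensor; the scale is derived from the largest weight magnitude, and the data are illustrative. This is a generic illustration of the idea, not the exact scheme of any cited work.

```python
import numpy as np

def binarize(x):
    """Deterministic binarization as in Equation (15): +1 if x >= 0, else -1."""
    return np.where(x >= 0, 1.0, -1.0)

def quantize_int8(w):
    """Symmetric linear quantization of weights to 8-bit integers."""
    scale = np.max(np.abs(w)) / 127.0      # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale                         # dequantize with q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(5)                  # illustrative real-valued weights
print(binarize(w))
q, scale = quantize_int8(w)
print(q, q * scale)   # int8 weights and their reconstructed float values
```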
5. Areas of Applications of Intelligent Embedded Systems
Indeed, embedded systems already have numerous application areas, and the breakthrough of machine learning methods has further widened and deepened the range of applications where embedded systems and machine learning methods are currently actively adopted. Some areas of application we consider in this survey are intelligent sensor systems and IoT, deep learning in mobile devices, deep learning training using general-purpose GPUs, deep learning in heterogeneous computing systems, embedded field programmable gate arrays, and energy-efficient hardware designs and architectures.
5.1. Intelligent Sensor Systems and IoTs
There is growing interest in the efficient implementation of machine learning algorithms within the embedded environments of sensor networks and the Internet of Things. Diverse machine learning algorithms such as SVMs, GMMs, and DNNs are finding useful applications in cogent areas such as the configuration of mobile networks, analysis of sensor data, power consumption management, etc. A list of applications of machine learning executed within these environments is presented in Table 18. Although machine learning techniques have found useful applications in embedded systems domains, a major drawback remains the limited computational and memory resources available in embedded computing systems.
Table 18.
Intelligent sensor systems and IoTs.
5.2. Deep Learning In Mobile Devices
DNNs are finding very useful applications in mobile devices for speech recognition, computer vision, natural language processing, indoor navigation systems, etc. A list of application areas of deep learning models in mobile devices is presented in Table 19. The computational and memory demands of training and inferencing deep learning models make current mobile devices unsuitable for these models; thus, most research is directed towards running only the inference on mobile devices. There are many energy-intensive applications on current mobile devices competing for the limited available power, and thus more research is being carried out to optimize these deep learning models so they can efficiently fit within mobile devices.
Table 19.
Deep learning in mobile devices.
5.3. Deep Learning Training Using Graphic Processors (GPUs)
DNNs are computationally intensive. The state-of-the-art hardware for training deep learning models is the graphics processor, because of its high degree of parallel processing and high floating-point capability. Deep learning algorithms are largely dependent on parallel processing operations, which GPUs are adequately designed to target. Although GPGPUs offer very good performance, they are highly power-hungry and expensive, which makes them unsuitable for embedded systems design and development. This owes to the fact that key design metrics for embedded devices are that they must consume very low power and must be economical. Table 20 presents an area of application in training deep learning models on GPGPUs.
Table 20.
Deep learning training in general purpose graphic processing units (GPGPUs).
5.4. Deep Learning Using Heterogeneous Computing Systems
Multicore and many-core architectural enhancements are modifications made by computer architects to address the performance wall and memory wall facing CPU technology. Multicore technology is also called homogeneous computing, while many-core architectures are used in heterogeneous computing systems. Heterogeneous computing systems are systems with more than one type of processor core. Most heterogeneous computing systems are used as acceleration units for offloading computationally intensive operations from the CPU, thereby increasing the system's overall execution speed. Table 21 presents an area of application of deep learning training in heterogeneous computing systems. A critical drawback of heterogeneous computing systems centers on the sharing of memory resources, data buses, etc.; if designed inefficiently, this can result in data traffic congestion and thus increase latency and power consumption.
Table 21.
Deep Learning in heterogeneous computing systems.
5.5. Embedded Field Programmable Gate Arrays (FPGAs)
FPGAs are gaining popularity in the computing world due to their low cost, high performance, energy efficiency, and flexibility. They are often used to design acceleration units and to prototype ASIC architectures before fabrication. Table 22 presents certain areas of application where FPGA architectures are adopted to accelerate deep learning model execution. FPGAs, although programmable and reconfigurable, are difficult to program, and this is a critical limitation to their ubiquitous utilization in embedded computing design.
Table 22.
Embedded FPGAs: optimization and throughput.
5.6. Energy Efficient Hardware Design and Architectures
The urgency for novel energy-efficient hardware designs cannot be overemphasized. Table 23 presents different application-specific architectures and current research issues involved in targeting the high-performance applications required in this big data era. Application-specific architectures, although highly efficient, are difficult to design and implement and have a long time-to-market. Their good performance, owing to their specificity, makes them very suitable for embedded machine learning applications.
Table 23.
Energy-efficient hardware design and architectures.
6. Research Directions and Open Issues
Embedded machine learning research is still in its early days. Thus, there remains a large range of opportunities to explore in this research direction, which is critical to the development of IoT devices of the future. Some research directions in the areas of Computer Architecture, Deep Learning Optimizations, Hardware Security, Energy Efficiency, and Power Management are presented in Table 24. Additionally, the key lessons learned are highlighted in Section 6.1.
Table 24.
Future research directions.
6.1. Lessons Learned
In this section, we present a comprehensive summary of the lessons learned from this survey. Our summary covers embedded systems computing architectures, machine learning techniques, deep learning models, optimization techniques, and energy efficiency and power management techniques.
Lesson one: As of today, even the most expensive embedded platforms lack the computational and memory capacity to execute demanding machine learning algorithms efficiently. Bringing these models into the embedded space, where they become part of everyday life in mobile devices and other IoT nodes, therefore requires substantial hardware architectural modifications and algorithm optimizations. To address overall efficiency properly, optimization approaches ought to tackle the key performance constraints of embedded systems: low power consumption, limited memory footprint, and latency and throughput requirements. Representative optimization techniques include network pruning, data quantization, tiling, and layer acceleration; a pruning sketch is given below.
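As a minimal illustration of one of these techniques, the sketch below applies magnitude-based weight pruning to a single weight matrix. It is only a conceptual Python/NumPy sketch; the layer size, the 80% sparsity target, and the helper name `magnitude_prune` are illustrative assumptions rather than details taken from any surveyed work.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

# Prune a hypothetical 256x256 fully connected layer to roughly 80% sparsity.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.8)
print("non-zero parameters before:", np.count_nonzero(w))
print("non-zero parameters after: ", np.count_nonzero(w_pruned))
```

The zeroed weights can then be stored in a sparse format or skipped by the compute logic, which is where the memory and energy savings come from.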
Lesson two: The hardware architectural modifications required to accelerate state-of-the-art deep learning and other machine learning models depend strongly on the model architecture. For example, a CNN is computation-centric because its many convolution layers reuse a small set of weights across a large number of operations, whereas a fully connected DNN is memory-centric because it places great demands on memory for both parameter storage and bandwidth. Thus, hardware acceleration units must be developed with a sound understanding of the algorithm architecture so that the right operations are accelerated, whether the convolutions of convolutional layers or the multiply-and-accumulate operations of fully connected layers. This improves the overall efficiency of the hardware architecture; the short calculation below illustrates the contrast.
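The following back-of-the-envelope calculation illustrates why convolution layers tend to be compute-bound while fully connected layers tend to be memory-bound. The layer dimensions are hypothetical and chosen only for illustration.

```python
def conv2d_cost(h, w, c_in, c_out, k):
    """Parameters and multiply-accumulates (MACs) of one stride-1, 'same' conv layer."""
    params = k * k * c_in * c_out
    macs = params * h * w          # every filter weight is reused at every output position
    return params, macs

def fc_cost(n_in, n_out):
    """Parameters and MACs of one fully connected layer (each weight is used once)."""
    params = n_in * n_out
    return params, params

# Hypothetical layers, loosely sized like an early CNN stage and a large classifier head.
conv_params, conv_macs = conv2d_cost(h=56, w=56, c_in=64, c_out=64, k=3)
fc_params, fc_macs = fc_cost(n_in=4096, n_out=4096)
print(f"conv: {conv_params:,} params, {conv_macs:,} MACs  -> compute-centric")
print(f"fc  : {fc_params:,} params, {fc_macs:,} MACs  -> memory-centric")
```

For these assumed sizes, the convolution layer has roughly 37 thousand parameters but over 100 million MACs, whereas the fully connected layer has about 17 million parameters, each touched exactly once per inference.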
Lesson three: Most machine learning models are much larger than standard embedded on-chip and off-chip memories. To address this memory concern, network model pruning may be used to reduce the number of parameters so that they fit within the available memory, and data quantization may be used to reduce the bit precision of the network parameters so that they fit into standard DRAM and SRAM sizes. In addition, direct memory access (DMA) units may be adopted to reduce the latency of transferring data from external memory to the processing logic and thereby sustain high execution speeds. A quantization sketch follows.
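A minimal sketch of post-training data quantization is given below: float32 weights are mapped to 8-bit integers plus a single scale factor, shrinking the storage of the weight matrix by a factor of four. The symmetric scheme, matrix size, and function names are assumptions made for illustration only.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric post-training quantization: float32 weights -> int8 values plus one scale."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(scale=0.05, size=(128, 128)).astype(np.float32)
q, s = quantize_int8(w)
print("float32 storage:", w.nbytes, "bytes")   # 65,536 bytes
print("int8 storage:   ", q.nbytes, "bytes")   # 16,384 bytes (4x smaller)
print("max abs error:  ", float(np.abs(w - dequantize(q, s)).max()))
```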
Lesson four: The energy-efficiency bottleneck is caused primarily by the type of computation carried out and by the power required to fetch parameters from off-chip memory. Optimizations therefore aim to reduce the total number of parameters so that data can be kept as close as possible to the compute units; typical model compression techniques include network pruning, weight clustering, and data quantization. Note, however, that bit reduction through quantization introduces errors into the prediction process, so precision should not be reduced below the threshold that preserves acceptable model accuracy. To address these accuracy concerns, the quantized model may be retrained and fine-tuned to restore prediction confidence. A weight-clustering sketch is shown below.
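The sketch below illustrates weight clustering (weight sharing) with a simple 1-D k-means: the layer stores only small cluster indices plus a 16-entry codebook instead of one float per weight. The cluster count, layer size, and implementation are illustrative assumptions, not the method of any specific surveyed paper.

```python
import numpy as np

def cluster_weights(weights: np.ndarray, n_clusters: int = 16, iters: int = 20):
    """Share weights via 1-D k-means: per-weight cluster indices plus a small codebook."""
    flat = weights.ravel()
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(iters):
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for c in range(n_clusters):
            members = flat[idx == c]
            if members.size:
                codebook[c] = members.mean()
    return idx.astype(np.uint8).reshape(weights.shape), codebook

rng = np.random.default_rng(2)
w = rng.normal(size=(64, 64)).astype(np.float32)
idx, codebook = cluster_weights(w)
w_shared = codebook[idx]   # approximate reconstruction from indices + codebook
print("distinct weight values after clustering:", np.unique(w_shared).size)
```

With 16 clusters, each index needs only 4 bits in principle, so the stored weights shrink by roughly 8x relative to float32 (plus a negligible codebook), at the cost of an approximation error that subsequent fine-tuning can partially recover.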
Lesson five: To tackle the latency caused by off-chip memory transfers, model parameters may be cached on-chip for data reuse. This can be done with techniques such as tiling or simple vector decomposition, where input data are partitioned into blocks (tiles) that fit into on-chip memory and are reused for computation when required. This avoids frequent off-chip memory transfers, which are a major source of both latency and power consumption. Hardware acceleration units may integrate a tiling unit to carry out this operation at the hardware level. Other techniques that sustain high throughput include pipelining, on-chip buffer optimization, and data access optimization. A tiled matrix-multiplication sketch is given below.
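A minimal sketch of tiling is shown below as a blocked matrix multiplication: at any moment only three small tiles need to be resident in fast (on-chip) memory, and each tile is reused many times before being evicted. The tile size and matrix shapes are arbitrary illustrative choices.

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 32) -> np.ndarray:
    """Blocked matrix multiply: only three (tile x tile) blocks are 'live' at a time,
    so they can stay in on-chip memory and be reused instead of re-fetched."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

rng = np.random.default_rng(3)
a = rng.normal(size=(128, 96)).astype(np.float32)
b = rng.normal(size=(96, 160)).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
print("tiled result matches the reference matmul")
```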
Lesson six: Although hardware acceleration using custom FPGA logic, GPUs, or CPUs addresses the demand for compute power, the most promising solution is to develop application-specific architectures using ASICs. Every processor architecture has its pros and cons: FPGAs are energy efficient and reconfigurable but comparatively slow and hard to program; GPUs offer high performance but are power-hungry; general-purpose CPUs are flexible but slow for ML computations. Of these architectures, ASICs deliver the best energy efficiency because they are hardwired for a specific application and consume very little power, with a low per-unit cost at volume. They, however, trade flexibility for performance and have a long time-to-market. ASICs are thus gaining renewed interest in the design and development of application-specific machine learning architectures, with the Google TPU being a successful case study.
7. Conclusions
Machine learning models are fast proliferating across embedded devices with limited computational power and memory space. These models are compute and memory intensive and thus face the critical limitation of the hardware resources available in embedded and mobile devices. In this paper, optimization techniques and various applications of machine learning algorithms within resource-limited environments are presented. We first survey the embedded machine learning space to determine the commonly adopted machine learning algorithms and select key compute- and memory-intensive models such as HMMs, k-NNs, SVMs, GMMs, and DNNs. We then survey specialized optimization techniques commonly adopted to squeeze these algorithms into resource-limited environments, and we present the hardware platforms, such as microcontroller units, mobile devices, and accelerators, together with the TinyML frameworks used to port these algorithms to resource-limited MCUs. Furthermore, we survey the challenges encountered in embedded machine learning and give a more detailed exposition of hardware-oriented and algorithm-oriented optimization schemes, including model pruning, data quantization, reduced precision, and tiling, to determine which optimization technique best suits the different ML algorithms. Interesting and viable application areas, open research issues, and key take-away lessons at this intersection of embedded systems and machine learning are presented. Finally, this survey aims to provide interested researchers with a starting point for exploring the promising landscape of embedded machine learning.
Author Contributions
T.S.A. and A.L.I. were responsible for the Conceptualization of the topic; Article gathering and sorting were done by T.S.A. and A.L.I.; Manuscript writing and original drafting and formal analysis were carried out by T.S.A. and A.L.I.; Writing of reviews and editing was done by A.L.I. and A.A.A.; A.L.I. led the overall research activity. All authors have read and agreed to the published version of the manuscript.
Funding
Agbotiname Lucky Imoize is supported by the Nigerian Petroleum Technology Development Fund (PTDF) and the German Academic Exchange Service (DAAD) through the Nigerian-German Postgraduate Program under Grant 57473408.
Institutional Review Board Statement
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent Statement
Not applicable.
Data Availability Statement
Data sharing does not apply to this article.
Acknowledgments
This work was carried out in collaboration with the IoT-enabled Smart and Connected Communities (SmartCU) Research Cluster of Covenant University. The Article Processing Charge was sponsored by Covenant University Centre for Research, Innovation, and Development (CUCRID), Covenant University, Ota, Nigeria.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
| Abbreviation | Full Meaning |
| --- | --- |
| ANN | Artificial Neural Network |
| ASIC | Application Specific Integrated Circuit |
| ASIP | Application-Specific Instruction-set Processor |
| CPU | Central Processing Unit |
| CNN | Convolutional Neural Network |
| DCNN | Deep Convolutional Neural Network |
| DMA | Direct Memory Access |
| DNN | Deep Neural Network |
| DSA | Domain-Specific Architectures |
| DSP | Digital Signal Processor |
| EML | Embedded Machine Learning |
| FPAA | Field Programmable Analog Array |
| FPGA | Field Programmable Gate Array |
| FC | Fully Connected |
| GPGPU | General Purpose Graphic Processing Unit |
| GMM | Gaussian Mixture Model |
| HMM | Hidden Markov Model |
| IC | Integrated Circuit |
| I/O | Input/output |
| ISA | Instruction Set Architecture |
| ISF | Intelligent Sensor Framework |
| k-NN | k-Nearest Neighbors |
| LRN | Local Response Normalization |
| LSTM | Long Short Term Memory |
| IoT | Internet of Things |
| MANET | Mobile Ad Hoc Network |
| MCU | Microcontroller Unit |
| PE | Processing Element |
| RAM | Random Access Memory |
| RNN | Recurrent Neural Network |
| SoC | System on Chip |
| SVM | Support Vector Machine |
| SGD | Stochastic Gradient Descent |
| SVD | Singular Value Decomposition |
| TPU | Tensor Processing Unit |
| WSN | Wireless Sensor Network |
References
- Wolf, W. High-Performance Embedded Computing: Architectures, Applications, and Methodologies; Morgan Kaufmann Publishers: San Francisco, CA, USA, 2007. [Google Scholar]
- Haigh, K.Z.; Mackay, A.M.; Cook, M.R.; Lin, L.G. Machine Learning for Embedded Systems: A Case Study; BBN Technologies: Cambridge, MA, USA, 2015; Volume 8571, pp. 1–12. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Computer Society Conference Computer Vision Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y.L.; Tan, J.; Le, Q.V.; Kurakin, A. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference Machine Learning ICML, Sydney, Australia, 6–11 August 2017; pp. 4429–4446. [Google Scholar]
- Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference Machine Learning ICML 2019, Long Beach, CA, USA, 10–15 June 2019; pp. 10691–10700. [Google Scholar]
- Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal. Process. Mag. 2012, 29, 82–97. [Google Scholar] [CrossRef]
- Chan, W.; Jaitly, N.; Le, Q.V.; Vinyals, O. Listen, attend and spell. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016. [Google Scholar]
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the Gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
- Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537. [Google Scholar]
- Haj, R.B.; Orfanidis, C. A discreet wearable long-range emergency system based on embedded machine learning. In Proceedings of the 2021 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), Kassel, Germany, 22–26 March 2021. [Google Scholar]
- Dean, J. The deep learning revolution and its implications for computer architecture and chip design. In Proceedings of the 2020 IEEE International Solid-State Circuits Conference-(ISSCC), San Francisco, CA, USA, 16–20 February 2020; pp. 8–14. [Google Scholar] [CrossRef]
- Cui, X.; Liu, H.; Fan, M.; Ai, B.; Ma, D.; Yang, F. Seafloor habitat mapping using multibeam bathymetric and backscatter intensity multi-features SVM classification framework. Appl. Acoust. 2020, 174, 107728. [Google Scholar] [CrossRef]
- Khan, M.A.; Kim, J. Toward developing efficient Conv-AE-based intrusion detection system using heterogeneous dataset. Electronics 2020, 9, 1771. [Google Scholar] [CrossRef]
- Li, P.; Luo, Y.; Zhang, N.; Cao, Y. HeteroSpark: A heterogeneous CPU/GPU spark platform for machine learning algorithms. In Proceedings of the 2015 IEEE International Conference Networking, Architecture Storage, NAS, Boston, MA, USA, 6–7 August 2015; pp. 347–348. [Google Scholar] [CrossRef]
- Raparti, V.Y.; Pasricha, S. RAPID: Memory-aware NoC for latency optimized GPGPU architectures. IEEE Trans. Multi-Scale Comput. Syst. 2018, 4, 874–887. [Google Scholar] [CrossRef]
- Cheng, X.; Zhao, Y.; Robaei, M.; Jiang, B.; Zhao, H.; Fang, J. A low-cost and energy-efficient noc architecture for GPGPUs. J. Nat. Gas Geosci. 2019, 4, 1–28. [Google Scholar] [CrossRef]
- Zhang, L.; Cheng, X.; Zhao, H.; Mohanty, S.P.; Fang, J. Exploration of system configuration in effective training of CNNs on GPGPUs. In Proceedings of the 2019 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 11 January 2019; pp. 1–4. [Google Scholar] [CrossRef]
- Yu, Q.; Wang, C.; Ma, X.; Li, X.; Zhou, X. A deep learning prediction process accelerator based FPGA. In Proceedings of the 2015 IEEE/ACM 15th International Symposium Cluster Cloud, Grid Computer CCGrid 2015, Shenzhen, China, 4–7 May 2015; pp. 1159–1162. [Google Scholar] [CrossRef]
- Noronha, D.H.; Zhao, R.; Goeders, J.; Luk, W.; Wilton, S.J.E. On-chip FPGA debug instrumentation for machine learning applications. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 24–26 February 2019. [Google Scholar] [CrossRef]
- Wang, C.; Gong, L.; Yu, Q.; Li, X.; Xie, Y.; Zhou, X. DLAU: A scalable deep learning accelerator unit on FPGA. IEEE Trans. Comput. Des. Integr. Circuits Syst. 2016, 36, 513–517. [Google Scholar] [CrossRef]
- Chang, A.X.M.; Martini, B.; Culurciello, E. Recurrent Neural Networks Hardware Implementation on FPGA. Available online: http://arxiv.org/abs/1511.05552 (accessed on 15 January 2021).
- Branco, S.; Ferreira, A.G.; Cabral, J. Machine learning in resource-scarce embedded systems, FPGAs, and end-devices: A survey. Electronics 2019, 8, 1289. [Google Scholar] [CrossRef]
- Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170. [Google Scholar] [CrossRef]
- Neshatpour, K.; Mokrani, H.M.; Sasan, A.; Ghasemzadeh, H.; Rafatirad, S.; Homayoun, H. Architectural considerations for FPGA acceleration of machine learning applications in MapReduce. In Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, Pythagorion, Greece, 15–19 July 2018. [Google Scholar] [CrossRef]
- Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level Accuracy With 50× Fewer Parameters and <0.5 mb Model Size. Available online: http://arxiv.org/abs/1602.07360 (accessed on 15 February 2021).
- Deng, Y. Deep learning on mobile devices: A review. In Proceedings of the SPIE 10993, Mobile Multimedia/Image Processing, Security, and Applications 2019, 109930A, Baltimore, MD, USA, 14–18 April 2019. [Google Scholar] [CrossRef]
- Kim, D.; Ahn, J.; Yoo, S. A novel zero weight/activation-aware hardware architecture of convolutional neural network. In Proceedings of the 2017 Design, Automation and Test in Europe DATE 2017, Lausanne, Switzerland, 27–31 March 2017; pp. 1462–1467. [Google Scholar] [CrossRef]
- Lecun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [Google Scholar] [CrossRef]
- Jawandhiya, P. Hardware design for machine learning. Int. J. Artif. Intell. Appl. 2018, 9, 1–6. [Google Scholar] [CrossRef]
- Chen, J.; Ran, X. Deep learning with edge computing: A review. Proc. IEEE 2019, 107, 1655–1674. [Google Scholar] [CrossRef]
- Frank, M.; Drikakis, D.; Charissis, V. Machine-learning methods for computational science and engineering. Computation 2020, 8, 15. [Google Scholar] [CrossRef]
- Xiong, Z.; Zhang, Y.; Niyato, D.; Deng, R.; Wang, P.; Wang, L.C. Deep reinforcement learning for mobile 5G and beyond: Fundamentals, applications, and challenges. IEEE Veh. Technol. Mag. 2019, 14, 44–52. [Google Scholar] [CrossRef]
- Carbonell, J.G. Machine learning research. ACM SIGART Bull. 1981, 18, 29. [Google Scholar] [CrossRef]
- Jadhav, S.D.; Channe, H.P. Comparative study of k-NN, Naive Bayes and decision tree classification techniques. Int. J. Sci. Res. 2016, 5, 1842–1845. [Google Scholar]
- Chapter 4 Logistic Regression as a Classifier. Available online: https://www.cs.cmu.edu/~kdeng/thesis/logistic.pdf (accessed on 29 December 2020).
- Salvadori, C.; Petracca, M.; del Rincon, J.M.; Velastin, S.A.; Makris, D. An optimisation of Gaussian mixture models for integer processing units. J. Real Time Image Process. 2017, 13, 273–289. [Google Scholar] [CrossRef]
- Das, A.; Borisov, N.; Caesar, M. Do you hear what i hear? Fingerprinting smart devices through embedded acoustic components. In Proceedings of the ACM Conference on Computer, Communication and Security, Scottsdale, AZ, USA, 3–7 November 2014; pp. 441–452. [Google Scholar] [CrossRef]
- Bojinov, H.; Michalevsky, Y.; Nakibly, G.; Boneh, D. Mobile Device Identification via Sensor Fingerprinting. Available online: http://arxiv.org/abs/1408.1416 (accessed on 12 January 2021).
- Huynh, M.; Nguyen, P.; Gruteser, M.; Vu, T. Mobile device identification by leveraging built-in capacitive signature. In Proceedings of the ACM Conference on Compututer, Communication and Security, Denver, CO, USA, 12–16 October 2015; pp. 1635–1637. [Google Scholar] [CrossRef]
- Dhar, S.; Sreeraj, K.P. FPGA implementation of feature extraction based on histopathological image and subsequent classification by support vector machine. IJISET Int. J. Innov. Sci. Eng. Technol. 2015, 2, 744–749. [Google Scholar]
- Yu, L.; Ukidave, Y.; Kaeli, D. GPU-accelerated HMM for speech recognition. In Proceedings of the International Conference Parallel Processing Work, Minneapolis, MN, USA, 9–12 September 2014; pp. 395–402. [Google Scholar] [CrossRef]
- Zubair, M.; Yoon, C.; Kim, H.; Kim, J.; Kim, J. Smart wearable band for stress detection. In Proceedings of the 2015 5th International Conference IT Converg. Secur. ICITCS, Kuala Lumpur, Malaysia, 24–27 August 2015; pp. 1–4. [Google Scholar] [CrossRef]
- Razavi, A.; Valkama, M.; Lohan, E.S. K-means fingerprint clustering for low-complexity floor estimation in indoor mobile localization. In Proceedings of the 2015 IEEE Globecom Work. GC Wkshps 2015, San Diego, CA, USA, 6–10 December 2015. [Google Scholar] [CrossRef]
- Bhide, V.H.; Wagh, S. I-learning IoT: An intelligent self learning system for home automation using IoT. In Proceedings of the 2015 International Conference Communication Signalling Process. ICCSP 2015, Melmaruvathur, India, 2–4 April 2015; pp. 1763–1767. [Google Scholar] [CrossRef]
- Munisami, T.; Ramsurn, M.; Kishnah, S.; Pudaruth, S. Plant Leaf recognition using shape features and colour histogram with K-nearest neighbour classifiers. Proc. Comput. Sci. 2015, 58, 740–747. [Google Scholar] [CrossRef]
- Sowjanya, K.; Singhal, A.; Choudhary, C. MobDBTest: A machine learning based system for predicting diabetes risk using mobile devices. In Proceedings of the Souvenir 2015 IEEE Int. Adv. Comput. Conference IACC 2015, Bangalore, India, 12–13 June 2015; pp. 397–402. [Google Scholar] [CrossRef]
- Lee, J.; Stanley, M.; Spanias, A.; Tepedelenlioglu, C. Integrating machine learning in embedded sensor systems for Internet-of-Things applications. In Proceedings of the 2016 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Limassol, Cyprus, 12–14 December 2016; pp. 290–294. [Google Scholar] [CrossRef]
- Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the FPGA 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 26–35. [Google Scholar] [CrossRef]
- Huynh, L.N.; Balan, R.K.; Lee, Y. DeepSense: A GPU-based deep convolutional neural network framework on commodity mobile devices. In Proceedings of the Workshop on Wearable Systems and Application Co-Located with MobiSys 2016, Singapore, 30 June 2016; pp. 25–30. [Google Scholar] [CrossRef]
- Tuama, A.; Comby, F.; Chaumont, M. Camera model identification based machine learning approach with high order statistics features. In Proceedings of the 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 29 August–2 September 2016; pp. 1183–1187. [Google Scholar] [CrossRef]
- Kurtz, A.; Gascon, H.; Becker, T.; Rieck, K.; Freiling, F. Fingerprinting Mobile Devices Using Personalized Configurations. Proc. Priv. Enhanc. Technol. 2016, 1, 4–19. [Google Scholar] [CrossRef]
- Mohsin, M.A.; Perera, D.G. An FPGA-based hardware accelerator for k-nearest neighbor classification for machine learning on mobile devices. In Proceedings of the ACM International Conference Proceeding Series, HEART 2018, Toronto, ON, Canada, 20–22 June 2018; pp. 6–12. [Google Scholar] [CrossRef]
- Patil, S.S.; Thorat, S.A. Early detection of grapes diseases using machine learning and IoT. In Proceedings of the 2016 Second International Conference on Cognitive Computing and Information Processing (CCIP), Mysuru, India, 12–13 August 2016. [Google Scholar] [CrossRef]
- Ollander, S.; Godin, C.; Campagne, A.; Charbonnier, S. A comparison of wearable and stationary sensors for stress detection. In Proceedings of the IEEE International Conference System Man, and Cybernetic SMC 2016, Budapest, Hungary, 9–12 October 2016; pp. 4362–4366. [Google Scholar] [CrossRef]
- Moreira, M.W.L.; Rodrigues, J.J.P.C.; Oliveira, A.M.B.; Saleem, K. Smart mobile system for pregnancy care using body sensors. In Proceedings of the International Conference Sel. Top. Mob. Wirel. Networking, MoWNeT 2016, Cairo, Egypt, 11–13 April 2016; pp. 1–4. [Google Scholar] [CrossRef]
- Shapsough, S.; Hesham, A.; Elkhorazaty, Y.; Zualkernan, I.A.; Aloul, F. Emotion recognition using mobile phones. In Proceedings of the 2016 IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom), Munich, Germany, 14–16 September 2016; pp. 276–281. [Google Scholar] [CrossRef]
- Hakim, A.; Huq, M.S.; Shanta, S.; Ibrahim, B.S.K.K. Smartphone based data mining for fall detection: Analysis and design. Proc. Comput. Sci. 2016, 105, 46–51. [Google Scholar] [CrossRef]
- Ronao, C.A.; Cho, S.B. Recognizing human activities from smartphone sensors using hierarchical continuous hidden Markov models. Int. J. Distrib. Sens. Netw. 2017, 13, 1–16. [Google Scholar] [CrossRef]
- Kodali, S.; Hansen, P.; Mulholland, N.; Whatmough, P.; Brooks, D.; Wei, G.Y. Applications of deep neural networks for ultra low power IoT. In Proceedings of the 35th IEEE International Conference on Computer Design ICCD 2017, Boston, MA, USA, 5–8 November 2017; pp. 589–592. [Google Scholar] [CrossRef]
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
- Baldini, G.; Dimc, F.; Kamnik, R.; Steri, G.; Giuliani, R.; Gentile, C. Identification of mobile phones using the built-in magnetometers stimulated by motion patterns. Sensors 2017, 17, 783. [Google Scholar] [CrossRef] [PubMed]
- Azimi, I.; Anzanpour, A.; Rahmani, A.M.; Pahikkala, T.; Levorato, M.; Liljeberg, P.; Dutt, N. HiCH: Hierarchical fog-assisted computing architecture for healthcare IoT. ACM Trans. Embed. Comput. Syst. 2017, 16, 1–20. [Google Scholar] [CrossRef]
- Pandey, P.S. Machine Learning and IoT for prediction and detection of stress. In Proceedings of the 17th International Conference on Computational Science and Its Applications ICCSA 2017, Trieste, Italy, 3–6 July 2017. [Google Scholar] [CrossRef]
- Sneha, H.R.; Rafi, M.; Kumar, M.V.M.; Thomas, L.; Annappa, B. Smartphone based emotion recognition and classification. In Proceedings of the 2nd IEEE International Conference on Electrical, Computer and Communication Technology ICECCT 2017, Coimbatore, India, 22–24 February 2017. [Google Scholar] [CrossRef]
- Al Mamun, M.A.; Puspo, J.A.; Das, A.K. An intelligent smartphone based approach using IoT for ensuring safe driving. In Proceedings of the 2017 International Conference on Electrical Engineering and Computer Science (ICECOS), Palembang, Indonesia, 22–23 August 2017; pp. 217–223. [Google Scholar] [CrossRef]
- Neyja, M.; Mumtaz, S.; Huq, K.M.S.; Busari, S.A.; Rodriguez, J.; Zhou, Z. An IoT-based e-health monitoring system using ECG signal. In Proceedings of the IEEE Global Communications Conference GLOBECOM 2017, Singapore, 4–8 December 2017; pp. 1–6. [Google Scholar] [CrossRef]
- Gupta, C.; Suggala, A.S.; Goyal, A.; Simhadri, H.V.; Paranjape, B.; Kumar, A.; Goyal, S.; Udupa, R.; Varma, M.; Jain, P. ProtoNN: Compressed and accurate kNN for resource-scarce devices. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1331–1340. [Google Scholar]
- Fafoutis, X.; Marchegiani, L.; Elsts, A.; Pope, J.; Piechocki, R.; Craddock, I. Extending the battery lifetime of wearable sensors with embedded machine learning. In Proceedings of the IEEE World Forum on Internet Things, WF-IoT 2018, Singapore, 5–8 February 2018; pp. 269–274. [Google Scholar] [CrossRef]
- Damljanovic, A.; Lanza-Gutierrez, J.M. An embedded cascade SVM approach for face detection in the IoT edge layer. In Proceedings of the IECON 2018—44th Annual Conference of the IEEE Industrial Electronics Society, Washington, DC, USA, 21–23 October 2018; pp. 2809–2814. [Google Scholar] [CrossRef]
- Hochstetler, J.; Padidela, R.; Chen, Q.; Yang, Q.; Fu, S. Embedded deep learning for vehicular edge computing. In Proceedings of the 3rd ACM/IEEE Symposium on Edge Computing SEC 2018, Seattle, WA, USA, 25–27 October 2018; pp. 341–343. [Google Scholar] [CrossRef]
- Taylor, B.; Marco, V.S.; Wolff, W.; Elkhatib, Y.; Wang, Z. Adaptive deep learning model selection on embedded systems. ACM SIGPLAN Not. 2018, 53, 31–43. [Google Scholar] [CrossRef]
- Strielkina, A.; Kharchenko, V.; Uzun, D. A Markov model of healthcare internet of things system considering failures of components. CEUR Workshop Proc. 2018, 2104, 530–543. [Google Scholar]
- Vhaduri, S.; van Kessel, T.; Ko, B.; Wood, D.; Wang, S.; Brunschwiler, T. Nocturnal cough and snore detection in noisy environments using smartphone-microphones. In Proceedings of the IEEE International Conference on Healthcare Informatics, ICHI 2019, Xi’an, China, 10–13 June 2019. [Google Scholar] [CrossRef]
- Sattar, H.; Bajwa, I.S.; Amin, R.U.; Sarwar, N.; Jamil, N.; Malik, M.A.; Mahmood, A.; Shafi, U. An IoT-based intelligent wound monitoring system. IEEE Access 2019, 7, 144500–144515. [Google Scholar] [CrossRef]
- Mengistu, D.; Frisk, F. Edge machine learning for energy efficiency of resource constrained IoT devices. In Proceedings of the Fifth International Conference on Smart Portable, Wearable, Implantable and Disability-Oriented Devices and Systems, SPWID 2019, Nice, France, 28 July–1 August 2019; pp. 9–14. [Google Scholar]
- Wang, S.; Tuor, T.; Salonidis, T.; Leung, K.K.; Makaya, C.; He, T.; Chan, K. Adaptive Federated Learning in Resource Constrained Edge Computing Systems. IEEE J. Sel. Areas Commun. 2019, 37, 1205–1221. [Google Scholar] [CrossRef]
- Suresh, P.; Fernandez, S.G.; Vidyasagar, S.; Kalyanasundaram, V.; Vijayakumar, K.; Archana, V.; Chatterjee, S. Reduction of transients in switches using embedded machine learning. Int. J. Power Electron. Drive Syst. 2020, 11, 235–241. [Google Scholar] [CrossRef]
- Giri, D.; Chiu, K.L.; di Guglielmo, G.; Mantovani, P.; Carloni, L.P. ESP4ML: Platform-based design of systems-on-chip for embedded machine learning. In Proceedings of the 2020 Design, Automation and Test in European Conference Exhibition DATE 2020, Grenoble, France, 9–13 March 2020; pp. 1049–1054. [Google Scholar] [CrossRef]
- Tiku, S.; Pasricha, S.; Notaros, B.; Han, Q. A hidden markov model based smartphone heterogeneity resilient portable indoor localization framework. J. Syst. Archit. 2020, 108, 101806. [Google Scholar] [CrossRef]
- Mazlan, N.; Ramli, N.A.; Awalin, L.; Ismail, M.; Kassim, A.; Menon, A. A smart building energy management using internet of things (IoT) and machine learning. Test. Eng. Manag. 2020, 83, 8083–8090. [Google Scholar]
- Cornetta, G.; Touhafi, A. Design and evaluation of a new machine learning framework for iot and embedded devices. Electronics 2021, 10, 600. [Google Scholar] [CrossRef]
- Rabiner, L.; Juang, B. An introduction to hidden Markov models. IEEE ASSP Mag. 1986, 3, 4–16. [Google Scholar] [CrossRef]
- Degirmenci, A. Introduction to hidden Markov models. Harv. Univ. 2014, 3, 1–5. Available online: http://scholar.harvard.edu/files/adegirmenci/files/hmm_adegirmenci_2014.pdf (accessed on 10 October 2016).
- Tóth, B.; Németh, G. Optimizing HMM speech synthesis for low-resource devices. J. Adv. Comput. Intell. Intell. Inform. 2012, 16, 327–334. [Google Scholar] [CrossRef]
- Fu, R.; Zhao, Z.; Tu, Q. Reducing computational and memory cost for HMM-based embedded TTS system. Commun. Comput. Inf. Sci. 2011, 224, 602–610. [Google Scholar] [CrossRef]
- Baoli, L.; Shiwen, Y.; Qin, L. An improved K-nearest neighbor algorithm for text categorization. Dianzi Yu Xinxi Xuebao J. Electron. Inf. Technol. 2005, 27, 487–491. [Google Scholar]
- Norouzi, M.; Fleet, D.J.; Salakhutdinov, R. Hamming distance metric learning. Adv. Neural Inf. Process. Syst. 2012, 2, 1061–1069. [Google Scholar]
- Saikia, J.; Yin, S.; Jiang, Z.; Seok, M.; Seo, J.S. K-nearest neighbor hardware accelerator using in-memory computing SRAM. In Proceedings of the 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Lausanne, Switzerland, 29–31 July 2019. [Google Scholar] [CrossRef]
- Pedersen, R.; Schoeberl, M. An embedded support vector machine. In Proceedings of the 2006 International Workshop on Intelligent Solutions in Embedded Systems, Vienna, Austria, 30 June 2006; pp. 79–89. [Google Scholar] [CrossRef]
- You, Y.; Fu, H.; Song, S.L.; Randles, A.; Kerbyson, D.; Marquez, A.; Yang, G.; Hoisie, A. Scaling support vector machines on modern HPC platforms. J. Parallel Distrib. Comput. 2015, 76, 16–31. [Google Scholar] [CrossRef]
- Boni, A.; Pianegiani, F.; Petri, D. Low-power and low-cost implementation of SVMs for smart sensors. IEEE Trans. Instrum. Meas. 2007, 56, 39–44. [Google Scholar] [CrossRef]
- Afifi, S.M.; Gholamhosseini, H.; Sinha, R. Hardware implementations of SVM on FPGA: A state-of-the-art review of current practice. Int. J. Innov. Sci. Eng. Technol. 2015, 2, 733–752. [Google Scholar]
- Zeng, Z.Q.; Yu, H.B.; Xu, H.R.; Xie, Y.Q.; Gao, J. Fast training support vector machines using parallel sequential minimal optimization. In Proceedings of the 2008 3rd International Conference on Intelligent System and Knowledge Engineering, Xiamen, China, 17–19 November 2008; pp. 997–1001. [Google Scholar] [CrossRef]
- Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. Lect. Notes Comput. Sci. 2012, 7657, 216–223. [Google Scholar] [CrossRef]
- Kudo, T.; Matsumoto, Y. Chunking with support vector machines. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics 2001, Pittsburgh, PA, USA, 2–7 June 2001; pp. 1–8. [Google Scholar] [CrossRef]
- Osuna, E.; Freund, R.; Girosi, F. Improved training algorithm for support vector machines. Neural Networks for Signal Processing VII. In Proceedings of the 1997 IEEE Signal Processing Society Workshop, Amelia Island, FL, USA, 24–26 September 1997; pp. 276–285. [Google Scholar] [CrossRef]
- Lee, Y.J.; Mangasarian, O. RSVM: Reduced support vector machines. In Proceedings of the 2001 SIAM International Conference on Data Mining, Chicago, IL, USA, 5–7 April 2001; pp. 1–17. [Google Scholar] [CrossRef]
- Anguita, D.; Ghio, A.; Pischiutta, S.; Ridella, S. A hardware-friendly support vector machine for embedded automotive applications. In Proceedings of the 2007 International Joint Conference on Neural Networks, Orlando, FL, USA, 12–17 August 2007; pp. 1360–1364. [Google Scholar] [CrossRef]
- Anguita, D.; Bozza, G. The effect of quantization on support vector machines with Gaussian kernel. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005. [Google Scholar] [CrossRef]
- Khan, F.M.; Arnold, M.G.; Pottenger, W.M. Hardware-based support vector machine classification in logarithmic number systems. In Proceedings of the 2005 IEEE International Symposium on Circuits and Systems, Kobe, Japan, 23–26 May 2005; pp. 5154–5157. [Google Scholar] [CrossRef]
- Anguita, D.; Pischiutta, S.; Ridella, S.; Sterpi, D. Feed-forward support vector machine without multipliers. IEEE Trans. Neural Netw. 2006, 17, 1328–1331. [Google Scholar] [CrossRef]
- Reynolds, D. Gaussian mixture models. Encycl. Biometr. 2009, 741, 659–663. [Google Scholar] [CrossRef]
- Gorur, P.; Amrutur, B. Speeded up Gaussian mixture model algorithm for background subtraction. In Proceedings of the 2011 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Klagenfurt, Austria, 30 August–2 September 2011; pp. 386–391. [Google Scholar] [CrossRef]
- Shen, Y.; Hu, W.; Liu, J.; Yang, M.; Wei, B.; Chou, C.T. Efficient background subtraction for real-time tracking in embedded camera networks. In Proceedings of the 10th ACM Conference on Embedded Networked Sensor System, Toronto, ON, Canada, 6–9 November 2012; pp. 295–308. [Google Scholar] [CrossRef]
- Bottou, L. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science; Montavon, G., Orr, G.B., Müller, K.R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar] [CrossRef]
- Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 2013, 1, 1–9. [Google Scholar]
- Bottou, L. Stochastic gradient learning in neural networks. Proc. Neuro-Nîmes 1991, 8, 1–12. [Google Scholar]
- Li, L.; Zhang, S.; Wu, J. An efficient hardware architecture for activation function in deep learning processor. In Proceedings of the 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC), Chongqing, China, 27–29 June 2018; pp. 911–918. [Google Scholar] [CrossRef]
- Suda, N.; Chandra, V.; Dasika, G.; Mohanty, A.; Ma, Y.; Vrudhula, S.; Seo, J.S.; Cao, Y. Throughput-optimized OpenCL-based FPGA Accelerator for large-scale convolutional neural networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 16–25. [Google Scholar] [CrossRef]
- Lane, N.D.; Bhattacharya, S.; Georgiev, P.; Forlivesi, C. Squeezing deep learning into mobile and embedded devices. IEEE Pervasive Comput. 2017, 16, 82–88. [Google Scholar]
- Albawi, S.; Mohammed, T.A.; Al-Zawi, S. Understanding of a convolutional neural network. In Proceedings of the 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey, 21–23 August 2017; pp. 1–6. [Google Scholar] [CrossRef]
- O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. Available online: http://arxiv.org/abs/1511.08458 (accessed on 2 March 2021).
- Lawrence, S.; Giles, L.; Tsoi, C.; Back, A. Face recognition: A convolutional neural-network approach. IEEE Trans. Neural Netw. 1997, 8, 98–112. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long Short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Shah, S.; Haghi, B.; Kellis, S.; Bashford, L.; Kramer, D.; Lee, B.; Liu, C.; Andersen, R.; Emami, A. Decoding kinematics from human parietal cortex using neural networks. In Proceedings of the 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), San Francisco, CA, USA, 20–23 March 2019; pp. 1138–1141. [Google Scholar] [CrossRef]
- Lee, D.; Lim, M.; Park, H.; Kang, Y.; Park, J.S.; Jang, G.J.; Kim, J.H. Long short-term memory recurrent neural network-based acoustic model using connectionist temporal classification on a large-scale training corpus. Chin. Commun. 2017, 14, 23–31. [Google Scholar] [CrossRef]
- Yu, Y.; Si, X.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef] [PubMed]
- Khan, M.A.; Karim, M.R.; Kim, Y. A two-stage big data analytics framework with real world applications using spark machine learning and long short-term memory network. Symmetry 2018, 10, 485. [Google Scholar] [CrossRef]
- Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D. A domain-specific architecture for deep neural networks. Commun. ACM 2018, 61, 50–59. [Google Scholar] [CrossRef]
- Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. Lect. Notes Comput. Sci. 2014, 8689, 818–833. [Google Scholar] [CrossRef]
- Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning both weights and connections for efficient neural networks. In NIPS’15: Proceedings of the 28th International Conference on Neural Information Processing Systems; ACM: New York, NY, USA, 2015; Volume 1, pp. 1135–1143. [Google Scholar]
- Khoram, S.; Li, J. Adaptive quantization of neural networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–13. [Google Scholar]
- Al-Kofahi, M.M.; Al-Shorman, M.Y.; Al-Kofahi, O.M. Toward energy efficient microcontrollers and Internet-of-Things systems. Comput. Electr. Eng. 2019, 79. [Google Scholar] [CrossRef]
- Keras, A. Keras API Reference/Keras Applications. Available online: https://keras.io/api/applications/ (accessed on 14 March 2021).
- Atmel. ATMEL—ATmega48P/88P/168P/328P. Available online: https://www.sparkfun.com/datasheets/Components/SMD/ATMega328.pdf (accessed on 14 March 2021).
- Atmel Corporation. ATMEL—ATmega640/V-1280/V-1281/V-2560/V-2561/V. Available online: https://ww1.microchip.com/downloads/en/devicedoc/atmel-2549-8-bit-avr-microcontroller-atmega640-1280-1281-2560-2561_datasheet.pdf (accessed on 14 March 2021).
- STMicroelectronics. STM32L073x8 STM32L073xB. Available online: https://www.st.com/resource/en/datasheet/stm32l073v8.pdf (accessed on 15 March 2021).
- Atmel Corporation. 32-Bit ARM-Based Microcontrollers SAM D21E/SAM D21G/SAM D21J Summary. Available online: www.microchip.com (accessed on 15 March 2021).
- Atmel. SAM3X / SAM3A Series datasheet. Available online: http://www.atmel.com/Images/Atmel-11057-32-bit-Cortex-M3-Microcontroller-SAM3X-SAM3A_Datasheet.pdf (accessed on 15 March 2021).
- STMicroelectronics. STM32F215xx STM32F217xx. Available online: https://www.st.com/resource/en/datasheet/stm32f215re.pdf (accessed on 15 March 2021).
- STMicroelectronics. STM32F469xx. Available online: https://www.st.com/resource/en/datasheet/stm32f469ae.pdf (accessed on 15 March 2021).
- Raspberry Pi Dramble. Power Consumption Benchmarks. Available online: https://www.pidramble.com/wiki/benchmarks/power-consumption (accessed on 15 March 2021).
- The First Affordable RISC-V Computer Designed to Run Linux. Available online: https://www.seeedstudio.com/blog/2021/01/13/meet-beaglev-the-first-affordable-risc-v-single-board-computer-designed-to-run-linux/ (accessed on 20 April 2021).
- Lane, N.D.; Bhattacharya, S.; Georgiev, P.; Forlivesi, C.; Jiao, L.; Qendro, L.; Kawsar, F. DeepX: A Software accelerator for low-power deep learning inference on mobile devices. In Proceedings of the 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), Vienna, Austria, 11–14 April 2016. [Google Scholar] [CrossRef]
- Li, D.; Wang, X.; Kong, D. DeepRebirth: Accelerating deep neural network execution on mobile devices. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2017; pp. 2322–2330. [Google Scholar]
- Ren, T.I.; Cavalcanti, G.D.C.; Gabriel, D.; Pinheiro, H.N.B. A Hybrid GMM Speaker Verification System for Mobile Devices in Variable Environments. In Intelligent Data Engineering and Automated Learning—IDEAL 2012; Lecture Notes in Computer Science; Yin, H., Costa, J.A.F., Barreto, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar] [CrossRef]
- Lei, X.; Senior, A.; Gruenstein, A.; Sorensen, J. Accurate and compact large vocabulary speech recognition on mobile devices. In Proceedings of the Annual Conference of the International Speech Communication Association INTERSPEECH, Lyon, France, 25–29 August 2013; pp. 662–665. [Google Scholar]
- Sanchez-Iborra, R.; Skarmeta, A.F. TinyML-enabled frugal smart objects: Challenges and opportunities. IEEE Circuits Syst. Mag. 2020, 20, 4–18. [Google Scholar] [CrossRef]
- Park, J.; Naumov, M.; Basu, P.; Deng, S.; Kalaiah, A.; Khudia, D.; Law, J.; Malani, P.; Malevich, A.; Nadathur, S.; et al. Deep learning inference in facebook data centers: Characterization, performance optimizations and hardware implications. arXiv 2018, arXiv:1811.09886. [Google Scholar]
- Banbury, C.; Zhou, C.; Fedorov, I.; Matas, R.; Thakker, U.; Gope, D.; Janapa Reddi, V.; Mattina, M.; Whatmough, P. MicroNets: Neural network architectures for deploying TinyML Applications on commodity microcontrollers. In Proceedings of the 4th MLSys Conference, San Jose, CA, USA, 4–7 April 2021; Available online: https://proceedings.mlsys.org/paper/2021/file/a3c65c2974270fd093ee8a9bf8ae7d0b-Paper.pdf (accessed on 20 April 2021).
- NVIDIA. NVIDIA V100 Tensor Core GPU. Available online: https://www.nvidia.com/en-us/data-center/v100/ (accessed on 20 February 2021).
- NVIDIA. The Ultimate PC GPU Nvidia Titan RTX. Available online: https://www.nvidia.com/content/dam/en-zz/Solutions/titan/documents/titan-rtx-for-creators-us-nvidia-1011126-r6-web.pdf (accessed on 16 February 2021).
- ST Microelectronics. STM32F745xx STM32F746xx Datasheet. Available online: http://www.st.com/content/ccc/resource/technical/document/datasheet/96/ed/61/9b/e0/6c/45/0b/DM00166116.pdf/files/DM00166116.pdf/jcr:content/translations/en.DM00166116.pdf (accessed on 22 January 2021).
- ST Microelectronics Inc. STM32F765xx, STM32F767xx Datasheet. Available online: https://pdf1.alldatasheet.com/datasheet-pdf/view/933989/STMICROELECTRONICS/STM32F767ZI.html (accessed on 17 January 2021).
- Capra, M.; Bussolino, B.; Marchisio, A.; Shafique, M.; Masera, G.; Martina, M. An Updated survey of efficient hardware architectures for accelerating deep convolutional neural networks. Future Internet 2020, 12, 113. [Google Scholar] [CrossRef]
- Sun, S.; Cao, Z.; Zhu, H.; Zhao, J. A survey of optimization methods from a machine learning perspective. IEEE Trans. Cybern. 2020, 50, 3668–3681. [Google Scholar] [CrossRef] [PubMed]
- Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016; pp. 1–14. Available online: https://arxiv.org/abs/1510.00149 (accessed on 17 January 2021).
- Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. 2018, 18, 1–30. [Google Scholar]
- Tanaka, K.; Arikawa, Y.; Ito, T.; Morita, K.; Nemoto, N.; Miura, F.; Terada, K.; Teramoto, J.; Sakamoto, T. Communication-efficient distributed deep learning with GPU-FPGA heterogeneous computing. In Proceedings of the 2020 IEEE Symposium on High-Performance Interconnects (HOTI), Piscataway, NJ, USA, 19–21 August 2020; pp. 43–46. [Google Scholar] [CrossRef]
- Lane, N.; Bhattacharya, S.; Georgiev, P.; Forlivesi, C. Squeezing deep learning into mobile and embedded devices. IEEE Pervasive Comput. 2017, 16, 82–88. [Google Scholar] [CrossRef]
- Gysel, P. Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks. Available online: http://arxiv.org/abs/1605.06402 (accessed on 20 February 2021).
- Moons, B.; Goetschalckx, K.; van Berckelaer, N.; Verhelst, M. Minimum energy quantized neural networks. In Proceedings of the 2017 51st Asilomar Conference on Signals, Systems, and Computers ACSSC 2017, Pacific Grove, CA, USA, 29 October–1 November 2017; pp. 1921–1925. [Google Scholar] [CrossRef]
- Xu, C.; Kirk, S.R.; Jenkins, S. Tiling for performance tuning on different models of GPUs. In Proceedings of the 2009 Second International Symposium on Information Science and Engineering ISISE 2009, Shanghai, China, 26–28 December 2009; pp. 500–504. [Google Scholar] [CrossRef]
- Sun, F.; Li, X.; Wang, Q.; Tang, C. FPGA-based embedded system design. In Proceedings of the IEEE Asia-Pacific Conference Circuits Systems APCCAS, Macao, China, 30 November–3 December 2008. [Google Scholar] [CrossRef]
- Roth, W.; Schindler, G.; Zöhrer, M.; Pfeifenberger, L.; Peharz, R.; Tschiatschek, S.; Fröning, H.; Pernkopf, F.; Ghahramani, Z. Resource-Efficient Neural Networks for Embedded Systems. Available online: http://arxiv.org/abs/2001.03048 (accessed on 27 March 2021).
- Courbariaux, M.; Bengio, Y.; David, J.P. Low Precision Storage for Deep Learning. Available online: http://arxiv.org/abs/1511.00363; http://arxiv.org/abs/1412.7024 (accessed on 10 February 2021).
- Courbariaux, M.; David, J.P.; Bengio, Y. Training deep neural networks with low precision multiplications. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–10. Available online: https://arxiv.org/abs/1412.7024 (accessed on 20 February 2021).
- Tong, J.Y.F.; Nagle, D.; Rutenbar, R.A. Reducing power by optimizing the necessary precision/range of floating-point arithmetic. IEEE Trans. Very Large Scale Integr. Syst. 2000, 8, 273–286. [Google Scholar] [CrossRef]
- Tagliavini, G.; Mach, S.; Rossi, D.; Marongiu, A.; Benini, L. A transprecision floating-point platform for ultra-low power computing. In Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 19–23 March 2018; pp. 1051–1056. [Google Scholar] [CrossRef]
- Langroudi, S.H.F.; Pandit, T.; Kudithipudi, D. Deep learning inference on embedded devices: Fixed-point vs posit. In Proceedings of the 2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), Williamsburg, VA, USA, 25 March 2018; pp. 19–23. [Google Scholar] [CrossRef]
- Oberstar, E. Fixed-Point Representation & Fractional Math. Available online: http://www.superkits.net/whitepapers/Fixed%20Point%20Representation%20&%20Fractional%20Math.pdf (accessed on 2 February 2021).
- Yates, R. Fixed-point arithmetic: An introduction. Technical Reference. Available online: https://courses.cs.washington.edu/courses/cse467/08au/labs/l5/fp.pdf (accessed on 15 February 2021).
- Hwang, K.; Sung, W. Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In Proceedings of the 2014 IEEE Workshop on Signal Processing Systems (SiPS), Belfast, UK, 20–22 October 2014. [Google Scholar] [CrossRef]
- Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; Narayanan, P. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning ICML 2015, Lille, France, 6–11 July 2015; pp. 1737–1746. [Google Scholar]
- Gustafson, J.L.; Yonemoto, I. Beating floating point at its own game: Posit arithmetic. Supercomput. Front. Innov. 2017, 4, 71–86. [Google Scholar]
- Hammerstrom, D. A VLSI architecture for high-performance, low-cost, on-chip learning. In Proceedings of the 1990 IJCNN International Joint Conference on Neural Networks, San Diego, CA, USA, 17–21 June 1990; pp. 537–544. [Google Scholar] [CrossRef]
- Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. Available online: http://arxiv.org/abs/1602.02830 (accessed on 22 January 2021).
- Meng, W.; Gu, Z.; Zhang, M.; Wu, Z. Two-Bit Networks for Deep Learning on Resource-Constrained Embedded Devices. Available online: http://arxiv.org/abs/1701.00485 (accessed on 3 February 2021).
- Park, E.; Ahn, J.; Yoo, S. Weighted-entropy-based quantization for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5456–5464. [Google Scholar] [CrossRef]
- Burrascano, P. Learning vector quantization for the probabilistic neural network. IEEE Trans. Neural Netw. 1991, 2, 458–461. [Google Scholar] [CrossRef]
- Mittal, A.; Tiku, S.; Pasricha, S. Adapting convolutional neural networks for indoor localization with smart mobile devices. In Proceedings of the 2018 Great Lakes Symposium on VLSI (GLSVLSI ’18), Chicago, IL, USA, 23–25 May 2018; pp. 117–122. [Google Scholar] [CrossRef]
- Hu, R.; Tian, B.; Yin, S.; Wei, S. Efficient hardware architecture of softmax layer in deep neural network. In Proceedings of the 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), Shanghai, China, 19–21 November 2018; pp. 323–326. [Google Scholar] [CrossRef]
- Hennessy, J.L.; Patterson, D.A. A new golden age for computer architecture. Commun. ACM 2019, 62, 48–60. [Google Scholar] [CrossRef]
- Kim, R.G.; Doppa, J.R.; Pande, P.P.; Marculescu, D.; Marculescu, R. Machine learning and manycore systems design: A Serendipitous symbiosis. Computer 2018, 51, 66–77. [Google Scholar] [CrossRef]
- Kim, R.G.; Doppa, J.R.; Pande, P.P. Machine learning for design space exploration and optimization of manycore systems. In Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, CA, USA, 5–8 November 2018. [Google Scholar] [CrossRef]
- Vazquez, R.; Gordon-Ross, A.; Stitt, G. Machine learning-based prediction for dynamic architectural optimizations. In Proceedings of the 10th International Green and Sustainability Computing Conference IGSC 2019, Alexandria, VA, USA, 21–24 October 2019; pp. 1–6. [Google Scholar] [CrossRef]
- Papp, D.; Ma, Z.; Buttyan, L. Embedded systems security: Threats, vulnerabilities, and attack taxonomy. In Proceedings of the 2015 13th Annual Conference on Privacy, Security and Trust (PST), Izmir, Turkey, 21–23 July 2015; pp. 145–152. [Google Scholar] [CrossRef]
- Ogbebor, J.O.; Imoize, A.L.; Atayero, A.A.-A. Energy Efficient Design Techniques in Next-Generation Wireless Communication Networks: Emerging Trends and Future Directions. Wirel. Commun. Mob. Comput. 2020, 2020, 19. [Google Scholar] [CrossRef]
- Imoize, A.L.; Ibhaze, A.E.; Atayero, A.A.; Kavitha, K.V.N. Standard Propagation Channel Models for MIMO Communication Systems. Wirel. Commun. Mob. Comput. 2021, 2021, 36. [Google Scholar] [CrossRef]
- Popoola, S.I.; Jefia, A.; Atayero, A.A.; Kingsley, O.; Faruk, N.; Oseni, O.F.; Abolade, R.O. Determination of neural network parameters for path loss prediction in very high frequency wireless channel. IEEE Access 2019, 7, 150462–150483. [Google Scholar] [CrossRef]
- Faruk, N.; Popoola, S.I.; Surajudeen-Bakinde, N.T.; Oloyede, A.A.; Abdulkarim, A.; Olawoyin, L.A.; Ali, M.; Calafate, C.T.; Atayero, A.A. Path loss predictions in the VHF and UHF bands within urban environments: Experimental investigation of empirical, heuristics and geospatial models. IEEE Access 2019, 7, 77293–77307. [Google Scholar] [CrossRef]
- Pasricha, S.; Nikdast, M. A Survey of Silicon Photonics for Energy-Efficient Manycore Computing. IEEE Des. Test 2020, 37, 60–81. [Google Scholar] [CrossRef]
- Soref, R. The past, present, and future of silicon photonics. IEEE J. Sel. Top. Quantum Electron. 2006, 12, 1678–1687. [Google Scholar] [CrossRef]
- Chittamuru, S.V.R.; Dang, D.; Pasricha, S.; Mahapatra, R. BiGNoC: Accelerating big data computing with application-specific photonic network-on-chip architectures. IEEE Trans. Parallel Distrib. Syst. 2018, 29, 2402–2415. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).